* [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-03-26 15:02 Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
` (5 more replies)
0 siblings, 6 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
introducing SMT awareness.
= Problem =
Nominal per-logical-CPU capacity can overstate the usable compute when an SMT
sibling is busy, because the physical core does not deliver its full nominal
capacity to each thread. As a result, several SD_ASYM_CPUCAPACITY paths may
pick high-capacity CPUs that are not actually good destinations.
= Proposed Solution =
This patch set aligns those paths with a simple rule already used
elsewhere: when SMT is active, prefer fully idle cores and avoid treating
partially idle SMT siblings as full-capacity targets where that would
mislead load balance.
Patch set summary:
- [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Prefer fully-idle SMT cores in asym-capacity idle selection. In the
wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
idle selection can prefer CPUs on fully idle cores, with a safe fallback.
- [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
Provided for consistency with PATCH 1/4.
- [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
consistency with PATCH 1/4. I've also tested with/without
/proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
noticed any regression.
- [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
When choosing the housekeeping CPU that runs the idle load balancer,
prefer an idle CPU on a fully idle core so migrated work lands where
effective capacity is available.
The change is still consistent with the same "avoid CPUs with a busy
sibling" logic and it shows some benefits on Vera, but it could have a
negative impact on other systems; I'm including it for completeness
(feedback is appreciated).
This patch set has been tested on the new NVIDIA Vera Rubin platform, where
SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.
Alternative approaches have been evaluated, such as equalizing CPU
capacities, either by exposing uniform values via firmware (ACPI/CPPC) or by
normalizing them in the kernel by grouping CPUs within a small capacity
window (+/-5%) [1][2], or enabling asym packing [3].
However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
results so far. Improving this policy also seems worthwhile in general, as
other platforms in the future may enable SMT with asymmetric CPU
topologies.
[1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
[2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
[3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
Andrea Righi (4):
sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
kernel/sched/fair.c | 163 +++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/topology.c | 9 ---
2 files changed, 147 insertions(+), 25 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 10:44 ` K Prateek Nayak
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
` (4 subsequent siblings)
5 siblings, 2 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.
Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, prefer fully-idle SMT cores over partially-idle ones. A
two-phase selection first tries only CPUs on fully idle cores, then
falls back to any idle CPU if none fit.
Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully idle cores when available leads to more accurate
capacity usage on task wakeup.
On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads running a number of tasks equal to the number of
SMT cores.
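The asym_fits_cpu() change below reduces to a simple predicate, sketched
here as a user-space model (illustrative only; toy_asym_fits() is a
hypothetical name and the kernel's sched_asym_cpucap_active(),
sched_smt_active(), is_core_idle() and util_fits_cpu() are collapsed
into plain parameters):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * On asymmetric-capacity systems with SMT, a CPU is a valid wakeup
 * target only if the task fully fits (util_fits > 0) AND the whole
 * core is idle; without asymmetric capacities every CPU qualifies.
 */
static bool toy_asym_fits(bool asym_active, bool smt_active,
			  bool core_idle, int util_fits)
{
	if (!asym_active)
		return true;
	return (!smt_active || core_idle) && util_fits > 0;
}
```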
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 75 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d57c02e82f3a1..9a95628669851 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
+ *
+ * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
+ * CPUs on fully-idle cores over partially-idle ones in a single pass: track
+ * the best candidate among idle-core CPUs and the best among any idle CPU,
+ * then return the idle-core candidate if found, else the best any-idle.
*/
static int
-select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
+ bool prefer_idle_cores)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
- int cpu, best_cpu = -1;
+ unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
+ int fits, best_fits = 0, best_fits_core = 0;
+ int cpu, best_cpu = -1, best_cpu_core = -1;
struct cpumask *cpus;
+ bool on_idle_core;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!choose_idle_cpu(cpu, p))
continue;
+ on_idle_core = is_core_idle(cpu);
+ if (prefer_idle_cores && !on_idle_core) {
+ /* Track best among any idle CPU for fallback */
+ fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+ if (fits > 0) {
+ /*
+ * Full fit: strictly better than fits 0 / -1;
+ * among several, prefer higher capacity.
+ */
+ if (best_cpu < 0 || best_fits <= 0 ||
+ (best_fits > 0 && cpu_cap > best_cap)) {
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ }
+ continue;
+ }
+ if (best_fits > 0)
+ continue;
+ if (fits < 0)
+ cpu_cap = get_actual_cpu_capacity(cpu);
+ if ((fits < best_fits) ||
+ ((fits == best_fits) && (cpu_cap > best_cap))) {
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ }
+ continue;
+ }
+
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
- return cpu;
+ if (fits > 0) {
+ if (prefer_idle_cores && on_idle_core)
+ return cpu;
+ if (!prefer_idle_cores)
+ return cpu;
+ /*
+ * Prefer idle cores: record and keep looking for
+ * idle-core fit.
+ */
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ continue;
+ }
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
* Look for the CPU with best capacity.
*/
- else if (fits < 0)
+ if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
/*
@@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
best_cpu = cpu;
best_fits = fits;
}
+ if (prefer_idle_cores && on_idle_core &&
+ ((fits < best_fits_core) ||
+ ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
+ best_cap_core = cpu_cap;
+ best_cpu_core = cpu;
+ best_fits_core = fits;
+ }
}
+ if (prefer_idle_cores && best_cpu_core >= 0)
+ return best_cpu_core;
return best_cpu;
}
@@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
@@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* capacity path.
*/
if (sd) {
- i = select_idle_capacity(p, sd, target);
- return ((unsigned)i < nr_cpumask_bits) ? i : target;
+ i = select_idle_capacity(p, sd, target,
+ sched_smt_active() && test_idle_cores(target));
+ return ((unsigned int)i < nr_cpumask_bits) ? i : target;
}
}
--
2.53.0
* [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
` (3 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.
If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
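The extended gate can be modeled as a stand-alone predicate (illustrative
only; toy_misfit_dst_ok() is a hypothetical name, and the kernel's
capacity_greater(), which applies a margin, is simplified here to a
strict comparison):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Reject dst_cpu as a misfit destination when SMT is active and its
 * core is not fully idle, or when it offers no capacity upgrade over
 * the source group's max capacity.
 */
static bool toy_misfit_dst_ok(bool smt_active, bool dst_core_idle,
			      unsigned long dst_cap,
			      unsigned long src_max_cap)
{
	if (smt_active && !dst_core_idle)
		return false;
	return dst_cap > src_max_cap;
}
```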
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a95628669851..f8deaaa5bfc85 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10819,10 +10819,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* We can use max_capacity here as reduction in capacity on some
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
+ *
+ * When SMT is active, only pull a misfit to dst_cpu if it is on a
+ * fully idle core; otherwise the effective capacity of the core is
+ * reduced and we may not actually provide more capacity than the
+ * source.
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
+ !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
--
2.53.0
* [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
` (2 subsequent siblings)
5 siblings, 1 reply; 24+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
active. This allows EAS to be enabled and perf-domain setup to succeed on
SD_ASYM_CPUCAPACITY topologies with SMT enabled.
Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
preference as the non-EAS wakeup path: when SMT is active and there is a
fully-idle core in the relevant domain, prefer max-spare-capacity
candidates on fully-idle cores. Otherwise, fall back to the prior
behavior, which also considers partially-idle SMT siblings.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/topology.c | 9 --------
2 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8deaaa5bfc85..593a89f688679 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_task_busy_time(&eenv, p, prev_cpu);
for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_actual_cap, util;
long prev_spare_cap = -1, max_spare_cap = -1;
+ long max_spare_cap_fallback = -1;
unsigned long rq_util_min, rq_util_max;
unsigned long cur_delta, base_energy;
- int max_spare_cap_cpu = -1;
+ int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
int fits, max_fits = -1;
+ int max_fits_fallback = -1;
+ bool prefer_idle_cores;
if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
continue;
@@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.cpu_cap = cpu_actual_cap;
eenv.pd_cap = 0;
+ prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
+
for_each_cpu(cpu, cpus) {
struct rq *rq = cpu_rq(cpu);
@@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
+ if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
+ goto fallback;
+
+ unsigned long util_min = p_util_min, util_max = p_util_max;
+
util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
@@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
max_spare_cap_cpu = cpu;
max_fits = fits;
}
+
+fallback:
+ if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
+ continue;
+
+ util_min = p_util_min;
+ util_max = p_util_max;
+ util = cpu_util(cpu, p, cpu, 0);
+ cpu_cap = capacity_of(cpu);
+
+ if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+ rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+ rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+
+ util_min = max(rq_util_min, p_util_min);
+ util_max = max(rq_util_max, p_util_max);
+ }
+
+ fits = util_fits_cpu(util, util_min, util_max, cpu);
+ if (!fits)
+ continue;
+
+ lsub_positive(&cpu_cap, util);
+
+ if ((fits > max_fits_fallback) ||
+ ((fits == max_fits_fallback) &&
+ ((long)cpu_cap > max_spare_cap_fallback))) {
+ max_spare_cap_fallback = cpu_cap;
+ max_spare_cap_cpu_fallback = cpu;
+ max_fits_fallback = fits;
+ }
+ }
+
+ if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
+ max_spare_cap = max_spare_cap_fallback;
+ max_spare_cap_cpu = max_spare_cap_cpu_fallback;
+ max_fits = max_fits_fallback;
}
if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f5552..cb060fe56aec1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
return false;
}
- /* EAS definitely does *not* handle SMT */
- if (sched_smt_active()) {
- if (sched_debug()) {
- pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
- cpumask_pr_args(cpu_mask));
- }
- return false;
- }
-
if (!arch_scale_freq_invariant()) {
if (sched_debug()) {
pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
--
2.53.0
* [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (2 preceding siblings ...)
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
2026-03-27 13:44 ` Shrikanth Hegde
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
2026-03-27 16:31 ` Shrikanth Hegde
5 siblings, 2 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
When choosing which idle housekeeping CPU runs the idle load balancer,
prefer one on a fully idle core if SMT is active, so balance can migrate
work onto a CPU that still offers full effective capacity. Fall back to
any idle candidate if none qualify.
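The find_new_ilb() change below amounts to a scan with an early return
and a remembered fallback, sketched here in user space (hypothetical
names; CONFIG_SCHED_SMT, the housekeeping mask and smp_processor_id()
are collapsed into plain parameters):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan housekeeping CPUs, remember the first idle one as a fallback,
 * and return early on the first idle CPU whose whole core is idle.
 */
struct hk_cpu {
	bool idle;
	bool core_idle;	/* SMT sibling(s) idle as well */
};

static int toy_find_new_ilb(const struct hk_cpu *cpus, size_t nr,
			    int this_cpu, bool smt_active)
{
	int fallback = -1;

	for (size_t i = 0; i < nr; i++) {
		if ((int)i == this_cpu || !cpus[i].idle)
			continue;
		if (fallback < 0)
			fallback = (int)i;
		if (!smt_active || cpus[i].core_idle)
			return (int)i;
	}
	return fallback;
}
```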
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 593a89f688679..a1ee21f7b32f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
* - When one of the busy CPUs notices that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
+ *
+ * - When SMT is active, prefer a CPU on a fully idle core as the ILB
+ * target, so that when it runs balance it becomes the destination CPU
+ * and can accept migrated tasks with full effective capacity.
*/
static inline int find_new_ilb(void)
{
const struct cpumask *hk_mask;
- int ilb_cpu;
+ int ilb_cpu, fallback = -1;
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
@@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
if (ilb_cpu == smp_processor_id())
continue;
+#ifdef CONFIG_SCHED_SMT
+ if (!idle_cpu(ilb_cpu))
+ continue;
+
+ if (fallback < 0)
+ fallback = ilb_cpu;
+
+ if (!sched_smt_active() || is_core_idle(ilb_cpu))
+ return ilb_cpu;
+#else
if (idle_cpu(ilb_cpu))
return ilb_cpu;
+#endif
}
- return -1;
+ return fallback;
}
/*
--
2.53.0
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (3 preceding siblings ...)
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
@ 2026-03-26 16:33 ` Christian Loehle
2026-03-27 6:52 ` Andrea Righi
2026-03-27 16:31 ` Shrikanth Hegde
5 siblings, 1 reply; 24+ messages in thread
From: Christian Loehle @ 2026-03-26 16:33 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On 3/26/26 15:02, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
>
> = Problem =
>
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
>
> = Proposed Solution =
>
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
>
> Patch set summary:
>
> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>
> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>
> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>
> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> Provided for consistency with PATCH 1/4.
>
> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>
> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> consistency with PATCH 1/4. I've also tested with/without
> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> noticed any regression.
There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
There's no EAS with it?
(To be more precise, CPPC should bail out of building an artificial EM if there's no
or only one efficiency class:
drivers/cpufreq/cppc_cpufreq.c:
if (bitmap_weight(used_classes, 256) <= 1) {
pr_debug("Efficiency classes are all equal (=%d). "
"No EM registered", class);
return;
}
This is the case, right?
> [snip]
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
@ 2026-03-27 6:52 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 6:52 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, Mar 26, 2026 at 04:33:08PM +0000, Christian Loehle wrote:
> On 3/26/26 15:02, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> >
> > = Problem =
> >
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> >
> > = Proposed Solution =
> >
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> >
> > Patch set summary:
> >
> > - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> >
> > Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> > wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> > idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> >
> > - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> >
> > Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> > Provided for consistency with PATCH 1/4.
> >
> > - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> >
> > Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> > consistency with PATCH 1/4. I've also tested with/without
> > /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> > noticed any regression.
>
>
> There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
> There's no EAS with it?
> (To be more precise, CPPC should bail out of building an artificial EM if there's no
> or only one efficiency class:
> drivers/cpufreq/cppc_cpufreq.c:
>
> if (bitmap_weight(used_classes, 256) <= 1) {
> pr_debug("Efficiency classes are all equal (=%d). "
> "No EM registered", class);
> return;
> }
>
> This is the case, right?
Yes, that's correct, so my testing on Vera with EAS isn't that meaningful.
Thanks,
-Andrea
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:46 ` Andrea Righi
2026-03-27 10:44 ` K Prateek Nayak
1 sibling, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> two-phase selection first tries only CPUs on fully idle cores, then
> falls back to any idle CPU if none fit.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d57c02e82f3a1..9a95628669851 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> + *
> + * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
> + * CPUs on fully-idle cores over partially-idle ones in a single pass: track
> + * the best candidate among idle-core CPUs and the best among any idle CPU,
> + * then return the idle-core candidate if found, else the best any-idle.
> */
> static int
> -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> + bool prefer_idle_cores)
> {
> - unsigned long task_util, util_min, util_max, best_cap = 0;
> - int fits, best_fits = 0;
> - int cpu, best_cpu = -1;
> + unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
> + int fits, best_fits = 0, best_fits_core = 0;
> + int cpu, best_cpu = -1, best_cpu_core = -1;
> struct cpumask *cpus;
> + bool on_idle_core;
>
> cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> if (!choose_idle_cpu(cpu, p))
> continue;
>
> + on_idle_core = is_core_idle(cpu);
> + if (prefer_idle_cores && !on_idle_core) {
> + /* Track best among any idle CPU for fallback */
> + fits = util_fits_cpu(task_util, util_min, util_max, cpu);
fits = util_fits_cpu(task_util, util_min, util_max, cpu); is always
called, so call it once above this if condition. This will help factorize
the selection of best_cpu and best_cpu_core.
> + if (fits > 0) {
> + /*
> + * Full fit: strictly better than fits 0 / -1;
> + * among several, prefer higher capacity.
> + */
> + if (best_cpu < 0 || best_fits <= 0 ||
> + (best_fits > 0 && cpu_cap > best_cap)) {
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + }
> + continue;
> + }
> + if (best_fits > 0)
> + continue;
> + if (fits < 0)
> + cpu_cap = get_actual_cpu_capacity(cpu);
> + if ((fits < best_fits) ||
> + ((fits == best_fits) && (cpu_cap > best_cap))) {
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + }
> + continue;
> + }
> +
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> - return cpu;
> + if (fits > 0) {
> + if (prefer_idle_cores && on_idle_core)
> + return cpu;
> + if (!prefer_idle_cores)
> + return cpu;
> + /*
> + * Prefer idle cores: record and keep looking for
> + * idle-core fit.
> + */
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + continue;
> + }
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> * Look for the CPU with best capacity.
> */
> - else if (fits < 0)
> + if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
>
> /*
> @@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> best_cpu = cpu;
> best_fits = fits;
> }
> + if (prefer_idle_cores && on_idle_core &&
> + ((fits < best_fits_core) ||
> + ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
> + best_cap_core = cpu_cap;
> + best_cpu_core = cpu;
> + best_fits_core = fits;
> + }
> }
>
> + if (prefer_idle_cores && best_cpu_core >= 0)
> + return best_cpu_core;
> return best_cpu;
> }
>
> @@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> @@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * capacity path.
> */
> if (sd) {
> - i = select_idle_capacity(p, sd, target);
> - return ((unsigned)i < nr_cpumask_bits) ? i : target;
> + i = select_idle_capacity(p, sd, target,
> + sched_smt_active() && test_idle_cores(target));
Move "sched_smt_active() && test_idle_cores(target)" inside
select_idle_capacity(). I don't see the benefit of making it a
parameter; alternatively, use has_idle_core for the parameter, like
other SMT-related functions.
> + return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> }
> }
>
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
@ 2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:45 ` Andrea Righi
0 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
> active. This allows EAS to be enabled and perf-domain setup to succeed on
> SD_ASYM_CPUCAPACITY topologies with SMT enabled.
I don't think that we want to enable EAS with SMT. So keep EAS and SMT
exclusive, at least for now
>
> Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
> preference as the non-EAS wakeup path: when SMT is active and there is a
> fully-idle core in the relevant domain, prefer max-spare-capacity
> candidates on fully-idle cores. Otherwise, fall back to the prior
> behavior, which also includes partially-idle SMT siblings.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
> kernel/sched/topology.c | 9 --------
> 2 files changed, 48 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8deaaa5bfc85..593a89f688679 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> eenv_task_busy_time(&eenv, p, prev_cpu);
>
> for (; pd; pd = pd->next) {
> - unsigned long util_min = p_util_min, util_max = p_util_max;
> unsigned long cpu_cap, cpu_actual_cap, util;
> long prev_spare_cap = -1, max_spare_cap = -1;
> + long max_spare_cap_fallback = -1;
> unsigned long rq_util_min, rq_util_max;
> unsigned long cur_delta, base_energy;
> - int max_spare_cap_cpu = -1;
> + int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
> int fits, max_fits = -1;
> + int max_fits_fallback = -1;
> + bool prefer_idle_cores;
>
> if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
> continue;
> @@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> eenv.cpu_cap = cpu_actual_cap;
> eenv.pd_cap = 0;
>
> + prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
> +
> for_each_cpu(cpu, cpus) {
> struct rq *rq = cpu_rq(cpu);
>
> @@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
>
> + if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
> + goto fallback;
> +
> + unsigned long util_min = p_util_min, util_max = p_util_max;
> +
> util = cpu_util(cpu, p, cpu, 0);
> cpu_cap = capacity_of(cpu);
>
> @@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> max_spare_cap_cpu = cpu;
> max_fits = fits;
> }
> +
> +fallback:
> + if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
> + continue;
> +
> + util_min = p_util_min;
> + util_max = p_util_max;
> + util = cpu_util(cpu, p, cpu, 0);
> + cpu_cap = capacity_of(cpu);
> +
> + if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> + rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> + rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +
> + util_min = max(rq_util_min, p_util_min);
> + util_max = max(rq_util_max, p_util_max);
> + }
> +
> + fits = util_fits_cpu(util, util_min, util_max, cpu);
> + if (!fits)
> + continue;
> +
> + lsub_positive(&cpu_cap, util);
> +
> + if ((fits > max_fits_fallback) ||
> + ((fits == max_fits_fallback) &&
> + ((long)cpu_cap > max_spare_cap_fallback))) {
> + max_spare_cap_fallback = cpu_cap;
> + max_spare_cap_cpu_fallback = cpu;
> + max_fits_fallback = fits;
> + }
> + }
> +
> + if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
> + max_spare_cap = max_spare_cap_fallback;
> + max_spare_cap_cpu = max_spare_cap_cpu_fallback;
> + max_fits = max_fits_fallback;
> }
>
> if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c85f5552..cb060fe56aec1 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> return false;
> }
>
> - /* EAS definitely does *not* handle SMT */
> - if (sched_smt_active()) {
> - if (sched_debug()) {
> - pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> - cpumask_pr_args(cpu_mask));
> - }
> - return false;
> - }
> -
> if (!arch_scale_freq_invariant()) {
> if (sched_debug()) {
> pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
@ 2026-03-27 8:45 ` Vincent Guittot
2026-03-27 9:44 ` Andrea Righi
2026-03-27 13:44 ` Shrikanth Hegde
1 sibling, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:45 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
This one isn't straightforward for me. The ILB CPU will check all
other idle CPUs first and finish with itself, so unless the next CPU in
the idle_cpus_mask is a sibling, this should not make a difference.
Did you see any perf diff?
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
you can probably get rid of the CONFIG and put this special case below
sched_smt_active()
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 8:45 ` Vincent Guittot
@ 2026-03-27 9:44 ` Andrea Righi
2026-03-27 11:34 ` K Prateek Nayak
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 9:44 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Vincent,
On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > When choosing which idle housekeeping CPU runs the idle load balancer,
> > prefer one on a fully idle core if SMT is active, so balance can migrate
> > work onto a CPU that still offers full effective capacity. Fall back to
> > any idle candidate if none qualify.
>
> This one isn't straightforward for me. The ILB CPU will check all
> other idle CPUs first and finish with itself, so unless the next CPU in
> the idle_cpus_mask is a sibling, this should not make a difference.
>
> Did you see any perf diff?
I actually see a benefit: with the first patch applied I see a ~1.76x
speedup, and if I add this on top I get a ~1.9x speedup vs baseline, which
seems pretty consistent across runs (definitely not within the error range).
The intention with this change was to minimize SMT noise by running the ILB
code on a fully-idle core when possible, but I also didn't expect to see
such a big difference.
I'll investigate more to better understand what's happening.
>
>
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 593a89f688679..a1ee21f7b32f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> > * - When one of the busy CPUs notices that there may be an idle rebalancing
> > * needed, they will kick the idle load balancer, which then does idle
> > * load balancing for all the idle CPUs.
> > + *
> > + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> > + * target, so that when it runs balance it becomes the destination CPU
> > + * and can accept migrated tasks with full effective capacity.
> > */
> > static inline int find_new_ilb(void)
> > {
> > const struct cpumask *hk_mask;
> > - int ilb_cpu;
> > + int ilb_cpu, fallback = -1;
> >
> > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> >
> > @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> > if (ilb_cpu == smp_processor_id())
> > continue;
> >
> > +#ifdef CONFIG_SCHED_SMT
>
> you can probably get rid of the CONFIG and put this special case below
> sched_smt_active()
Ah good point, will change this.
>
>
> > + if (!idle_cpu(ilb_cpu))
> > + continue;
> > +
> > + if (fallback < 0)
> > + fallback = ilb_cpu;
> > +
> > + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> > + return ilb_cpu;
> > +#else
> > if (idle_cpu(ilb_cpu))
> > return ilb_cpu;
> > +#endif
> > }
> >
> > - return -1;
> > + return fallback;
> > }
> >
> > /*
> > --
> > 2.53.0
> >
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 9:45 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 9:45 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Fri, Mar 27, 2026 at 09:09:35AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
> > active. This allows EAS to be enabled and perf-domain setup to succeed on
> > SD_ASYM_CPUCAPACITY topologies with SMT enabled.
>
> I don't think that we want to enable EAS with SMT. So keep EAS and SMT
> exclusive, at least for now
Ack.
Thanks,
-Andrea
>
>
> >
> > Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
> > preference as the non-EAS wakeup path: when SMT is active and there is a
> > fully-idle core in the relevant domain, prefer max-spare-capacity
> > candidates on fully-idle cores. Otherwise, fall back to the prior
> > behavior, which also includes partially-idle SMT siblings.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
> > kernel/sched/topology.c | 9 --------
> > 2 files changed, 48 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f8deaaa5bfc85..593a89f688679 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > eenv_task_busy_time(&eenv, p, prev_cpu);
> >
> > for (; pd; pd = pd->next) {
> > - unsigned long util_min = p_util_min, util_max = p_util_max;
> > unsigned long cpu_cap, cpu_actual_cap, util;
> > long prev_spare_cap = -1, max_spare_cap = -1;
> > + long max_spare_cap_fallback = -1;
> > unsigned long rq_util_min, rq_util_max;
> > unsigned long cur_delta, base_energy;
> > - int max_spare_cap_cpu = -1;
> > + int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
> > int fits, max_fits = -1;
> > + int max_fits_fallback = -1;
> > + bool prefer_idle_cores;
> >
> > if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
> > continue;
> > @@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > eenv.cpu_cap = cpu_actual_cap;
> > eenv.pd_cap = 0;
> >
> > + prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
> > +
> > for_each_cpu(cpu, cpus) {
> > struct rq *rq = cpu_rq(cpu);
> >
> > @@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > continue;
> >
> > + if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
> > + goto fallback;
> > +
> > + unsigned long util_min = p_util_min, util_max = p_util_max;
> > +
> > util = cpu_util(cpu, p, cpu, 0);
> > cpu_cap = capacity_of(cpu);
> >
> > @@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > max_spare_cap_cpu = cpu;
> > max_fits = fits;
> > }
> > +
> > +fallback:
> > + if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
> > + continue;
> > +
> > + util_min = p_util_min;
> > + util_max = p_util_max;
> > + util = cpu_util(cpu, p, cpu, 0);
> > + cpu_cap = capacity_of(cpu);
> > +
> > + if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> > + rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> > + rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> > +
> > + util_min = max(rq_util_min, p_util_min);
> > + util_max = max(rq_util_max, p_util_max);
> > + }
> > +
> > + fits = util_fits_cpu(util, util_min, util_max, cpu);
> > + if (!fits)
> > + continue;
> > +
> > + lsub_positive(&cpu_cap, util);
> > +
> > + if ((fits > max_fits_fallback) ||
> > + ((fits == max_fits_fallback) &&
> > + ((long)cpu_cap > max_spare_cap_fallback))) {
> > + max_spare_cap_fallback = cpu_cap;
> > + max_spare_cap_cpu_fallback = cpu;
> > + max_fits_fallback = fits;
> > + }
> > + }
> > +
> > + if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
> > + max_spare_cap = max_spare_cap_fallback;
> > + max_spare_cap_cpu = max_spare_cap_cpu_fallback;
> > + max_fits = max_fits_fallback;
> > }
> >
> > if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 061f8c85f5552..cb060fe56aec1 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> > return false;
> > }
> >
> > - /* EAS definitely does *not* handle SMT */
> > - if (sched_smt_active()) {
> > - if (sched_debug()) {
> > - pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> > - cpumask_pr_args(cpu_mask));
> > - }
> > - return false;
> > - }
> > -
> > if (!arch_scale_freq_invariant()) {
> > if (sched_debug()) {
> > pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> > --
> > 2.53.0
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 9:46 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 9:46 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Vincent,
On Fri, Mar 27, 2026 at 09:09:24AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> > two-phase selection first tries only CPUs on fully idle cores, then
> > falls back to any idle CPU if none fit.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running an amount of tasks equal to the amount of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 75 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d57c02e82f3a1..9a95628669851 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > * maximize capacity.
> > + *
> > + * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
> > + * CPUs on fully-idle cores over partially-idle ones in a single pass: track
> > + * the best candidate among idle-core CPUs and the best among any idle CPU,
> > + * then return the idle-core candidate if found, else the best any-idle.
> > */
> > static int
> > -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> > + bool prefer_idle_cores)
> > {
> > - unsigned long task_util, util_min, util_max, best_cap = 0;
> > - int fits, best_fits = 0;
> > - int cpu, best_cpu = -1;
> > + unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
> > + int fits, best_fits = 0, best_fits_core = 0;
> > + int cpu, best_cpu = -1, best_cpu_core = -1;
> > struct cpumask *cpus;
> > + bool on_idle_core;
> >
> > cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > @@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > if (!choose_idle_cpu(cpu, p))
> > continue;
> >
> > + on_idle_core = is_core_idle(cpu);
> > + if (prefer_idle_cores && !on_idle_core) {
> > + /* Track best among any idle CPU for fallback */
> > + fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> fits = util_fits_cpu(task_util, util_min, util_max, cpu); is always
> called, so call it once above this if condition.
>
> This will help factorize the selection of best_cpu and best_cpu_core.
Makes sense.
>
> > + if (fits > 0) {
> > + /*
> > + * Full fit: strictly better than fits 0 / -1;
> > + * among several, prefer higher capacity.
> > + */
> > + if (best_cpu < 0 || best_fits <= 0 ||
> > + (best_fits > 0 && cpu_cap > best_cap)) {
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + }
> > + continue;
> > + }
> > + if (best_fits > 0)
> > + continue;
> > + if (fits < 0)
> > + cpu_cap = get_actual_cpu_capacity(cpu);
> > + if ((fits < best_fits) ||
> > + ((fits == best_fits) && (cpu_cap > best_cap))) {
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + }
> > + continue;
> > + }
> > +
> > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > - return cpu;
> > + if (fits > 0) {
> > + if (prefer_idle_cores && on_idle_core)
> > + return cpu;
> > + if (!prefer_idle_cores)
> > + return cpu;
> > + /*
> > + * Prefer idle cores: record and keep looking for
> > + * idle-core fit.
> > + */
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + continue;
> > + }
> > /*
> > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > * Look for the CPU with best capacity.
> > */
> > - else if (fits < 0)
> > + if (fits < 0)
> > cpu_cap = get_actual_cpu_capacity(cpu);
> >
> > /*
> > @@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > best_cpu = cpu;
> > best_fits = fits;
> > }
> > + if (prefer_idle_cores && on_idle_core &&
> > + ((fits < best_fits_core) ||
> > + ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
> > + best_cap_core = cpu_cap;
> > + best_cpu_core = cpu;
> > + best_fits_core = fits;
> > + }
> > }
> >
> > + if (prefer_idle_cores && best_cpu_core >= 0)
> > + return best_cpu_core;
> > return best_cpu;
> > }
> >
> > @@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> > unsigned long util_max,
> > int cpu)
> > {
> > - if (sched_asym_cpucap_active())
> > + if (sched_asym_cpucap_active()) {
> > /*
> > * Return true only if the cpu fully fits the task requirements
> > * which include the utilization and the performance hints.
> > + *
> > + * When SMT is active, also require that the core has no busy
> > + * siblings.
> > */
> > - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + return (!sched_smt_active() || is_core_idle(cpu)) &&
> > + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + }
> >
> > return true;
> > }
> > @@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > * capacity path.
> > */
> > if (sd) {
> > - i = select_idle_capacity(p, sd, target);
> > - return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > + i = select_idle_capacity(p, sd, target,
> > + sched_smt_active() && test_idle_cores(target));
>
> Move "sched_smt_active() && test_idle_cores(target)" inside
> select_idle_capacity(). I don't see the benefit of making it a
> parameter; alternatively, use has_idle_core for the parameter, like
> other SMT-related functions.
And also makes sense.
>
>
> > + return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> > }
> > }
> >
> > --
> > 2.53.0
> >
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 10:44 ` K Prateek Nayak
2026-03-27 10:58 ` Andrea Righi
1 sibling, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2026-03-27 10:44 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hello Andrea,
On 3/26/2026 8:32 PM, Andrea Righi wrote:
> /* This CPU fits with all requirements */
> - if (fits > 0)
> - return cpu;
> + if (fits > 0) {
> + if (prefer_idle_cores && on_idle_core)
> + return cpu;
> + if (!prefer_idle_cores)
> + return cpu;
nit.
Can the above two be rewritten as:
if (!prefer_idle_cores || on_idle_core)
return cpu;
since they are equivalent.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 10:44 ` K Prateek Nayak
@ 2026-03-27 10:58 ` Andrea Righi
2026-03-27 11:14 ` K Prateek Nayak
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 10:58 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Prateek,
On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/26/2026 8:32 PM, Andrea Righi wrote:
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > - return cpu;
> > + if (fits > 0) {
> > + if (prefer_idle_cores && on_idle_core)
> > + return cpu;
> > + if (!prefer_idle_cores)
> > + return cpu;
>
> nit.
>
> Can the above two be rewritten as:
>
> if (!prefer_idle_cores || on_idle_core)
> return cpu;
>
> since they are equivalent.
Oh yes, indeed.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 10:58 ` Andrea Righi
@ 2026-03-27 11:14 ` K Prateek Nayak
2026-03-27 16:39 ` Andrea Righi
0 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2026-03-27 11:14 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hello Andrea,
On 3/27/2026 4:28 PM, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
>> Hello Andrea,
>>
>> On 3/26/2026 8:32 PM, Andrea Righi wrote:
>>> /* This CPU fits with all requirements */
>>> - if (fits > 0)
>>> - return cpu;
>>> + if (fits > 0) {
>>> + if (prefer_idle_cores && on_idle_core)
>>> + return cpu;
>>> + if (!prefer_idle_cores)
>>> + return cpu;
>>
>> nit.
>>
>> Can the above two be rewritten as:
>>
>> if (!prefer_idle_cores || on_idle_core)
>> return cpu;
>>
>> since they are equivalent.
>
> Oh yes, indeed.
Also, can we just rewrite this Patch as:
(Includes feedback from Vincent; Only build tested)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700d0f145ca6..cffd5649b54e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7946,6 +7946,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
@@ -7959,6 +7960,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -7967,7 +7969,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -7976,6 +7978,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * If we are on a preferred core, translate the range of fits
+ * from [-1, 1] to [-4, -2]. This ensures that an idle core
+ * is always given priority over a (partially) busy core.
+ */
+ if (preferred_core)
+ fits -= 3;
+
/*
* First, select CPU which fits better (-1 being better than 0).
* Then, select the one with best capacity at same level.
---
My naive eyes say it should be equivalent to what you have but maybe
I'm wrong?
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 9:44 ` Andrea Righi
@ 2026-03-27 11:34 ` K Prateek Nayak
2026-03-27 20:36 ` Andrea Righi
0 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2026-03-27 11:34 UTC (permalink / raw)
To: Andrea Righi, Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hello Andrea,
On 3/27/2026 3:14 PM, Andrea Righi wrote:
> Hi Vincent,
>
> On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
>> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>>>
>>> When choosing which idle housekeeping CPU runs the idle load balancer,
>>> prefer one on a fully idle core if SMT is active, so balance can migrate
>>> work onto a CPU that still offers full effective capacity. Fall back to
>>> any idle candidate if none qualify.
>>
>> This one isn't straightforward for me. The ilb cpu will check all
>> other idle CPUs 1st and finish with itself so unless the next CPU in
>> the idle_cpus_mask is a sibling, this should not make a difference
>>
>> Did you see any perf diff ?
>
> I actually see a benefit, in particular, with the first patch applied I see
> a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> which seems pretty consistent across runs (definitely not in error range).
>
> The intention with this change was to minimize SMT noise running the ILB
> code on a fully-idle core when possible, but I also didn't expect to see
> such big difference.
>
> I'll investigate more to better understand what's happening.
Interesting! Either this "CPU-intensive workload" hates SMT turning
busy (but to an extent where performance drops visibly?) or ILB
keeps getting interrupted on an SMT sibling that is burdened by
interrupts leading to slower balance (or IRQs driving the workload
being delayed by rq_lock disabling them)
Would it be possible to share the total SCHED_SOFTIRQ time, load
balancing attempts, and utilization with and without the patch? I too
will go queue up some runs to see if this makes a difference.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
@ 2026-03-27 13:44 ` Shrikanth Hegde
1 sibling, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-03-27 13:44 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On 3/26/26 8:32 PM, Andrea Righi wrote:
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
is_core_idle() loops over all the siblings, and nohz.idle_cpus_mask
will likely contain all of them.
So that might turn out to be a bit expensive on a large SMT system such as SMT=4.
Also, this runs with interrupts disabled.
Will try to run this on a powerpc system and see if simple benchmarks show anything.
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (4 preceding siblings ...)
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
@ 2026-03-27 16:31 ` Shrikanth Hegde
2026-03-27 17:08 ` Andrea Righi
5 siblings, 1 reply; 24+ messages in thread
From: Shrikanth Hegde @ 2026-03-27 16:31 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Andrea.
On 3/26/26 8:32 PM, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
>
> = Problem =
>
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
>
How does the energy model define the OPP for SMT?
SMT systems have multiple different functional blocks: a few ALUs (arithmetic),
LSUs (load/store units), etc. If the same or a similar workload runs on the
sibling it would affect performance, but if the sibling is using different
functional blocks it would not.
So the underlying actual CPU capacity of each thread depends on what each
sibling is running. I don't understand how the firmware/energy models define this.
> = Proposed Solution =
>
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
>
> Patch set summary:
>
> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>
> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>
> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>
> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> Provided for consistency with PATCH 1/4.
>
> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>
> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> consistency with PATCH 1/4. I've also tested with/without
> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> noticed any regression.
>
> - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> When choosing the housekeeping CPU that runs the idle load balancer,
> prefer an idle CPU on a fully idle core so migrated work lands where
> effective capacity is available.
>
> The change is still consistent with the same "avoid CPUs with busy
> sibling" logic and it shows some benefits on Vera, but could have
> negative impact on other systems, I'm including it for completeness
> (feedback is appreciated).
>
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>
I assume the CPU capacity values are fixed?
The first sibling has max, while the other has less?
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
>
How is the performance measured here? Which benchmark?
By any chance are you running number_running_task <= (nr_cpus / smt_threads_per_core),
so it is all fitting nicely?
If you increase those numbers, how do the performance numbers compare?
Also, what's the system like? SMT level?
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
>
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
>
> [1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
> [2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
> [3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
>
> Andrea Righi (4):
> sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> kernel/sched/fair.c | 163 +++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/topology.c | 9 ---
> 2 files changed, 147 insertions(+), 25 deletions(-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 11:14 ` K Prateek Nayak
@ 2026-03-27 16:39 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 16:39 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Prateek,
On Fri, Mar 27, 2026 at 04:44:01PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 4:28 PM, Andrea Righi wrote:
> > On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
> >> Hello Andrea,
> >>
> >> On 3/26/2026 8:32 PM, Andrea Righi wrote:
> >>> /* This CPU fits with all requirements */
> >>> - if (fits > 0)
> >>> - return cpu;
> >>> + if (fits > 0) {
> >>> + if (prefer_idle_cores && on_idle_core)
> >>> + return cpu;
> >>> + if (!prefer_idle_cores)
> >>> + return cpu;
> >>
> >> nit.
> >>
> >> Can the above two be rewritten as:
> >>
> >> if (!prefer_idle_cores || on_idle_core)
> >> return cpu;
> >>
> >> since they are equivalent.
> >
> > Oh yes, indeed.
>
> Also, can we just rewrite this Patch as:
>
> (Includes feedback from Vincent; Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 700d0f145ca6..cffd5649b54e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7946,6 +7946,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> @@ -7959,6 +7960,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -7967,7 +7969,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -7976,6 +7978,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
>
> + /*
> + * If we are on a preferred core, translate the range of fits
> + * from [-1, 1] to [-4, -2]. This ensures that an idle core
> + * is always given priority over a (partially) busy core.
> + */
> + if (preferred_core)
> + fits -= 3;
> +
Ah, I like this trick. Yes, this definitely makes the patch more compact.
> /*
> * First, select CPU which fits better (-1 being better than 0).
> * Then, select the one with best capacity at same level.
> ---
>
> My naive eyes say it should be equivalent to what you have but maybe
> I'm wrong?
It seems correct to my naive eyes as well. Will test this out to make sure.
Unfortunately I just lost access to my system (bummer). I found another
Vera machine, but this one has a version of the firmware that exposes all
CPUs with the same highest_perf... so I can still do some testing, but not
the same one with SD_ASYM_CPUCAPACITY + SMT. I should get access to the
previous system with the different highest_perf values on Monday.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-27 16:31 ` Shrikanth Hegde
@ 2026-03-27 17:08 ` Andrea Righi
2026-03-28 6:51 ` Shrikanth Hegde
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 17:08 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 10:01:03PM +0530, Shrikanth Hegde wrote:
> Hi Andrea.
>
> On 3/26/26 8:32 PM, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> >
> > = Problem =
> >
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> >
>
> How does the energy model define the OPP for SMT?
For now, as suggested by Vincent, we should probably ignore EAS / energy
model and keep it as it is (not compatible with SMT). I'll drop PATCH 3/4
and focus only at SD_ASYM_CPUCAPACITY + SMT.
>
> SMT systems have multiple different functional blocks: a few ALUs (arithmetic),
> LSUs (load/store units), etc. If the same or a similar workload runs on the
> sibling it would affect performance, but if the sibling is using different
> functional blocks it would not.
>
> So the underlying actual CPU capacity of each thread depends on what each
> sibling is running. I don't understand how the firmware/energy models define this.
They don't and they probably shouldn't. I don't think it's possible to
model CPU capacity with a static nominal value when SMT is enabled, since
the effective capacity changes if the corresponding sibling is busy or not.
It should be up to the scheduler to figure out a reasonable way to estimate
the actual capacity, considering the status of the other sibling (e.g.,
prioritizing the fully-idle SMT cores over the partially-idle SMT cores,
like we do in other parts of the scheduler code).
>
> > = Proposed Solution =
> >
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> >
> > Patch set summary:
> >
> > - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> >
> > Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> > wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> > idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> >
> > - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> >
> > Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> > Provided for consistency with PATCH 1/4.
> >
> > - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> >
> > Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> > consistency with PATCH 1/4. I've also tested with/without
> > /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> > noticed any regression.
> >
> > - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> >
> > When choosing the housekeeping CPU that runs the idle load balancer,
> > prefer an idle CPU on a fully idle core so migrated work lands where
> > effective capacity is available.
> >
> > The change is still consistent with the same "avoid CPUs with busy
> > sibling" logic and it shows some benefits on Vera, but could have
> > negative impact on other systems, I'm including it for completeness
> > (feedback is appreciated).
> >
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> >
>
> I assume the CPU capacity values are fixed?
> The first sibling has max, while the other has less?
The firmware is exposing the same capacity for both siblings. SMT cores may
have different capacity, but siblings within the same SMT core have the
same capacity.
There was an idea to expose a higher capacity for all the 1st siblings and
a lower capacity for all the 2nd siblings, but I don't think it's a good
idea, since that would just confuse the scheduler (and the 2nd sibling
doesn't really have a lower nominal capacity if it's running alone).
>
> > Without these patches, performance can drop up to ~2x with CPU-intensive
> > workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> > account for busy SMT siblings.
> >
>
> How is the performance measured here? Which benchmark?
I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
and got similar results. I'm planning to repeat the tests using public
benchmarks and share the results as soon as I can.
> By any chance are you running number_running_task <= (nr_cpus / smt_threads_per_core),
> so it is all fitting nicely?
That's the case that gives me the optimal results.
>
> If you increase those numbers, how do the performance numbers compare?
I tried different numbers of tasks. The closer I get to system saturation,
the smaller the benefits are. When I completely saturate the system I don't
see any benefit from these changes, nor regressions, but I guess that's
expected.
>
> Also, what's the system like? SMT level?
2 siblings for each SMT core.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 11:34 ` K Prateek Nayak
@ 2026-03-27 20:36 ` Andrea Righi
2026-03-27 22:45 ` Andrea Righi
0 siblings, 1 reply; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 20:36 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates SMT turning
> busy (but to an extent where performance drops visibly?) or ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them)
>
> Would it be possible to share the total SCHED_SOFTIRQ time, load
> balancing attempts, and utilization with and without the patch? I too
> will go queue up some runs to see if this makes a difference.
Quick update: I also tried this on a Vera machine with a firmware that
exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
disabled and SMT still on of course) and I see similar performance
benefits.
Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
differences, all within error range (results produced using a vibe-coded
python script):
- baseline (stats/sec):
SCHED softirq count : 2,625
LB attempts (total) : 69,832
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 68,482 [balanced=68,472 failed=9]
CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 902 [balanced=900 failed=2]
CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 448 [balanced=441 failed=7]
CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
- with ilb-smt (stats/sec):
SCHED softirq count : 2,671
LB attempts (total) : 68,572
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 67,239 [balanced=67,197 failed=41]
CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 833 [balanced=833 failed=0]
CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 500 [balanced=488 failed=12]
CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
I'll add more direct instrumentation to check what ILB is doing
differently...
And I'll also repeat the test and collect the same metrics on the Vera
machine with the firmware that exposes different CPU capacities as soon as
I get access again.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 20:36 ` Andrea Righi
@ 2026-03-27 22:45 ` Andrea Righi
0 siblings, 0 replies; 24+ messages in thread
From: Andrea Righi @ 2026-03-27 22:45 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> > >>> work onto a CPU that still offers full effective capacity. Fall back to
> > >>> any idle candidate if none qualify.
> > >>
> > >> This one isn't straightforward for me. The ilb cpu will check all
> > >> other idle CPUs 1st and finish with itself so unless the next CPU in
> > >> the idle_cpus_mask is a sibling, this should not make a difference
> > >>
> > >> Did you see any perf diff ?
> > >
> > > I actually see a benefit, in particular, with the first patch applied I see
> > > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > > which seems pretty consistent across runs (definitely not in error range).
> > >
> > > The intention with this change was to minimize SMT noise running the ILB
> > > code on a fully-idle core when possible, but I also didn't expect to see
> > > such big difference.
> > >
> > > I'll investigate more to better understand what's happening.
> >
> > Interesting! Either this "CPU-intensive workload" hates SMT turning
> > busy (but to an extent where performance drops visibly?) or ILB
> > keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts leading to slower balance (or IRQs driving the workload
> > being delayed by rq_lock disabling them)
> >
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utilization with and without the patch? I too
> > will go queue up some runs to see if this makes a difference.
>
> Quick update: I also tried this on a Vera machine with a firmware that
> exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on of course) and I see similar performance
> benefits.
>
> Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
> differences, all within error range (results produced using a vibe-coded
> python script):
>
> - baseline (stats/sec):
>
> SCHED softirq count : 2,625
> LB attempts (total) : 69,832
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 68,482 [balanced=68,472 failed=9]
> CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 902 [balanced=900 failed=2]
> CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 448 [balanced=441 failed=7]
> CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
>
> - with ilb-smt (stats/sec):
>
> SCHED softirq count : 2,671
> LB attempts (total) : 68,572
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 67,239 [balanced=67,197 failed=41]
> CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 833 [balanced=833 failed=0]
> CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 500 [balanced=488 failed=12]
> CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
>
> I'll add more direct instrumentation to check what ILB is doing
> differently...
More data.
== SMT contention ==
tracepoint:sched:sched_switch
{
if (args->next_pid != 0) {
@busy[cpu] = 1;
} else {
delete(@busy[cpu]);
}
}
tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
$sib = (cpu + 176) % 352;
if (@busy[$sib]) {
@smt_contention++;
} else {
@smt_no_contention++;
}
}
END
{
printf("smt_contention %lld\n", (int64)@smt_contention);
printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
- baseline:
@smt_contention: 1103
@smt_no_contention: 3815
- ilb-smt:
@smt_contention: 937
@smt_no_contention: 4459
== ILB duration ==
- baseline:
@ilb_duration_us:
[0] 147 | |
[1] 354 |@ |
[2, 4) 739 |@@@ |
[4, 8) 3040 |@@@@@@@@@@@@@@@@ |
[8, 16) 9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 1267 |@@@@@@ |
[64, 128) 1607 |@@@@@@@@ |
[128, 256) 2222 |@@@@@@@@@@@ |
[256, 512) 2326 |@@@@@@@@@@@@ |
[512, 1K) 141 | |
[1K, 2K) 37 | |
[2K, 4K) 7 | |
- ilb-smt:
@ilb_duration_us:
[0] 79 | |
[1] 137 | |
[2, 4) 1440 |@@@@@@@@@@ |
[4, 8) 2897 |@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 2390 |@@@@@@@@@@@@@@@@ |
[64, 128) 2254 |@@@@@@@@@@@@@@@ |
[128, 256) 2731 |@@@@@@@@@@@@@@@@@@@ |
[256, 512) 1083 |@@@@@@@ |
[512, 1K) 265 |@ |
[1K, 2K) 29 | |
[2K, 4K) 5 | |
== rq_lock hold ==
- baseline:
@lb_rqlock_hold_us:
[0] 664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 77446 |@@@@@@ |
[2, 4) 25044 |@ |
[4, 8) 19847 |@ |
[8, 16) 2434 | |
[16, 32) 605 | |
[32, 64) 308 | |
[64, 128) 38 | |
[128, 256) 2 | |
- ilb-smt:
@lb_rqlock_hold_us:
[0] 229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 26989 |@@@@@@ |
[4, 8) 48034 |@@@@@@@@@@ |
[8, 16) 1919 | |
[16, 32) 2236 | |
[32, 64) 595 | |
[64, 128) 135 | |
[128, 256) 27 | |
From what I see, ILB runs are more expensive, but I still don't see why I'm
getting the speedup with this ilb-smt patch. I'll keep investigating...
-Andrea
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-27 17:08 ` Andrea Righi
@ 2026-03-28 6:51 ` Shrikanth Hegde
0 siblings, 0 replies; 24+ messages in thread
From: Shrikanth Hegde @ 2026-03-28 6:51 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
>> How is the performance measured here? Which benchmark?
>
> I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
> and got similar results. I'm planning to repeat the tests using public
> benchmarks and share the results as soon as I can.
>
>> By any chance you are running number_running_task <= (nr_cpus / smt_threads_per_core),
>> so it is all fitting nicely?
>
> That's the case that gives me the optimal results.
>
>>
>> If you increase those numbers, how does the performance numbers compare?
>
> I tried different numbers of tasks. The closer I get to system saturation,
> the smaller the benefits are. When I completely saturate the system I don't
> see any benefit from these changes, nor any regressions, but I guess that's
> expected.
>
Ok. That's good.
I ran hackbench on powerpc with SMT=4; I didn't observe any regressions or improvements.
Only PATCH 4/4 applies in this case, as there is no asym_cpu_capacity.
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2026-03-28 6:51 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:46 ` Andrea Righi
2026-03-27 10:44 ` K Prateek Nayak
2026-03-27 10:58 ` Andrea Righi
2026-03-27 11:14 ` K Prateek Nayak
2026-03-27 16:39 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:45 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
2026-03-27 9:44 ` Andrea Righi
2026-03-27 11:34 ` K Prateek Nayak
2026-03-27 20:36 ` Andrea Righi
2026-03-27 22:45 ` Andrea Righi
2026-03-27 13:44 ` Shrikanth Hegde
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
2026-03-27 6:52 ` Andrea Righi
2026-03-27 16:31 ` Shrikanth Hegde
2026-03-27 17:08 ` Andrea Righi
2026-03-28 6:51 ` Shrikanth Hegde