* [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-04-23 7:36 Andrea Righi
2026-04-23 7:36 ` [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
` (4 more replies)
0 siblings, 5 replies; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness.
= Problem =
Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the physical core does not deliver its full nominal
capacity. As a result, several asym-capacity paths may pick high-capacity idle
CPUs that are not actually good destinations.
= Solution =
This patch set aligns those paths with a simple rule already used elsewhere:
when SMT is active, prefer fully idle cores and avoid treating partially idle
SMT siblings as full-capacity targets where that would mislead load balancing.
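In short, the rule boils down to something like this (illustrative sketch only,
reusing existing scheduler helpers; this is not the code actually added by the
series):

/* Illustrative only: is @cpu an acceptable idle target when SMT is active? */
static bool asym_smt_idle_target(int cpu)
{
        /* Without SMT, an idle CPU delivers its full nominal capacity. */
        if (!sched_smt_active())
                return true;

        /* With SMT, only a fully idle core delivers full capacity. */
        return is_core_idle(cpu);
}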
Patch set summary:
- Attach sched_domain_shared to sd_asym_cpucapacity, so the has_idle_cores
  hint is used consistently in the wakeup idle scan.
- Prefer fully-idle SMT cores in asym-capacity idle selection: in the wakeup
  fast path, extend select_idle_capacity() / asym_fits_cpu() so idle selection
  can prefer CPUs on fully idle cores.
- Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
- Add SIS_UTIL support to select_idle_capacity(): reuse the SIS_UTIL-controlled
  idle-scan limit already used by select_idle_cpu().
- Make the asym CPU capacity idle rank values self-documenting by replacing the
  raw constants with an enum.
This patch set has been tested on the new NVIDIA Vera Rubin platform, where SMT
is enabled and the firmware exposes small frequency variations (+/-~5%) as
differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.
Alternative approaches have been evaluated, such as equalizing CPU capacities,
either by exposing uniform values via firmware or by normalizing them in the
kernel, grouping CPUs within a small capacity window (+/-5%).
However, the SMT-aware SD_ASYM_CPUCAPACITY approach has shown better results so
far. Improving this policy also seems worthwhile in general, as future platforms
may enable SMT with asymmetric CPU topologies.
Performance results on Vera Rubin, SD_ASYM_CPUCAPACITY (mainline) vs
SD_ASYM_CPUCAPACITY + SMT awareness:
- NVBLAS benchblas (one task / SMT core):
+---------------------------------+--------+
| Configuration | gflops |
+---------------------------------+--------+
| ASYM (mainline) + SIS_UTIL | 5478 |
| ASYM (mainline) + NO_SIS_UTIL | 5491 |
| | |
| NO ASYM + SIS_UTIL | 8912 |
| NO ASYM + NO_SIS_UTIL | 8978 |
| | |
| ASYM + SMT + SIS_UTIL | 9259 |
| ASYM + SMT + NO_SIS_UTIL | 9291 |
+---------------------------------+--------+
- DCPerf MediaWiki (all CPUs):
+---------------------------------+--------+--------+--------+--------+
| Configuration | rps | p50 | p95 | p99 |
+---------------------------------+--------+--------+--------+--------+
| ASYM (mainline) + SIS_UTIL | 7994 | 0.052 | 0.223 | 0.246 |
| ASYM (mainline) + NO_SIS_UTIL | 7993 | 0.052 | 0.221 | 0.245 |
| | | | | |
| NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
| NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
| | | | | |
| ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
| ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
+---------------------------------+--------+--------+--------+--------+
In the MediaWiki case SMT awareness is less impactful, because for the majority
of the run all CPUs are used, but it still seems to provide some benefit in
reducing tail latency.
See also:
- https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
- https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
Changes in v3:
- Add SIS_UTIL support to select_idle_capacity() (K Prateek Nayak)
- Attach sched_domain_shared to sd_asym_cpucapacity (K Prateek Nayak)
- Add enum for the different fit state (K Prateek Nayak)
- Update has_idle_cores hint (Vincent Guittot)
- Link to v2: https://lore.kernel.org/all/20260403053654.1559142-1-arighi@nvidia.com
Changes in v2:
- Rework SMT awareness logic in select_idle_capacity() (K Prateek Nayak)
- Drop EAS and find_new_ilb() changes for now
- Link to v1: https://lore.kernel.org/all/20260326151211.1862600-1-arighi@nvidia.com
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git asym-cpu-capacity-smt
Andrea Righi (2):
sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
K Prateek Nayak (3):
sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
sched/fair: Add SIS_UTIL support to select_idle_capacity()
sched/fair: Make asym CPU capacity idle rank values self-documenting
kernel/sched/fair.c | 107 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/topology.c | 81 +++++++++++++++++++++++++++++++-----
2 files changed, 168 insertions(+), 20 deletions(-)
* [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-04-23 7:36 ` Andrea Righi
2026-04-24 5:14 ` K Prateek Nayak
2026-04-23 7:36 ` [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
` (3 subsequent siblings)
4 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
From: K Prateek Nayak <kprateek.nayak@amd.com>
On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.
However, the has_idle_cores hint lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain; nr_busy_cpus also lives
in the same shared sched_domain data, but it's never used in the asym
CPU capacity scenario.
Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.
Fall back to attaching the shared object to sd_llc in three cases:
1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
2) CPUs in an exclusive cpuset that carves out a symmetric capacity
island: has_asym is system-wide but those CPUs have no
SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
the symmetric LLC path in select_idle_sibling();
3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
SD_NUMA-built domain. init_sched_domain_shared() keys the shared
blob off cpumask_first(span), which on overlapping NUMA domains
would alias unrelated spans onto the same blob. Keep the shared
object on the LLC there; select_idle_capacity() gracefully skips
the has_idle_cores preference when sd->shared is NULL.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 8 ++--
kernel/sched/topology.c | 81 +++++++++++++++++++++++++++++++++++------
2 files changed, 75 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353ad..934eb663f445e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- if (sched_feat(SIS_UTIL)) {
+ if (sched_feat(SIS_UTIL) && sd->shared) {
/*
* Increment because !--nr is the condition to stop scan.
*
@@ -12840,7 +12840,8 @@ static void set_cpu_sd_state_busy(int cpu)
goto unlock;
sd->nohz_idle = 0;
- atomic_inc(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_inc(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -12869,7 +12870,8 @@ static void set_cpu_sd_state_idle(int cpu)
goto unlock;
sd->nohz_idle = 1;
- atomic_dec(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d552..dc50193b198c6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -680,19 +680,39 @@ static void update_top_cache_domain(int cpu)
int id = cpu;
int size = 1;
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+ /*
+ * The shared object is attached to sd_asym_cpucapacity only when the
+ * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+ * On overlapping (NUMA) asym domains we fall back to letting the
+ * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+ * here.
+ */
+ if (sd && sd->shared)
+ sds = sd->shared;
+
+ rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
sd = highest_flag_domain(cpu, SD_SHARE_LLC);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
- /* If sd_llc exists, sd_llc_shared should exist too. */
- WARN_ON_ONCE(!sd->shared);
- sds = sd->shared;
+ /*
+ * If sd_asym_cpucapacity didn't claim the shared object,
+ * sd_llc must have one linked.
+ */
+ if (!sds) {
+ WARN_ON_ONCE(!sd->shared);
+ sds = sd->shared;
+ }
}
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;
per_cpu(sd_llc_id, cpu) = id;
+
+ /* TODO: Rename sd_llc_shared to fit the new role. */
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -711,9 +731,6 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
- sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
- rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
/*
@@ -2650,6 +2667,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
}
}
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d->sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2708,20 +2734,53 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
for_each_cpu(i, cpu_map) {
+ struct sched_domain *sd_asym = NULL;
+ bool asym_claimed = false;
+
sd = *per_cpu_ptr(d.sd, i);
if (!sd)
continue;
+ /*
+ * In case of ASYM_CPUCAPACITY, attach sd->shared to
+ * sd_asym_cpucapacity for wakeup stat tracking.
+ *
+ * Caveats:
+ *
+ * 1) has_asym is system-wide, but a given CPU may still
+ * lack an SD_ASYM_CPUCAPACITY_FULL ancestor (e.g., an
+ * exclusive cpuset carving out a symmetric capacity island).
+ * Such CPUs must fall through to the LLC seeding path below.
+ *
+ * 2) Skip the asym attach if the asym ancestor is an
+ * overlapping domain (SD_NUMA). On those topologies let the
+ * LLC path own the shared object instead.
+ *
+ * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
+ * always has more than one group else it is prone to
+ * degeneration.
+ */
+ sd_asym = sd;
+ while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+ sd_asym = sd_asym->parent;
+
+ if (sd_asym && !(sd_asym->flags & SD_NUMA)) {
+ init_sched_domain_shared(&d, sd_asym);
+ asym_claimed = true;
+ }
+
/* First, find the topmost SD_SHARE_LLC domain */
+ sd = *per_cpu_ptr(d.sd, i);
while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
sd = sd->parent;
if (sd->flags & SD_SHARE_LLC) {
- int sd_id = cpumask_first(sched_domain_span(sd));
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
+ /*
+ * Initialize the sd->shared for SD_SHARE_LLC unless
+ * the asym path above already claimed it.
+ */
+ if (!asym_claimed)
+ init_sched_domain_shared(&d, sd);
/*
* In presence of higher domains, adjust the
--
2.54.0
* [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-23 7:36 ` [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-04-23 7:36 ` Andrea Righi
2026-04-24 5:42 ` K Prateek Nayak
2026-04-23 7:36 ` [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
` (2 subsequent siblings)
4 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.
Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.
Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring them when available leads to more accurate capacity usage on
task wakeup.
On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads running a number of tasks equal to the number of
SMT cores.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 44 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 934eb663f445e..d0c8bcc0d96a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7997,6 +7997,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
@@ -8010,6 +8011,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -8018,7 +8020,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8026,9 +8028,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * fits > 0 implies we are not on a preferred core
+ * but the util fits CPU capacity. Set fits to -2 so
+ * the effective range becomes [-2, 0] where:
+ * 0 - does not fit
+ * -1 - fits with the exception of UCLAMP_MIN
+ * -2 - fits with the exception of preferred_core
+ */
+ else if (fits > 0)
+ fits = -2;
+
+ /*
+ * If we are on a preferred core, translate the range of fits
+ * of [-1, 0] to [-4, -3]. This ensures that an idle core
+ * is always given priority over (partially) busy core.
+ *
+ * A fully fitting idle core would have returned early and hence
+ * fits > 0 for preferred_core need not be dealt with.
+ */
+ if (preferred_core)
+ fits -= 3;
/*
- * First, select CPU which fits better (-1 being better than 0).
+ * First, select CPU which fits better (lower is more preferred).
* Then, select the one with best capacity at same level.
*/
if ((fits < best_fits) ||
@@ -8039,6 +8062,18 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
}
+ /*
+ * A value in the [-4, -3] range means the chosen CPU is in a fully idle
+ * SMT core. Values above -3 mean we never ranked such a CPU best.
+ *
+ * The asym-capacity wakeup path returns from select_idle_sibling()
+ * after this function and never runs select_idle_cpu(), so the usual
+ * select_idle_cpu() tail that clears idle cores must live here when the
+ * idle-core preference did not win.
+ */
+ if (prefers_idle_core && best_fits > -3)
+ set_idle_cores(target, false);
+
return best_cpu;
}
@@ -8047,12 +8082,17 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
--
2.54.0
* [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-23 7:36 ` [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
2026-04-23 7:36 ` [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-04-23 7:36 ` Andrea Righi
2026-04-24 5:37 ` K Prateek Nayak
2026-04-23 7:36 ` [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
2026-04-23 7:36 ` [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting Andrea Righi
4 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.
If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0c8bcc0d96a1..9bd9dc6e0882e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10815,10 +10815,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* We can use max_capacity here as reduction in capacity on some
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
+ *
+ * When SMT is active, only pull a misfit to dst_cpu if it is on a
+ * fully idle core; otherwise the effective capacity of the core is
+ * reduced and we may not actually provide more capacity than the
+ * source.
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
+ !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
--
2.54.0
* [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (2 preceding siblings ...)
2026-04-23 7:36 ` [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-04-23 7:36 ` Andrea Righi
2026-04-24 5:55 ` K Prateek Nayak
2026-04-24 12:32 ` Vincent Guittot
2026-04-23 7:36 ` [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting Andrea Righi
4 siblings, 2 replies; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
From: K Prateek Nayak <kprateek.nayak@amd.com>
Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the scanned domain has sched_domain_shared data, derive
the per-attempt scan limit from sd->shared->nr_idle_scan.
This bounds the walk on large domains and allows an early return once
the scan limit is reached, provided we have already picked a
sufficiently strong idle-core candidate (best_fits == -4).
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9bd9dc6e0882e..6b67049f04c3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8002,6 +8002,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
int fits, best_fits = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
+ int nr = INT_MAX;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -8010,10 +8011,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_min = uclamp_eff_value(p, UCLAMP_MIN);
util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ if (sched_feat(SIS_UTIL) && sd->shared) {
+ /*
+ * Increment because !--nr is the condition to stop scan.
+ *
+ * Since "sd" is "sd_llc" for target CPU dereferenced in the
+ * caller, it is safe to directly dereference "sd->shared".
+ * Topology bits always ensure it assigned for "sd_llc" and it
+ * cannot disappear as long as we have a RCU protected
+ * reference to one the associated "sd" here.
+ */
+ nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+ /* overloaded LLC is unlikely to have idle cpu/core */
+ if (nr == 1)
+ return -1;
+ }
+
for_each_cpu_wrap(cpu, cpus, target) {
bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
+ /* We have found a good enough target. Just use it. */
+ if (--nr <= 0 && best_fits == -4)
+ return best_cpu;
+
if (!choose_idle_cpu(cpu, p))
continue;
--
2.54.0
* [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (3 preceding siblings ...)
2026-04-23 7:36 ` [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-04-23 7:36 ` Andrea Righi
2026-04-24 4:29 ` K Prateek Nayak
4 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-23 7:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
From: K Prateek Nayak <kprateek.nayak@amd.com>
Introduce enum asym_fits_state for the select_idle_capacity() scan
preferences instead of raw constants, to improve readability.
No functional change.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++-------------
1 file changed, 34 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b67049f04c3e..a3ecbed43c34f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return idle_cpu;
}
+/*
+ * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
+ * are more preferred (see select_idle_capacity()).
+ */
+enum asym_fits_state {
+ /* In descending order of preference */
+ ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
+ ASYM_IDLE_CORE_COMPLETE_MISFIT,
+ ASYM_IDLE_THREAD_FITS,
+ ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+ ASYM_IDLE_COMPLETE_MISFIT,
+
+ /* asym_fits_cpu() bias for an idle core. */
+ ASYM_IDLE_CORE_BIAS = -3,
+};
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7999,7 +8015,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
int cpu, best_cpu = -1;
struct cpumask *cpus;
int nr = INT_MAX;
@@ -8032,7 +8048,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
unsigned long cpu_cap = capacity_of(cpu);
/* We have found a good enough target. Just use it. */
- if (--nr <= 0 && best_fits == -4)
+ if (--nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
return best_cpu;
if (!choose_idle_cpu(cpu, p))
@@ -8051,25 +8067,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpu_cap = get_actual_cpu_capacity(cpu);
/*
* fits > 0 implies we are not on a preferred core
- * but the util fits CPU capacity. Set fits to -2 so
- * the effective range becomes [-2, 0] where:
- * 0 - does not fit
- * -1 - fits with the exception of UCLAMP_MIN
- * -2 - fits with the exception of preferred_core
+ * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+ * so the effective range becomes
+ * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
+ * ASYM_IDLE_COMPLETE_MISFIT - does not fit
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+ * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
*/
else if (fits > 0)
- fits = -2;
+ fits = ASYM_IDLE_THREAD_FITS;
/*
* If we are on a preferred core, translate the range of fits
- * of [-1, 0] to [-4, -3]. This ensures that an idle core
- * is always given priority over (partially) busy core.
+ * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
+ * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
+ * This ensures that an idle core is always given priority over
+ * (partially) busy core.
*
* A fully fitting idle core would have returned early and hence
* fits > 0 for preferred_core need not be dealt with.
*/
if (preferred_core)
- fits -= 3;
+ fits += ASYM_IDLE_CORE_BIAS;
/*
* First, select CPU which fits better (lower is more preferred).
@@ -8084,15 +8103,16 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
/*
- * A value in the [-4, -3] range means the chosen CPU is in a fully idle
- * SMT core. Values above -3 mean we never ranked such a CPU best.
+ * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
+ * range means the chosen CPU is in a fully idle SMT core. Values above
+ * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
*
* The asym-capacity wakeup path returns from select_idle_sibling()
* after this function and never runs select_idle_cpu(), so the usual
* select_idle_cpu() tail that clears idle cores must live here when the
* idle-core preference did not win.
*/
- if (prefers_idle_core && best_fits > -3)
+ if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
set_idle_cores(target, false);
return best_cpu;
--
2.54.0
* Re: [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting
2026-04-23 7:36 ` [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting Andrea Righi
@ 2026-04-24 4:29 ` K Prateek Nayak
2026-04-24 5:19 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 4:29 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/23/2026 1:06 PM, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Introduce enum asym_fits_state for select_idle_capacity() scan
> preferences, instead of using raw constants to improve readability.
>
> No functional change.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
All I did was suggest an enum; you did all the work on this :-)
I think this can be squashed into Patch 3 and that way we can avoid
the magic numbers from the get-go. Thoughts?
--
Thanks and Regards,
Prateek
* Re: [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-04-23 7:36 ` [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-04-24 5:14 ` K Prateek Nayak
2026-04-24 8:46 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 5:14 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/23/2026 1:06 PM, Andrea Righi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 69361c63353ad..934eb663f445e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> - if (sched_feat(SIS_UTIL)) {
> + if (sched_feat(SIS_UTIL) && sd->shared) {
> /*
> * Increment because !--nr is the condition to stop scan.
> *
> @@ -12840,7 +12840,8 @@ static void set_cpu_sd_state_busy(int cpu)
> goto unlock;
> sd->nohz_idle = 0;
I just realised this flag only matters for accounting to "nr_busy_cpus"
and we can bail out earlier if we don't have an sd->shared altogether.
You can probably adapt this to use guard(rcu)() while you are at it,
and send these bits as a separate cleanup first, noting that the
assumption of sd_llc->shared always existing will change with the
coming patches and that you are introducing guard rails for the same.
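Something along these lines, perhaps (completely untested sketch of
set_cpu_sd_state_busy(); set_cpu_sd_state_idle() would mirror it):

static void set_cpu_sd_state_busy(int cpu)
{
        struct sched_domain *sd;

        guard(rcu)();
        sd = rcu_dereference(per_cpu(sd_llc, cpu));

        /* nohz_idle only matters for the nr_busy_cpus accounting. */
        if (!sd || !sd->shared || !sd->nohz_idle)
                return;

        sd->nohz_idle = 0;
        atomic_inc(&sd->shared->nr_busy_cpus);
}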
>
> - atomic_inc(&sd->shared->nr_busy_cpus);
> + if (sd->shared)
> + atomic_inc(&sd->shared->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> @@ -12869,7 +12870,8 @@ static void set_cpu_sd_state_idle(int cpu)
> goto unlock;
> sd->nohz_idle = 1;
>
> - atomic_dec(&sd->shared->nr_busy_cpus);
> + if (sd->shared)
> + atomic_dec(&sd->shared->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d552..dc50193b198c6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -680,19 +680,39 @@ static void update_top_cache_domain(int cpu)
> int id = cpu;
> int size = 1;
>
> + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> + /*
> + * The shared object is attached to sd_asym_cpucapacity only when the
> + * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> + * On overlapping (NUMA) asym domains we fall back to letting the
> + * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> + * here.
> + */
> + if (sd && sd->shared)
> + sds = sd->shared;
> +
> + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
> sd = highest_flag_domain(cpu, SD_SHARE_LLC);
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
>
> - /* If sd_llc exists, sd_llc_shared should exist too. */
> - WARN_ON_ONCE(!sd->shared);
> - sds = sd->shared;
> + /*
> + * If sd_asym_cpucapacity didn't claim the shared object,
> + * sd_llc must have one linked.
> + */
> + if (!sds) {
> + WARN_ON_ONCE(!sd->shared);
> + sds = sd->shared;
> + }
> }
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> per_cpu(sd_llc_id, cpu) = id;
> +
> + /* TODO: Rename sd_llc_shared to fit the new role. */
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
Would love for folks to chime in but IMO "sd_wakeup_shared" sounds
pretty reasonable since it is mainly the wakeup path that depends on
this except for one !ASYM load balancing trigger.
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -711,9 +731,6 @@ static void update_top_cache_domain(int cpu)
>
> sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> - rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> }
>
> /*
> @@ -2650,6 +2667,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> }
> }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> + int sd_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d->sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2708,20 +2734,53 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> }
>
> for_each_cpu(i, cpu_map) {
> + struct sched_domain *sd_asym = NULL;
> + bool asym_claimed = false;
> +
> sd = *per_cpu_ptr(d.sd, i);
> if (!sd)
> continue;
>
> + /*
> + * In case of ASYM_CPUCAPACITY, attach sd->shared to
> + * sd_asym_cpucapacity for wakeup stat tracking.
> + *
> + * Caveats:
> + *
> + * 1) has_asym is system-wide, but a given CPU may still
> + * lack an SD_ASYM_CPUCAPACITY_FULL ancestor (e.g., an
> + * exclusive cpuset carving out a symmetric capacity island).
> + * Such CPUs must fall through to the LLC seeding path below.
> + *
> + * 2) Skip the asym attach if the asym ancestor is an
> + * overlapping domain (SD_NUMA). On those topologies let the
> + * LLC path own the shared object instead.
> + *
> + * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
> + * always has more than one group else it is prone to
> + * degeneration.
I looked into this and we only set SD_ASYM_CPUCAPACITY if we find more
than one capacity, and SD_ASYM_CPUCAPACITY_FULL implies there are at least
two CPUs covering different capacities in the span.
The very first SD_ASYM_CPUCAPACITY_FULL domain should be safe from
degeneration when it is non-overlapping.
> + */
> + sd_asym = sd;
> + while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> + sd_asym = sd_asym->parent;
> +
> + if (sd_asym && !(sd_asym->flags & SD_NUMA)) {
> + init_sched_domain_shared(&d, sd_asym);
> + asym_claimed = true;
> + }
We should probably guard this behind a "has_asym" check. Maybe even
extract it into a separate helper if the nesting gets too deep. Thoughts?
> +
> /* First, find the topmost SD_SHARE_LLC domain */
> + sd = *per_cpu_ptr(d.sd, i);
nit.
I think this reassignment is no longer required since you use a separate
"sd_asym" variable now.
> while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> sd = sd->parent;
>
> if (sd->flags & SD_SHARE_LLC) {
> - int sd_id = cpumask_first(sched_domain_span(sd));
> -
> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> - atomic_inc(&sd->shared->ref);
> + /*
> + * Initialize the sd->shared for SD_SHARE_LLC unless
> + * the asym path above already claimed it.
> + */
> + if (!asym_claimed)
> + init_sched_domain_shared(&d, sd);
Tbh, if "has_asym" is true, we probabaly don't even need this since the
nr_busy_cpus accounting gets us nothing.
Might save a little overhead and space on those systems but I would
love to hear if there are any concerns if we just drop the
sd_llc->shared when we detect asym capacities.
>
> /*
> * In presence of higher domains, adjust the
--
Thanks and Regards,
Prateek
* Re: [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting
2026-04-24 4:29 ` K Prateek Nayak
@ 2026-04-24 5:19 ` Andrea Righi
2026-04-24 12:34 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-24 5:19 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Fri, Apr 24, 2026 at 09:59:42AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 4/23/2026 1:06 PM, Andrea Righi wrote:
> > From: K Prateek Nayak <kprateek.nayak@amd.com>
> >
> > Introduce enum asym_fits_state for select_idle_capacity() scan
> > preferences, instead of using raw constants to improve readability.
> >
> > No functional change.
> >
> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> All I did was suggest an enum; You did all the work on this :-)
Well, actually you wrote the code, I wrote the patch description. :)
> I think this can be squashed into Patch 3 and that way we have we
> can avoid the magic numbers from the get-go. Thoughts?
Yeah, that's probably better, no reason to have a separate patch. I'll do that
in the next version.
Thanks,
-Andrea
* Re: [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-04-23 7:36 ` [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-04-24 5:37 ` K Prateek Nayak
2026-04-24 9:21 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 5:37 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/23/2026 1:06 PM, Andrea Righi wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10815,10 +10815,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * We can use max_capacity here as reduction in capacity on some
> * CPUs in the group should either be possible to resolve
> * internally or be covered by avg_load imbalance (eventually).
> + *
> + * When SMT is active, only pull a misfit to dst_cpu if it is on a
> + * fully idle core; otherwise the effective capacity of the core is
> + * reduced and we may not actually provide more capacity than the
> + * source.
> */
> if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> (sgs->group_type == group_misfit_task) &&
> - (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> + ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
update_sd_pick_busiest() goes over all the groups of the domain but the
"env->dst_cpu" remains the same throughout.
Maybe you can cache the status of is_core_idle(env->dst_cpu) in lb_env
at the beginning of update_sd_lb_stats() and save on repeating the
is_core_idle() test here?
Sure, the status can change while this CPU is load balancing but in
that case the sibling going idle will start newidle balance and it
should see the correct status and do the right thing on behalf of the
core.
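For example, something like this (rough, untested sketch; "dst_core_idle"
would be a new field in struct lb_env, cached at the top of
update_sd_lb_stats()):

        env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);

and then here:

        if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
            (sgs->group_type == group_misfit_task) &&
            (!env->dst_core_idle ||
             !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
             sds->local_stat.group_type != group_has_spare))
                return false;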
> + !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
--
Thanks and Regards,
Prateek
* Re: [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-23 7:36 ` [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-04-24 5:42 ` K Prateek Nayak
0 siblings, 0 replies; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 5:42 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/23/2026 1:06 PM, Andrea Righi wrote:
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
Thank you for chasing and cleaning this up. After squashing in changes
from Patch 5, feel free to include:
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-23 7:36 ` [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-04-24 5:55 ` K Prateek Nayak
2026-04-24 12:32 ` Vincent Guittot
1 sibling, 0 replies; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 5:55 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/23/2026 1:06 PM, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> is enabled and the LLC domain has sched_domain_shared data, derive the
> per-attempt scan limit from sd->shared->nr_idle_scan.
>
> That bounds the walk on large LLCs and allows an early return once the
> scan limit is reached, if we already picked a sufficiently strong
> idle-core candidate (best_fits == -4).
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
I think this still requires your S-o-b ;-)
Thank you again for cleaning this up and testing it. Since you mentioned
having tested on Grace without SMT, I don't think this will be a problem
for other CAS use cases either, although I'd love to hear whether Android,
with the overloaded state, could run into any problems when that state is
triggered.
--
Thanks and Regards,
Prateek
* Re: [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-04-24 5:14 ` K Prateek Nayak
@ 2026-04-24 8:46 ` Andrea Righi
2026-04-24 11:18 ` K Prateek Nayak
0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-24 8:46 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Fri, Apr 24, 2026 at 10:44:09AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 4/23/2026 1:06 PM, Andrea Righi wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69361c63353ad..934eb663f445e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > int i, cpu, idle_cpu = -1, nr = INT_MAX;
> >
> > - if (sched_feat(SIS_UTIL)) {
> > + if (sched_feat(SIS_UTIL) && sd->shared) {
> > /*
> > * Increment because !--nr is the condition to stop scan.
> > *
> > @@ -12840,7 +12840,8 @@ static void set_cpu_sd_state_busy(int cpu)
> > goto unlock;
> > sd->nohz_idle = 0;
>
> I just realised this flag only matters for accounting to "nr_busy_cpus"
> and we can bail out earlier if we don't have an sd->shared altogether.
>
> You can probably adapt this to use guard(rcu)() while you are at it
> and send these bits as a separate cleanup first saying that the
> assumption of sd_llc->shared always existing will change with the
> coming patches and you are introducing guard rails for the same.
Ack.
>
> >
> > - atomic_inc(&sd->shared->nr_busy_cpus);
> > + if (sd->shared)
> > + atomic_inc(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > @@ -12869,7 +12870,8 @@ static void set_cpu_sd_state_idle(int cpu)
> > goto unlock;
> > sd->nohz_idle = 1;
> >
> > - atomic_dec(&sd->shared->nr_busy_cpus);
> > + if (sd->shared)
> > + atomic_dec(&sd->shared->nr_busy_cpus);
> > unlock:
> > rcu_read_unlock();
> > }
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 5847b83d9d552..dc50193b198c6 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -680,19 +680,39 @@ static void update_top_cache_domain(int cpu)
> > int id = cpu;
> > int size = 1;
> >
> > + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> > + /*
> > + * The shared object is attached to sd_asym_cpucapacity only when the
> > + * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> > + * On overlapping (NUMA) asym domains we fall back to letting the
> > + * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> > + * here.
> > + */
> > + if (sd && sd->shared)
> > + sds = sd->shared;
> > +
> > + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> > +
> > sd = highest_flag_domain(cpu, SD_SHARE_LLC);
> > if (sd) {
> > id = cpumask_first(sched_domain_span(sd));
> > size = cpumask_weight(sched_domain_span(sd));
> >
> > - /* If sd_llc exists, sd_llc_shared should exist too. */
> > - WARN_ON_ONCE(!sd->shared);
> > - sds = sd->shared;
> > + /*
> > + * If sd_asym_cpucapacity didn't claim the shared object,
> > + * sd_llc must have one linked.
> > + */
> > + if (!sds) {
> > + WARN_ON_ONCE(!sd->shared);
> > + sds = sd->shared;
> > + }
> > }
> >
> > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > per_cpu(sd_llc_size, cpu) = size;
> > per_cpu(sd_llc_id, cpu) = id;
> > +
> > + /* TODO: Rename sd_llc_shared to fit the new role. */
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> Would love for folks to chime in but IMO "sd_wakeup_shared" sounds
> pretty reasonable since it is mainly the wakeup path that depends on
> this except for one !ASYM load balancing trigger.
sd_wakeup_shared captures the bigger consumer (wakeup), but not the nohz
balancer kick logic.
Maybe "sd_balance_shared" (balance in a broad sense, since wakeup still
affects balancing in the end) or "sd_effective_shared" (if we want to
stress that the attachment may move between topology levels: LLC vs asym)?
>
> >
> > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > @@ -711,9 +731,6 @@ static void update_top_cache_domain(int cpu)
> >
> > sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> > rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> > -
> > - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> > - rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> > }
> >
> > /*
> > @@ -2650,6 +2667,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> > }
> > }
> >
> > +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> > +{
> > + int sd_id = cpumask_first(sched_domain_span(sd));
> > +
> > + sd->shared = *per_cpu_ptr(d->sds, sd_id);
> > + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> > + atomic_inc(&sd->shared->ref);
> > +}
> > +
> > /*
> > * Build sched domains for a given set of CPUs and attach the sched domains
> > * to the individual CPUs
> > @@ -2708,20 +2734,53 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > }
> >
> > for_each_cpu(i, cpu_map) {
> > + struct sched_domain *sd_asym = NULL;
> > + bool asym_claimed = false;
> > +
> > sd = *per_cpu_ptr(d.sd, i);
> > if (!sd)
> > continue;
> >
> > + /*
> > + * In case of ASYM_CPUCAPACITY, attach sd->shared to
> > + * sd_asym_cpucapacity for wakeup stat tracking.
> > + *
> > + * Caveats:
> > + *
> > + * 1) has_asym is system-wide, but a given CPU may still
> > + * lack an SD_ASYM_CPUCAPACITY_FULL ancestor (e.g., an
> > + * exclusive cpuset carving out a symmetric capacity island).
> > + * Such CPUs must fall through to the LLC seeding path below.
> > + *
> > + * 2) Skip the asym attach if the asym ancestor is an
> > + * overlapping domain (SD_NUMA). On those topologies let the
> > + * LLC path own the shared object instead.
> > + *
> > + * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
> > + * always has more than one group else it is prone to
> > + * degeneration.
>
> I looked into this and we only set SD_ASYM_CPUCAPACITY if we find more
> than one capacity and SD_ASYM_CPUCAPACITY_FULL implies there are atleast
> two CPUs covering differnt capcities in the span.
>
> The very first SD_ASYM_CPUCAPACITY_FULL domain should be safe from
> degeneration when it is non-overlapping.
Makes sense, maybe we can replace the XXX part with note like this:
* Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct
* capacities exist in the domain span, so the asym domain we attach
* to cannot degenerate into a single-capacity group. The relevant
* edge cases are instead covered by the caveats above.
>
> > + */
> > + sd_asym = sd;
> > + while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> > + sd_asym = sd_asym->parent;
> > +
> > + if (sd_asym && !(sd_asym->flags & SD_NUMA)) {
> > + init_sched_domain_shared(&d, sd_asym);
> > + asym_claimed = true;
> > + }
>
> We should probably guard this behind a "has_asym" check. Maybe even
> extract into a sperate helper if the nesting gets too deep. Thoughts?
Ack, we can add an `if (has_asym)` check as quick-skip logic and fold the
walk + NUMA check into a small helper.
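Roughly something like this (untested sketch; the helper name is made up):

/* Find the non-overlapping SD_ASYM_CPUCAPACITY_FULL ancestor, if any. */
static struct sched_domain *find_asym_shared_domain(struct sched_domain *sd)
{
        while (sd && !(sd->flags & SD_ASYM_CPUCAPACITY_FULL))
                sd = sd->parent;

        /* Overlapping (NUMA) asym domains keep the shared object on the LLC. */
        if (sd && (sd->flags & SD_NUMA))
                return NULL;

        return sd;
}

and in the attach loop:

        if (has_asym) {
                sd_asym = find_asym_shared_domain(sd);
                if (sd_asym) {
                        init_sched_domain_shared(&d, sd_asym);
                        asym_claimed = true;
                }
        }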
>
> > +
> > /* First, find the topmost SD_SHARE_LLC domain */
> > + sd = *per_cpu_ptr(d.sd, i);
>
> nit.
>
> I think this reassignment is no longer required since you use a separate
> "sd_asym" variable now.
Ack.
>
> > while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> > sd = sd->parent;
> >
> > if (sd->flags & SD_SHARE_LLC) {
> > - int sd_id = cpumask_first(sched_domain_span(sd));
> > -
> > - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> > - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> > - atomic_inc(&sd->shared->ref);
> > + /*
> > + * Initialize the sd->shared for SD_SHARE_LLC unless
> > + * the asym path above already claimed it.
> > + */
> > + if (!asym_claimed)
> > + init_sched_domain_shared(&d, sd);
>
> Tbh, if "has_asym" is true, we probabaly don't even need this since the
> nr_busy_cpus accounting gets us nothing.
>
> Might save a little overhead and space on those systems but I would
> love to hear if there are any concerns if we just drop the
> sd_llc->shared when we detect asym capacities.
Hm... but "has_asym" is global, we may still need LLC-owned shared for symmetric
islands and NUMA-overlap cases, no?
Thanks,
-Andrea
* Re: [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-04-24 5:37 ` K Prateek Nayak
@ 2026-04-24 9:21 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2026-04-24 9:21 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Fri, Apr 24, 2026 at 11:07:01AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 4/23/2026 1:06 PM, Andrea Righi wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10815,10 +10815,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > * We can use max_capacity here as reduction in capacity on some
> > * CPUs in the group should either be possible to resolve
> > * internally or be covered by avg_load imbalance (eventually).
> > + *
> > + * When SMT is active, only pull a misfit to dst_cpu if it is on a
> > + * fully idle core; otherwise the effective capacity of the core is
> > + * reduced and we may not actually provide more capacity than the
> > + * source.
> > */
> > if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> > (sgs->group_type == group_misfit_task) &&
> > - (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> > + ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
>
> update_sd_pick_busiest() goes over all the groups of the domain but the
> "env->dst_cpu" remains the same throughout.
>
> Maybe you can cache the status of is_core_idle(env->dst_cpu) in lb_env
> at the beginning of update_sd_lb_stats() and save on repeating the
> is_core_idle() test here?
>
> Sure, the status can change while this CPU is load balancing but in
> that case the sibling going idle will start newidle balance and it
> should see the correct status and do the right thing on behalf of the
> core.
Yep, we can definitely cache is_core_idle(env->dst_cpu), I'll change that.
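Roughly something like this (untested sketch; "dst_core_idle" is just a
placeholder name):

/* in struct lb_env: */
        unsigned int            dst_core_idle:1;

/* once, at the beginning of update_sd_lb_stats(): */
        env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);

so the per-group test in update_sd_pick_busiest() collapses to checking
!env->dst_core_idle instead of re-evaluating sched_smt_active() and
is_core_idle(env->dst_cpu) for every group.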
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-04-24 8:46 ` Andrea Righi
@ 2026-04-24 11:18 ` K Prateek Nayak
2026-04-24 23:29 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-24 11:18 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/24/2026 2:16 PM, Andrea Righi wrote:
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>> per_cpu(sd_llc_id, cpu) = id;
>>> +
>>> + /* TODO: Rename sd_llc_shared to fit the new role. */
>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>
>> Would love for folks to chime in but IMO "sd_wakeup_shared" sounds
>> pretty reasonable since it is mainly the wakeup path that depends on
>> this except for one !ASYM load balancing trigger.
>
> sd_wakeup_shared captures the bigger consumer (wakeup), but not the nohz
> balancer kick logic.
>
> Maybe "sd_balance_shared" (balance in a broad sense, wakeup is still affecting
> balancing at the end) or "sd_effective_shared" (if we want to stress that
> topology may move: LLC vs asym)?
Works for me! I don't have any strong feelings on this.
[..snip..]
>>> + /*
>>> + * In case of ASYM_CPUCAPACITY, attach sd->shared to
>>> + * sd_asym_cpucapacity for wakeup stat tracking.
>>> + *
>>> + * Caveats:
>>> + *
>>> + * 1) has_asym is system-wide, but a given CPU may still
>>> + * lack an SD_ASYM_CPUCAPACITY_FULL ancestor (e.g., an
>>> + * exclusive cpuset carving out a symmetric capacity island).
>>> + * Such CPUs must fall through to the LLC seeding path below.
>>> + *
>>> + * 2) Skip the asym attach if the asym ancestor is an
>>> + * overlapping domain (SD_NUMA). On those topologies let the
>>> + * LLC path own the shared object instead.
>>> + *
>>> + * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
>>> + * always has more than one group else it is prone to
>>> + * degeneration.
>>
>> I looked into this and we only set SD_ASYM_CPUCAPACITY if we find more
>> than one capacity, and SD_ASYM_CPUCAPACITY_FULL implies there are at least
>> two CPUs covering different capacities in the span.
>>
>> The very first SD_ASYM_CPUCAPACITY_FULL domain should be safe from
>> degeneration when it is non-overlapping.
>
> Makes sense, maybe we can replace the XXX part with a note like this:
>
> * Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct
> * capacities exist in the domain span, so the asym domain we attach
> * to cannot degenerate into a single-capacity group. The relevant
> * edge cases are instead covered by the caveats above.
Ack! That should make it clear. Thank you.
[..snip..]
>>> if (sd->flags & SD_SHARE_LLC) {
>>> - int sd_id = cpumask_first(sched_domain_span(sd));
>>> -
>>> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
>>> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>>> - atomic_inc(&sd->shared->ref);
>>> + /*
>>> + * Initialize the sd->shared for SD_SHARE_LLC unless
>>> + * the asym path above already claimed it.
>>> + */
>>> + if (!asym_claimed)
>>> + init_sched_domain_shared(&d, sd);
>>
>> Tbh, if "has_asym" is true, we probabaly don't even need this since the
>> nr_busy_cpus accounting gets us nothing.
>>
>> Might save a little overhead and space on those systems but I would
>> love to hear if there are any concerns if we just drop the
>> sd_llc->shared when we detect asym capacities.
>
> Hm... but "has_asym" is global, we may still need LLC-owned shared for symmetric
> islands and NUMA-overlap cases, no?
"has_asym" is local to build_sched_domains() right? So it should operate
per cpuset partition since we call build_sched_domains() for every
"cpu_map".
Ack to everything that was snipped off!
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-23 7:36 ` [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
2026-04-24 5:55 ` K Prateek Nayak
@ 2026-04-24 12:32 ` Vincent Guittot
2026-04-24 17:13 ` Andrea Righi
2026-04-27 5:13 ` K Prateek Nayak
1 sibling, 2 replies; 23+ messages in thread
From: Vincent Guittot @ 2026-04-24 12:32 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
On Thu, 23 Apr 2026 at 09:42, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> is enabled and the LLC domain has sched_domain_shared data, derive the
> per-attempt scan limit from sd->shared->nr_idle_scan.
>
> That bounds the walk on large LLCs and allows an early return once the
> scan limit is reached, if we already picked a sufficiently strong
> idle-core candidate (best_fits == -4).
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9bd9dc6e0882e..6b67049f04c3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8002,6 +8002,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
> + int nr = INT_MAX;
>
> cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -8010,10 +8011,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_min = uclamp_eff_value(p, UCLAMP_MIN);
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> + if (sched_feat(SIS_UTIL) && sd->shared) {
> + /*
> + * Increment because !--nr is the condition to stop scan.
> + *
> + * Since "sd" is "sd_llc" for target CPU dereferenced in the
> + * caller, it is safe to directly dereference "sd->shared".
> + * Topology bits always ensure it assigned for "sd_llc" and it
> + * cannot disappear as long as we have a RCU protected
> + * reference to one the associated "sd" here.
> + */
> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> + /* overloaded LLC is unlikely to have idle cpu/core */
> + if (nr == 1)
> + return -1;
The comment below applies to select_idle_cpu() but we want the same behavior
for both functions.
If test_idle_cores() is true we will not look for it, whereas we don't
care about the nr value when test_idle_cores() is true in the
for_each_cpu_wrap() loop.
> + }
> +
> for_each_cpu_wrap(cpu, cpus, target) {
> bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> + /* We have found a good enough target. Just use it. */
> + if (--nr <= 0 && best_fits == -4)
> + return best_cpu;
In select_idle_cpu(), we return immediately when nr == 0 and
test_idle_cores() is false, but we loop over all CPUs if test_idle_cores() is
true until we find an idle core. In the case of
select_idle_capacity(), I agree that util_fits_cpu() adds another level,
but shouldn't we continue to loop even if we found a best_fits == -4?
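For reference, the relevant shape of select_idle_cpu() is roughly the following
(paraphrased from mainline; exact details may differ):

        if (sched_feat(SIS_UTIL)) {
                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
                if (sd_share) {
                        /* because !--nr is the condition to stop scan */
                        nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
                        /* overloaded LLC is unlikely to have idle cpu/core */
                        if (nr == 1)
                                return -1;
                }
        }

        for_each_cpu_wrap(cpu, cpus, target + 1) {
                if (has_idle_core) {
                        i = select_idle_core(p, cpu, cpus, &idle_cpu);
                        if ((unsigned int)i < nr_cpumask_bits)
                                return i;
                } else {
                        /* the scan cap only applies when not hunting for an idle core */
                        if (--nr <= 0)
                                return -1;
                        idle_cpu = __select_idle_cpu(cpu, p);
                        if ((unsigned int)idle_cpu < nr_cpumask_bits)
                                break;
                }
        }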
> +
> if (!choose_idle_cpu(cpu, p))
> continue;
>
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting
2026-04-24 5:19 ` Andrea Righi
@ 2026-04-24 12:34 ` Vincent Guittot
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2026-04-24 12:34 UTC (permalink / raw)
To: Andrea Righi
Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
On Fri, 24 Apr 2026 at 07:19, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Prateek,
>
> On Fri, Apr 24, 2026 at 09:59:42AM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 4/23/2026 1:06 PM, Andrea Righi wrote:
> > > From: K Prateek Nayak <kprateek.nayak@amd.com>
> > >
> > > Introduce enum asym_fits_state for select_idle_capacity() scan
> > > preferences, instead of using raw constants to improve readability.
> > >
> > > No functional change.
> > >
> > > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >
> > All I did was suggest an enum; you did all the work on this :-)
>
> Well, actually you wrote the code, I wrote the patch description. :)
>
> > I think this can be squashed into Patch 3 and that way we
> > can avoid the magic numbers from the get-go. Thoughts?
>
> Yeah, that's probably better, no reason to have a separate patch. I'll do that
> in the next version.
I agree that this should be merged with patch 2
>
> Thanks,
> -Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-24 12:32 ` Vincent Guittot
@ 2026-04-24 17:13 ` Andrea Righi
2026-04-27 5:13 ` K Prateek Nayak
1 sibling, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2026-04-24 17:13 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Vincent,
On Fri, Apr 24, 2026 at 02:32:30PM +0200, Vincent Guittot wrote:
> On Thu, 23 Apr 2026 at 09:42, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > From: K Prateek Nayak <kprateek.nayak@amd.com>
> >
> > Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> > mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> > is enabled and the LLC domain has sched_domain_shared data, derive the
> > per-attempt scan limit from sd->shared->nr_idle_scan.
> >
> > That bounds the walk on large LLCs and allows an early return once the
> > scan limit is reached, if we already picked a sufficiently strong
> > idle-core candidate (best_fits == -4).
> >
> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > ---
> > kernel/sched/fair.c | 21 +++++++++++++++++++++
> > 1 file changed, 21 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9bd9dc6e0882e..6b67049f04c3e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8002,6 +8002,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > int fits, best_fits = 0;
> > int cpu, best_cpu = -1;
> > struct cpumask *cpus;
> > + int nr = INT_MAX;
> >
> > cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > @@ -8010,10 +8011,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > util_min = uclamp_eff_value(p, UCLAMP_MIN);
> > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > + if (sched_feat(SIS_UTIL) && sd->shared) {
> > + /*
> > + * Increment because !--nr is the condition to stop scan.
> > + *
> > + * Since "sd" is "sd_llc" for target CPU dereferenced in the
> > + * caller, it is safe to directly dereference "sd->shared".
> > + * Topology bits always ensure it assigned for "sd_llc" and it
> > + * cannot disappear as long as we have a RCU protected
> > + * reference to one the associated "sd" here.
> > + */
> > + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> > + /* overloaded LLC is unlikely to have idle cpu/core */
> > + if (nr == 1)
> > + return -1;
>
> The comment below applies to select_idle_cpu but we want same behavior
> for both function
> If test_idle_cores is true we will not look for it whereas we don't
> care about nr value when test_idle_core is true in the
> for_each_cpu_wrap loop
>
>
> > + }
> > +
> > for_each_cpu_wrap(cpu, cpus, target) {
> > bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > unsigned long cpu_cap = capacity_of(cpu);
> >
> > + /* We have found a good enough target. Just use it. */
> > + if (--nr <= 0 && best_fits == -4)
> > + return best_cpu;
>
> In select_idle_cpu(), we return immediatly when nr == 0 and
> test_idle_cores is false but we loop on all cpus if test_idle_cores is
> true until we found an idle core. In the case of
> select_idle_capacity(), I agree that util_fits_cpu() add another level
> but shouldn't we continue to loop even if we found a best_fits == -4
>
Agreed that we should keep the behavior consistent between select_idle_cpu() and
select_idle_capacity().
I ran some quick tests with the nr / early-return logic matching select_idle_cpu()
(applying the SIS_UTIL scan cap only with !prefers_idle_core). So far, I'm not
seeing any noticeable performance difference on my side, so that looks fine to me.
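The variant I tried looks roughly like this (sketch only; whether to return
best_cpu or -1 when the cap is hit is still open):

        for_each_cpu_wrap(cpu, cpus, target) {
                bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
                unsigned long cpu_cap = capacity_of(cpu);

                /*
                 * Apply the SIS_UTIL scan cap only when we are not insisting
                 * on a fully idle core, mirroring select_idle_cpu().
                 */
                if (!prefers_idle_core && --nr <= 0)
                        return best_cpu;
                ...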
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-04-24 11:18 ` K Prateek Nayak
@ 2026-04-24 23:29 ` Andrea Righi
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Righi @ 2026-04-24 23:29 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Fri, Apr 24, 2026 at 04:48:59PM +0530, K Prateek Nayak wrote:
...
> >>> if (sd->flags & SD_SHARE_LLC) {
> >>> - int sd_id = cpumask_first(sched_domain_span(sd));
> >>> -
> >>> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> >>> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> >>> - atomic_inc(&sd->shared->ref);
> >>> + /*
> >>> + * Initialize the sd->shared for SD_SHARE_LLC unless
> >>> + * the asym path above already claimed it.
> >>> + */
> >>> + if (!asym_claimed)
> >>> + init_sched_domain_shared(&d, sd);
> >>
> >> Tbh, if "has_asym" is true, we probabaly don't even need this since the
> >> nr_busy_cpus accounting gets us nothing.
> >>
> >> Might save a little overhead and space on those systems but I would
> >> love to hear if there are any concerns if we just drop the
> >> sd_llc->shared when we detect asym capacities.
> >
> > Hm... but "has_asym" is global, we may still need LLC-owned shared for symmetric
> > islands and NUMA-overlap cases, no?
>
> "has_asym" is local to build_sched_domains() right? So it should operate
> per cpuset partition since we call build_sched_domains() for every
> "cpu_map".
Yeah, sorry, I meant has_asym is aggregated across the whole cpu_map (not
machine-wide).
Just to make sure I understand correctly: one CPU in the map can hit the asym
flags while another never claims the asym shared and would still get the
sd_llc->shared, but that shared data would be unused anyway, so why
allocate/initialize it, right?
In that case it'd make sense, but it sounds a bit bug-prone; is the extra risk
worth the overhead/space saved? Also, what if we decide to wire nr_busy_cpus
through the asym-capacity path in the future?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-24 12:32 ` Vincent Guittot
2026-04-24 17:13 ` Andrea Righi
@ 2026-04-27 5:13 ` K Prateek Nayak
2026-04-27 8:35 ` Vincent Guittot
1 sibling, 1 reply; 23+ messages in thread
From: K Prateek Nayak @ 2026-04-27 5:13 UTC (permalink / raw)
To: Vincent Guittot, Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
Joel Fernandes, Shrikanth Hegde, linux-kernel
Hello Vincent,
On 4/24/2026 6:02 PM, Vincent Guittot wrote:
>> + if (sched_feat(SIS_UTIL) && sd->shared) {
>> + /*
>> + * Increment because !--nr is the condition to stop scan.
>> + *
>> + * Since "sd" is "sd_llc" for target CPU dereferenced in the
>> + * caller, it is safe to directly dereference "sd->shared".
>> + * Topology bits always ensure it assigned for "sd_llc" and it
>> + * cannot disappear as long as we have a RCU protected
>> + * reference to one the associated "sd" here.
>> + */
>> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
>> + /* overloaded LLC is unlikely to have idle cpu/core */
>> + if (nr == 1)
>> + return -1;
>
> The comment below applies to select_idle_cpu but we want same behavior
> for both function
> If test_idle_cores is true we will not look for it whereas we don't
> care about nr value when test_idle_core is true in the
> for_each_cpu_wrap loop
Ack, but the initial "nr"-based bailout in select_idle_cpu() applies even
when test_idle_cores() is true, right?
>
>
>> + }
>> +
>> for_each_cpu_wrap(cpu, cpus, target) {
>> bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
>> unsigned long cpu_cap = capacity_of(cpu);
>>
>> + /* We have found a good enough target. Just use it. */
>> + if (--nr <= 0 && best_fits == -4)
>> + return best_cpu;
>
> In select_idle_cpu(), we return immediatly when nr == 0 and
> test_idle_cores is false but we loop on all cpus if test_idle_cores is
> true until we found an idle core. In the case of
> select_idle_capacity(), I agree that util_fits_cpu() add another level
> but shouldn't we continue to loop even if we found a best_fits == -4
I see what you mean! We can additionally guard the bailout with
"!prefers_idle_core".
For the case where test_idle_cores() returns false, is stopping at a
UCLAMP_MIN-restricted CPU alright if the SIS_UTIL bailout suggests it might
not be fruitful to search further?
Do you think it might trigger too much misfit balancing?
>
>> +
>> if (!choose_idle_cpu(cpu, p))
>> continue;
>>
>> --
>> 2.54.0
>>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-27 5:13 ` K Prateek Nayak
@ 2026-04-27 8:35 ` Vincent Guittot
2026-04-27 16:01 ` Andrea Righi
0 siblings, 1 reply; 23+ messages in thread
From: Vincent Guittot @ 2026-04-27 8:35 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
On Mon, 27 Apr 2026 at 07:13, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 4/24/2026 6:02 PM, Vincent Guittot wrote:
> >> + if (sched_feat(SIS_UTIL) && sd->shared) {
> >> + /*
> >> + * Increment because !--nr is the condition to stop scan.
> >> + *
> >> + * Since "sd" is "sd_llc" for target CPU dereferenced in the
> >> + * caller, it is safe to directly dereference "sd->shared".
> >> + * Topology bits always ensure it assigned for "sd_llc" and it
> >> + * cannot disappear as long as we have a RCU protected
> >> + * reference to one the associated "sd" here.
> >> + */
> >> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> >> + /* overloaded LLC is unlikely to have idle cpu/core */
> >> + if (nr == 1)
> >> + return -1;
> >
> > The comment below applies to select_idle_cpu but we want same behavior
> > for both function
> > If test_idle_cores is true we will not look for it whereas we don't
> > care about nr value when test_idle_core is true in the
> > for_each_cpu_wrap loop
>
> Ack but the initial "nr" based bailout in select_idle_cpu() applies even
> when test_idle_cores() is true too right?
Yes, there is an early bailout for nr == 1 even with an idle core.
>
> >
> >
> >> + }
> >> +
> >> for_each_cpu_wrap(cpu, cpus, target) {
> >> bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >> unsigned long cpu_cap = capacity_of(cpu);
> >>
> >> + /* We have found a good enough target. Just use it. */
> >> + if (--nr <= 0 && best_fits == -4)
> >> + return best_cpu;
> >
> > In select_idle_cpu(), we return immediatly when nr == 0 and
> > test_idle_cores is false but we loop on all cpus if test_idle_cores is
> > true until we found an idle core. In the case of
> > select_idle_capacity(), I agree that util_fits_cpu() add another level
> > but shouldn't we continue to loop even if we found a best_fits == -4
>
> I see what you mean! We can additionally guard the bailout on
> "!prefers_idle_core".
>
> For the case where test_idle_cores() returns false, is stopping for a
> UCLAMP_MIN restricted CPU alright if SIS_UTIL bailout suggests it might
> not be fruitful to search further?
>
> Do you think it might trigger too many misfit balancing?
For SMT, I don't think this will make a real difference, but for !SMT this
would indeed trigger more misfits.
>
> >
> >> +
> >> if (!choose_idle_cpu(cpu, p))
> >> continue;
> >>
> >> --
> >> 2.54.0
> >>
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-27 8:35 ` Vincent Guittot
@ 2026-04-27 16:01 ` Andrea Righi
2026-04-27 17:26 ` Vincent Guittot
0 siblings, 1 reply; 23+ messages in thread
From: Andrea Righi @ 2026-04-27 16:01 UTC (permalink / raw)
To: Vincent Guittot
Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
Hi Vincent and Prateek,
On Mon, Apr 27, 2026 at 10:35:59AM +0200, Vincent Guittot wrote:
> On Mon, 27 Apr 2026 at 07:13, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >
> > Hello Vincent,
> >
> > On 4/24/2026 6:02 PM, Vincent Guittot wrote:
> > >> + if (sched_feat(SIS_UTIL) && sd->shared) {
> > >> + /*
> > >> + * Increment because !--nr is the condition to stop scan.
> > >> + *
> > >> + * Since "sd" is "sd_llc" for target CPU dereferenced in the
> > >> + * caller, it is safe to directly dereference "sd->shared".
> > >> + * Topology bits always ensure it assigned for "sd_llc" and it
> > >> + * cannot disappear as long as we have a RCU protected
> > >> + * reference to one the associated "sd" here.
> > >> + */
> > >> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> > >> + /* overloaded LLC is unlikely to have idle cpu/core */
> > >> + if (nr == 1)
> > >> + return -1;
> > >
> > > The comment below applies to select_idle_cpu but we want same behavior
> > > for both function
> > > If test_idle_cores is true we will not look for it whereas we don't
> > > care about nr value when test_idle_core is true in the
> > > for_each_cpu_wrap loop
> >
> > Ack but the initial "nr" based bailout in select_idle_cpu() applies even
> > when test_idle_cores() is true too right?
>
> Yes there is an early bail out for nr == 1 even with idle core.
I did some tests changing the early bailout condition to
!prefers_idle_core && nr == 1.
On Vera I see small performance regressions with this (4-5% slowdown),
especially when approaching system saturation, which I think makes sense.
At this point I'd personally stick with the unconditional early bailout
condition, considering that it's still consistent with select_idle_cpu() and it
seems to provide slightly better results. What do you think?
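For reference, the variant I tested is roughly the following (sketch only):

        nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
        /* tested variant: only bail out early when not hunting for an idle core */
        if (!prefers_idle_core && nr == 1)
                return -1;

vs. the unconditional "if (nr == 1) return -1;" currently in the patch.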
>
> >
> > >
> > >
> > >> + }
> > >> +
> > >> for_each_cpu_wrap(cpu, cpus, target) {
> > >> bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > >> unsigned long cpu_cap = capacity_of(cpu);
> > >>
> > >> + /* We have found a good enough target. Just use it. */
> > >> + if (--nr <= 0 && best_fits == -4)
> > >> + return best_cpu;
> > >
> > > In select_idle_cpu(), we return immediatly when nr == 0 and
> > > test_idle_cores is false but we loop on all cpus if test_idle_cores is
> > > true until we found an idle core. In the case of
> > > select_idle_capacity(), I agree that util_fits_cpu() add another level
> > > but shouldn't we continue to loop even if we found a best_fits == -4
> >
> > I see what you mean! We can additionally guard the bailout on
> > "!prefers_idle_core".
> >
> > For the case where test_idle_cores() returns false, is stopping for a
> > UCLAMP_MIN restricted CPU alright if SIS_UTIL bailout suggests it might
> > not be fruitful to search further?
> >
> > Do you think it might trigger too many misfit balancing?
>
> For SMT, I don't think this will make a real diff but for !SMT this
> would indeed trigger more misfits.
You mean the SIS_UTIL mechanism in general (not the extra !prefers_idle_core
guard), right? Because in the !SMT case, prefers_idle_core is always false.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-04-27 16:01 ` Andrea Righi
@ 2026-04-27 17:26 ` Vincent Guittot
0 siblings, 0 replies; 23+ messages in thread
From: Vincent Guittot @ 2026-04-27 17:26 UTC (permalink / raw)
To: Andrea Righi
Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel
On Mon, 27 Apr 2026 at 18:02, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Vincent and Prateek,
>
> On Mon, Apr 27, 2026 at 10:35:59AM +0200, Vincent Guittot wrote:
> > On Mon, 27 Apr 2026 at 07:13, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > >
> > > Hello Vincent,
> > >
> > > On 4/24/2026 6:02 PM, Vincent Guittot wrote:
> > > >> + if (sched_feat(SIS_UTIL) && sd->shared) {
> > > >> + /*
> > > >> + * Increment because !--nr is the condition to stop scan.
> > > >> + *
> > > >> + * Since "sd" is "sd_llc" for target CPU dereferenced in the
> > > >> + * caller, it is safe to directly dereference "sd->shared".
> > > >> + * Topology bits always ensure it assigned for "sd_llc" and it
> > > >> + * cannot disappear as long as we have a RCU protected
> > > >> + * reference to one the associated "sd" here.
> > > >> + */
> > > >> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> > > >> + /* overloaded LLC is unlikely to have idle cpu/core */
> > > >> + if (nr == 1)
> > > >> + return -1;
> > > >
> > > > The comment below applies to select_idle_cpu but we want same behavior
> > > > for both function
> > > > If test_idle_cores is true we will not look for it whereas we don't
> > > > care about nr value when test_idle_core is true in the
> > > > for_each_cpu_wrap loop
> > >
> > > Ack but the initial "nr" based bailout in select_idle_cpu() applies even
> > > when test_idle_cores() is true too right?
> >
> > Yes there is an early bail out for nr == 1 even with idle core.
>
> I did some tests changing the early bailout condition as
> !prefers_idle_core && nr == 1.
>
> On Vera I see small performance regressions with this (4-5% slowdown),
> especially when approaching system saturation, which makes sense I think.
>
> At this point I'd personally stick with the early bailout condition, considering
> that it's still consistent with select_idle_cpu() and it seems to provide
> slightly better results. What do you think?
Yes, that seems like the best option for now.
>
> >
> > >
> > > >
> > > >
> > > >> + }
> > > >> +
> > > >> for_each_cpu_wrap(cpu, cpus, target) {
> > > >> bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > > >> unsigned long cpu_cap = capacity_of(cpu);
> > > >>
> > > >> + /* We have found a good enough target. Just use it. */
> > > >> + if (--nr <= 0 && best_fits == -4)
> > > >> + return best_cpu;
> > > >
> > > > In select_idle_cpu(), we return immediatly when nr == 0 and
> > > > test_idle_cores is false but we loop on all cpus if test_idle_cores is
> > > > true until we found an idle core. In the case of
> > > > select_idle_capacity(), I agree that util_fits_cpu() add another level
> > > > but shouldn't we continue to loop even if we found a best_fits == -4
> > >
> > > I see what you mean! We can additionally guard the bailout on
> > > "!prefers_idle_core".
> > >
> > > For the case where test_idle_cores() returns false, is stopping for a
> > > UCLAMP_MIN restricted CPU alright if SIS_UTIL bailout suggests it might
> > > not be fruitful to search further?
> > >
> > > Do you think it might trigger too many misfit balancing?
> >
> > For SMT, I don't think this will make a real diff but for !SMT this
> > would indeed trigger more misfits.
>
> You mean the SIS_UTIL mechanism in general (not the extra !prefers_idle_core
> guard), right? Because in the !SMT case, prefers_idle_core is always false.
Yes, this was about the SIS_UTIL mechanism in general and the early bailout for nr == 1.
>
> Thanks,
> -Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2026-04-27 17:27 UTC | newest]
Thread overview: 23+ messages
2026-04-23 7:36 [PATCH v3 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-23 7:36 ` [PATCH 1/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
2026-04-24 5:14 ` K Prateek Nayak
2026-04-24 8:46 ` Andrea Righi
2026-04-24 11:18 ` K Prateek Nayak
2026-04-24 23:29 ` Andrea Righi
2026-04-23 7:36 ` [PATCH 2/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-04-24 5:42 ` K Prateek Nayak
2026-04-23 7:36 ` [PATCH 3/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-04-24 5:37 ` K Prateek Nayak
2026-04-24 9:21 ` Andrea Righi
2026-04-23 7:36 ` [PATCH 4/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
2026-04-24 5:55 ` K Prateek Nayak
2026-04-24 12:32 ` Vincent Guittot
2026-04-24 17:13 ` Andrea Righi
2026-04-27 5:13 ` K Prateek Nayak
2026-04-27 8:35 ` Vincent Guittot
2026-04-27 16:01 ` Andrea Righi
2026-04-27 17:26 ` Vincent Guittot
2026-04-23 7:36 ` [PATCH 5/5] sched/fair: Make asym CPU capacity idle rank values self-documenting Andrea Righi
2026-04-24 4:29 ` K Prateek Nayak
2026-04-24 5:19 ` Andrea Righi
2026-04-24 12:34 ` Vincent Guittot