* [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
2026-05-11 13:04 ` Vincent Guittot
2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.
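For reference, a minimal sketch of the calling context this relies on
(simplified, only the relevant frames are shown):
  /*
   * sched_tick()                          <- IRQs disabled
   *   sched_balance_trigger()
   *     nohz_balancer_kick()
   *       sd  = rcu_dereference_all(rq->sd);
   *       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
   *
   * The IRQ-disabled section already acts as an RCU read-side critical
   * section as far as the sched-domain update side is concerned, so the
   * explicit rcu_read_lock()/rcu_read_unlock() pair adds no protection.
   */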
The same applies to set_cpu_sd_state_idle(), called from the idle entry
path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
teardown, which runs under cpus_write_lock(), so it cannot race with
sched-domain rebuilds). In both cases the rcu_dereference_all()
validation is sufficient.
No functional change intended.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 38 +++++++++++---------------------------
1 file changed, 11 insertions(+), 27 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..6b059ee80b631 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
goto out;
}
- rcu_read_lock();
-
sd = rcu_dereference_all(rq->sd);
if (sd) {
/*
@@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
* capacity, kick the ILB to see if there's a better CPU to run on:
*/
if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
- flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
- goto unlock;
+ flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+ goto out;
}
}
@@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
*/
for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
if (sched_asym(sd, i, cpu)) {
- flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
- goto unlock;
+ flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+ goto out;
}
}
}
@@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
* When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
* to run the misfit task on.
*/
- if (check_misfit_status(rq)) {
- flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
- goto unlock;
- }
+ if (check_misfit_status(rq))
+ flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
/*
* For asymmetric systems, we do not want to nicely balance
@@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
*
* Skip the LLC logic because it's not relevant in that case.
*/
- goto unlock;
+ goto out;
}
sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
@@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
* like this LLC domain has tasks we could move.
*/
nr_busy = atomic_read(&sds->nr_busy_cpus);
- if (nr_busy > 1) {
- flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
- goto unlock;
- }
+ if (nr_busy > 1)
+ flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
}
-unlock:
- rcu_read_unlock();
out:
if (READ_ONCE(nohz.needs_update))
flags |= NOHZ_NEXT_KICK;
@@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
static void set_cpu_sd_state_busy(int cpu)
{
struct sched_domain *sd;
-
- rcu_read_lock();
sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
if (!sd || !sd->nohz_idle)
- goto unlock;
+ return;
sd->nohz_idle = 0;
atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
- rcu_read_unlock();
}
void nohz_balance_exit_idle(struct rq *rq)
@@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
static void set_cpu_sd_state_idle(int cpu)
{
struct sched_domain *sd;
-
- rcu_read_lock();
sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
if (!sd || sd->nohz_idle)
- goto unlock;
+ return;
sd->nohz_idle = 1;
atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
- rcu_read_unlock();
}
/*
--
2.54.0
* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-05-11 13:04 ` Vincent Guittot
0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:04 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
>
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
>
> No functional change intended.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 38 +++++++++++---------------------------
> 1 file changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..6b059ee80b631 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
> goto out;
> }
>
> - rcu_read_lock();
> -
> sd = rcu_dereference_all(rq->sd);
> if (sd) {
> /*
> @@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
> * capacity, kick the ILB to see if there's a better CPU to run on:
> */
> if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
> - flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> - goto unlock;
> + flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> + goto out;
> }
> }
>
> @@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
> */
> for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
> if (sched_asym(sd, i, cpu)) {
> - flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> - goto unlock;
> + flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> + goto out;
> }
> }
> }
> @@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
> * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
> * to run the misfit task on.
> */
> - if (check_misfit_status(rq)) {
> - flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> - goto unlock;
> - }
> + if (check_misfit_status(rq))
> + flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>
> /*
> * For asymmetric systems, we do not want to nicely balance
> @@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
> *
> * Skip the LLC logic because it's not relevant in that case.
> */
> - goto unlock;
> + goto out;
> }
>
> sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> @@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
> * like this LLC domain has tasks we could move.
> */
> nr_busy = atomic_read(&sds->nr_busy_cpus);
> - if (nr_busy > 1) {
> - flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> - goto unlock;
> - }
> + if (nr_busy > 1)
> + flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> }
> -unlock:
> - rcu_read_unlock();
> out:
> if (READ_ONCE(nohz.needs_update))
> flags |= NOHZ_NEXT_KICK;
> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
> static void set_cpu_sd_state_busy(int cpu)
> {
> struct sched_domain *sd;
> -
> - rcu_read_lock();
> sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> if (!sd || !sd->nohz_idle)
> - goto unlock;
> + return;
> sd->nohz_idle = 0;
>
> atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> - rcu_read_unlock();
> }
>
> void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
> static void set_cpu_sd_state_idle(int cpu)
> {
> struct sched_domain *sd;
> -
> - rcu_read_lock();
> sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> if (!sd || sd->nohz_idle)
> - goto unlock;
> + return;
> sd->nohz_idle = 1;
>
> atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> - rcu_read_unlock();
> }
>
> /*
> --
> 2.54.0
>
* [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
2026-05-11 13:04 ` Vincent Guittot
2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
From: K Prateek Nayak <kprateek.nayak@amd.com>
On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.
The has_idle_cores hint however lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain; nr_busy_cpus also lives
in the same shared sched_domain data, but it's never used in the asym
CPU capacity scenario.
Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.
Fall back to attaching the shared object to sd_llc in three cases:
1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
2) CPUs in an exclusive cpuset that carves out a symmetric capacity
island: has_asym is system-wide but those CPUs have no
SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
the symmetric LLC path in select_idle_sibling();
3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
SD_NUMA-built domain. init_sched_domain_shared() keys the shared
blob off cpumask_first(span), which on overlapping NUMA domains
would alias unrelated spans onto the same blob. Keep the shared
object on the LLC there; select_idle_capacity() gracefully skips
the has_idle_cores preference when sd->shared is NULL.
While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.
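Summarizing, the per-CPU attach decision made at domain build time becomes
(rough sketch of claim_asym_sched_domain_shared() below, details omitted):
  /*
   * sd_asym = innermost SD_ASYM_CPUCAPACITY_FULL ancestor of the CPU's
   *           base domain (may be NULL, see case 2 above);
   *
   * if (sd_asym && !(sd_asym->flags & SD_NUMA))
   *         sd->shared is attached to sd_asym;      // asym path claims it
   * else
   *         sd->shared stays on the topmost SD_SHARE_LLC domain
   *         (fallback cases 1-3 above);
   */
Either way, sd_balance_shared ends up pointing at whichever domain claimed
the shared object.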
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 19 ++++++---
kernel/sched/sched.h | 2 +-
kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
3 files changed, 95 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b059ee80b631..960a1a9696b98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7819,7 +7819,7 @@ static inline void set_idle_cores(int cpu, int val)
{
struct sched_domain_shared *sds;
- sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+ sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
if (sds)
WRITE_ONCE(sds->has_idle_cores, val);
}
@@ -7828,7 +7828,7 @@ static inline bool test_idle_cores(int cpu)
{
struct sched_domain_shared *sds;
- sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+ sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
if (sds)
return READ_ONCE(sds->has_idle_cores);
@@ -7837,7 +7837,7 @@ static inline bool test_idle_cores(int cpu)
/*
* Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_balance_shared->has_idle_cores.
*
* Since SMT siblings share all cache levels, inspecting this limited remote
* state should be fairly cheap.
@@ -7954,7 +7954,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- if (sched_feat(SIS_UTIL)) {
+ if (sched_feat(SIS_UTIL) && sd->shared) {
/*
* Increment because !--nr is the condition to stop scan.
*
@@ -12834,7 +12834,7 @@ static void nohz_balancer_kick(struct rq *rq)
goto out;
}
- sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+ sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
if (sds) {
/*
* If there is an imbalance between LLC domains (IOW we could
@@ -12862,7 +12862,11 @@ static void set_cpu_sd_state_busy(int cpu)
struct sched_domain *sd;
sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
- if (!sd || !sd->nohz_idle)
+ /*
+ * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
+ * domain has no shared object there is nothing to clear or account.
+ */
+ if (!sd || !sd->shared || !sd->nohz_idle)
return;
sd->nohz_idle = 0;
@@ -12887,7 +12891,8 @@ static void set_cpu_sd_state_idle(int cpu)
struct sched_domain *sd;
sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
- if (!sd || sd->nohz_idle)
+ /* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
+ if (!sd || !sd->shared || sd->nohz_idle)
return;
sd->nohz_idle = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..330f5893c4561 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(int, sd_share_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d552..9bc4d11dd6a98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
int id = cpu;
int size = 1;
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+ /*
+ * The shared object is attached to sd_asym_cpucapacity only when the
+ * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+ * On overlapping (NUMA) asym domains we fall back to letting the
+ * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+ * here.
+ */
+ if (sd && sd->shared)
+ sds = sd->shared;
+
+ rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
sd = highest_flag_domain(cpu, SD_SHARE_LLC);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
- /* If sd_llc exists, sd_llc_shared should exist too. */
- WARN_ON_ONCE(!sd->shared);
- sds = sd->shared;
+ /*
+ * If sd_asym_cpucapacity didn't claim the shared object,
+ * sd_llc must have one linked.
+ */
+ if (!sds) {
+ WARN_ON_ONCE(!sd->shared);
+ sds = sd->shared;
+ }
}
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;
per_cpu(sd_llc_id, cpu) = id;
- rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+ rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
sd = lowest_flag_domain(cpu, SD_CLUSTER);
if (sd)
@@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
- sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
- rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
/*
@@ -2650,6 +2665,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
}
}
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d->sds, sd_id);
+ /*
+ * nr_busy_cpus is consumed only by the NOHZ kick path via
+ * sd_balance_shared; on the asym-capacity path it is initialized but
+ * never read.
+ */
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+}
+
+/*
+ * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
+ * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
+ * not an overlapping NUMA-built domain (then LLC should claim shared).
+ *
+ * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
+ * then LLC must claim shared instead.
+ *
+ * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
+ * are present in the domain span, so the asym domain we attach to cannot
+ * degenerate into a single-capacity group. The relevant edge cases are instead
+ * covered by the caveats above.
+ *
+ * Return true if this CPU's asym path claimed sd->shared, false otherwise.
+ */
+static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
+{
+ struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
+ struct sched_domain *sd_asym;
+
+ if (!sd)
+ return false;
+
+ sd_asym = sd;
+ while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+ sd_asym = sd_asym->parent;
+
+ if (!sd_asym || (sd_asym->flags & SD_NUMA))
+ return false;
+
+ init_sched_domain_shared(d, sd_asym);
+ return true;
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2708,20 +2771,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
for_each_cpu(i, cpu_map) {
+ bool asym_claimed = false;
+
sd = *per_cpu_ptr(d.sd, i);
if (!sd)
continue;
+ if (has_asym)
+ asym_claimed = claim_asym_sched_domain_shared(&d, i);
+
/* First, find the topmost SD_SHARE_LLC domain */
while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
sd = sd->parent;
if (sd->flags & SD_SHARE_LLC) {
- int sd_id = cpumask_first(sched_domain_span(sd));
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
+ /*
+ * Initialize the sd->shared for SD_SHARE_LLC unless
+ * the asym path above already claimed it.
+ */
+ if (!asym_claimed)
+ init_sched_domain_shared(&d, sd);
/*
* In presence of higher domains, adjust the
--
2.54.0
* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-05-11 13:04 ` Vincent Guittot
0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:04 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
>
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
>
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
>
> Fall back to attaching the shared object to sd_llc in three cases:
>
> 1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
>
> 2) CPUs in an exclusive cpuset that carves out a symmetric capacity
> island: has_asym is system-wide but those CPUs have no
> SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
> the symmetric LLC path in select_idle_sibling();
>
> 3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
> SD_NUMA-built domain. init_sched_domain_shared() keys the shared
> blob off cpumask_first(span), which on overlapping NUMA domains
> would alias unrelated spans onto the same blob. Keep the shared
> object on the LLC there; select_idle_capacity() gracefully skips
> the has_idle_cores preference when sd->shared is NULL.
>
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 19 ++++++---
> kernel/sched/sched.h | 2 +-
> kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
> 3 files changed, 95 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b059ee80b631..960a1a9696b98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7819,7 +7819,7 @@ static inline void set_idle_cores(int cpu, int val)
> {
> struct sched_domain_shared *sds;
>
> - sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> + sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
> if (sds)
> WRITE_ONCE(sds->has_idle_cores, val);
> }
> @@ -7828,7 +7828,7 @@ static inline bool test_idle_cores(int cpu)
> {
> struct sched_domain_shared *sds;
>
> - sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> + sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
> if (sds)
> return READ_ONCE(sds->has_idle_cores);
>
> @@ -7837,7 +7837,7 @@ static inline bool test_idle_cores(int cpu)
>
> /*
> * Scans the local SMT mask to see if the entire core is idle, and records this
> - * information in sd_llc_shared->has_idle_cores.
> + * information in sd_balance_shared->has_idle_cores.
> *
> * Since SMT siblings share all cache levels, inspecting this limited remote
> * state should be fairly cheap.
> @@ -7954,7 +7954,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> - if (sched_feat(SIS_UTIL)) {
> + if (sched_feat(SIS_UTIL) && sd->shared) {
> /*
> * Increment because !--nr is the condition to stop scan.
> *
> @@ -12834,7 +12834,7 @@ static void nohz_balancer_kick(struct rq *rq)
> goto out;
> }
>
> - sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> + sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
> if (sds) {
> /*
> * If there is an imbalance between LLC domains (IOW we could
> @@ -12862,7 +12862,11 @@ static void set_cpu_sd_state_busy(int cpu)
> struct sched_domain *sd;
> sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> - if (!sd || !sd->nohz_idle)
> + /*
> + * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
> + * domain has no shared object there is nothing to clear or account.
> + */
> + if (!sd || !sd->shared || !sd->nohz_idle)
> return;
> sd->nohz_idle = 0;
>
> @@ -12887,7 +12891,8 @@ static void set_cpu_sd_state_idle(int cpu)
> struct sched_domain *sd;
> sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> - if (!sd || sd->nohz_idle)
> + /* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
> + if (!sd || !sd->shared || sd->nohz_idle)
> return;
> sd->nohz_idle = 1;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d1..330f5893c4561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> DECLARE_PER_CPU(int, sd_share_id);
> -DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d552..9bc4d11dd6a98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(int, sd_share_id);
> -DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
> int id = cpu;
> int size = 1;
>
> + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> + /*
> + * The shared object is attached to sd_asym_cpucapacity only when the
> + * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> + * On overlapping (NUMA) asym domains we fall back to letting the
> + * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> + * here.
> + */
> + if (sd && sd->shared)
> + sds = sd->shared;
> +
> + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
> sd = highest_flag_domain(cpu, SD_SHARE_LLC);
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
>
> - /* If sd_llc exists, sd_llc_shared should exist too. */
> - WARN_ON_ONCE(!sd->shared);
> - sds = sd->shared;
> + /*
> + * If sd_asym_cpucapacity didn't claim the shared object,
> + * sd_llc must have one linked.
> + */
> + if (!sds) {
> + WARN_ON_ONCE(!sd->shared);
> + sds = sd->shared;
> + }
> }
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> per_cpu(sd_llc_id, cpu) = id;
> - rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> + rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> if (sd)
> @@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
>
> sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> - rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> }
>
> /*
> @@ -2650,6 +2665,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> }
> }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> + int sd_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d->sds, sd_id);
> + /*
> + * nr_busy_cpus is consumed only by the NOHZ kick path via
> + * sd_balance_shared; on the asym-capacity path it is initialized but
> + * never read.
> + */
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +}
> +
> +/*
> + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
> + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
> + * not an overlapping NUMA-built domain (then LLC should claim shared).
> + *
> + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
> + * then LLC must claim shared instead.
> + *
> + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
> + * are present in the domain span, so the asym domain we attach to cannot
> + * degenerate into a single-capacity group. The relevant edge cases are instead
> + * covered by the caveats above.
> + *
> + * Return true if this CPU's asym path claimed sd->shared, false otherwise.
> + */
> +static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
> +{
> + struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
> + struct sched_domain *sd_asym;
> +
> + if (!sd)
> + return false;
> +
> + sd_asym = sd;
> + while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> + sd_asym = sd_asym->parent;
> +
> + if (!sd_asym || (sd_asym->flags & SD_NUMA))
> + return false;
> +
> + init_sched_domain_shared(d, sd_asym);
> + return true;
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2708,20 +2771,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> }
>
> for_each_cpu(i, cpu_map) {
> + bool asym_claimed = false;
> +
> sd = *per_cpu_ptr(d.sd, i);
> if (!sd)
> continue;
>
> + if (has_asym)
> + asym_claimed = claim_asym_sched_domain_shared(&d, i);
> +
> /* First, find the topmost SD_SHARE_LLC domain */
> while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> sd = sd->parent;
>
> if (sd->flags & SD_SHARE_LLC) {
> - int sd_id = cpumask_first(sched_domain_span(sd));
> -
> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> - atomic_inc(&sd->shared->ref);
> + /*
> + * Initialize the sd->shared for SD_SHARE_LLC unless
> + * the asym path above already claimed it.
> + */
> + if (!asym_claimed)
> + init_sched_domain_shared(&d, sd);
>
> /*
> * In presence of higher domains, adjust the
> --
> 2.54.0
>
* [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
2026-05-11 13:07 ` Vincent Guittot
2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
4 siblings, 1 reply; 13+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.
Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.
Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when they are available leads to more accurate
capacity usage on task wakeup.
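The resulting preference order on ASYM+SMT systems (sketch; the exact
ranking is documented next to select_idle_capacity() in the patch below):
  /*
   * 1. idle CPU on a fully-idle core, capacity fits util + uclamp
   *      -> returned immediately
   * 2. idle CPU on a fully-idle core, uclamp_min or capacity misfit
   * 3. idle CPU with a busy SMT sibling, capacity fits
   * 4. idle CPU with a busy SMT sibling, uclamp_min or capacity misfit
   *
   * Within the same rank, the CPU with the highest capacity wins.
   */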
On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
SMT-aware idle selection has been shown to improve throughput by around
15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
CPU-bound workloads (NVBLAS) running a number of tasks equal to the
number of SMT cores.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 119 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 113 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 960a1a9696b98..6f0835c15ee11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return idle_cpu;
}
+/*
+ * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
+ * where lower values indicate a better fit - see select_idle_capacity().
+ *
+ * A CPU that both fits the task and sits on a fully-idle SMT core is returned
+ * immediately and is never assigned one of these ranks. On !SMT every CPU is
+ * its own "core", so the early return covers all fits-and-idle cases and the
+ * core-tier ranks below become unreachable.
+ *
+ * Rank Val Tier Meaning
+ * ------------------------------ --- ------ ---------------------------
+ * ASYM_IDLE_CORE_UCLAMP_MISFIT -4 core Idle core; capacity fits
+ * util but uclamp_min misses.
+ * ASYM_IDLE_CORE_COMPLETE_MISFIT -3 core Idle core; capacity does
+ * not fit. Still beats every
+ * thread-tier rank: a busy
+ * sibling cuts effective
+ * capacity more than a
+ * misfit hurts a quiet core.
+ * ASYM_IDLE_THREAD_FITS -2 thread Busy SMT sibling; capacity
+ * fits util + uclamp.
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT -1 thread Busy SMT sibling; capacity
+ * fits but uclamp_min misses
+ * (native util_fits_cpu()
+ * return value).
+ * ASYM_IDLE_COMPLETE_MISFIT 0 thread Busy SMT sibling; capacity
+ * does not fit.
+ *
+ * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
+ * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
+ *
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
+ * ASYM_IDLE_COMPLETE_MISFIT (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
+ *
+ * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
+ * candidate early-returns from select_idle_capacity().
+ */
+enum asym_fits_state {
+ ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
+ ASYM_IDLE_CORE_COMPLETE_MISFIT,
+ ASYM_IDLE_THREAD_FITS,
+ ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+ ASYM_IDLE_COMPLETE_MISFIT,
+
+ /* util_fits_cpu() bias for idle core */
+ ASYM_IDLE_CORE_BIAS = -3,
+};
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ /*
+ * On !SMT systems, has_idle_core is always false and preferred_core
+ * is always true (CPU == core), so the SMT preference logic below
+ * collapses to the plain capacity scan.
+ */
+ bool has_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
int cpu, best_cpu = -1;
struct cpumask *cpus;
@@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !has_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -8046,8 +8101,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
- /* This CPU fits with all requirements */
- if (fits > 0)
+ /*
+ * Perfect fit: capacity satisfies util + uclamp and the CPU
+ * sits on a fully-idle SMT core (or this is a !SMT system).
+ * Short-circuit the rank-based selection and return
+ * immediately.
+ */
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8055,9 +8115,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * fits > 0 implies we are not on a preferred core, but the util
+ * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+ * so the effective range becomes
+ * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
+ * ASYM_IDLE_COMPLETE_MISFIT - does not fit
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+ * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+ */
+ else if (fits > 0)
+ fits = ASYM_IDLE_THREAD_FITS;
/*
- * First, select CPU which fits better (-1 being better than 0).
+ * If we are on a preferred core, translate the range of fits
+ * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
+ * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
+ * This ensures that an idle core is always given priority over
+ * (partially) busy core.
+ *
+ * A fully fitting idle core would have returned early and hence
+ * fits > 0 for preferred_core need not be dealt with.
+ */
+ if (preferred_core)
+ fits += ASYM_IDLE_CORE_BIAS;
+
+ /*
+ * First, select CPU which fits better (lower is more preferred).
* Then, select the one with best capacity at same level.
*/
if ((fits < best_fits) ||
@@ -8068,6 +8152,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
}
+ /*
+ * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
+ * range means the chosen CPU is in a fully idle SMT core. Values above
+ * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
+ *
+ * The asym-capacity wakeup path returns from select_idle_sibling()
+ * after this function and never runs select_idle_cpu(), so the usual
+ * select_idle_cpu() tail that clears idle cores must live here when the
+ * idle-core preference did not win.
+ */
+ if (has_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
+ set_idle_cores(target, false);
+
return best_cpu;
}
@@ -8076,12 +8173,22 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
+ *
+ * Note: gating on is_core_idle() also makes the early-bailout
+ * candidates in select_idle_sibling() (target, prev,
+ * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
+ * NO_ASYM path does not do.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
--
2.54.0
* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-05-11 13:07 ` Vincent Guittot
2026-05-11 13:45 ` Andrea Righi
2026-05-11 14:25 ` [PATCH v2 " Andrea Righi
0 siblings, 2 replies; 13+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:07 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
> SMT-aware idle selection has been shown to improve throughput by around
> 15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
> CPU-bound workloads (NVBLAS) running an amount of tasks equal to the
> amount of SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
I still have comments about the description and naming below, but
overall the patch looks good to me.
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 119 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 113 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 960a1a9696b98..6f0835c15ee11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +/*
> + * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
> + * where lower values indicate a better fit - see select_idle_capacity().
> + *
> + * A CPU that both fits the task and sits on a fully-idle SMT core is returned
> + * immediately and is never assigned one of these ranks. On !SMT every CPU is
> + * its own "core", so the early return covers all fits-and-idle cases and the
> + * core-tier ranks below become unreachable.
> + *
> + * Rank Val Tier Meaning
> + * ------------------------------ --- ------ ---------------------------
> + * ASYM_IDLE_CORE_UCLAMP_MISFIT -4 core Idle core; capacity fits
> + * util but uclamp_min misses.
> + * ASYM_IDLE_CORE_COMPLETE_MISFIT -3 core Idle core; capacity does
> + * not fit. Still beats every
> + * thread-tier rank: a busy
> + * sibling cuts effective
> + * capacity more than a
> + * misfit hurts a quiet core.
> + * ASYM_IDLE_THREAD_FITS -2 thread Busy SMT sibling; capacity
> + * fits util + uclamp.
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT -1 thread Busy SMT sibling; capacity
> + * fits but uclamp_min misses
> + * (native util_fits_cpu()
> + * return value).
> + * ASYM_IDLE_COMPLETE_MISFIT 0 thread Busy SMT sibling; capacity
> + * does not fit.
> + *
> + * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
> + * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
> + *
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
> + * ASYM_IDLE_COMPLETE_MISFIT (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
> + *
> + * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
> + * candidate early-returns from select_idle_capacity().
> + */
> +enum asym_fits_state {
> + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
ASYM_IDLE_UCLAMP_MISFIT
See why in comments for select_idle_capacity()
> + ASYM_IDLE_CORE_COMPLETE_MISFIT,
ASYM_IDLE_COMPLETE_MISFIT,
> + ASYM_IDLE_THREAD_FITS,
> + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> + ASYM_IDLE_COMPLETE_MISFIT,
ASYM_IDLE_THREAD_MISFIT,
> +
> + /* util_fits_cpu() bias for idle core */
> + ASYM_IDLE_CORE_BIAS = -3,
> +};
> +
> /*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + /*
> + * On !SMT systems, has_idle_core is always false and preferred_core
> + * is always true (CPU == core), so the SMT preference logic below
> + * collapses to the plain capacity scan.
> + */
> + bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> - int fits, best_fits = 0;
> + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
>
> @@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !has_idle_core || is_core_idle(cpu);
If sched_smt_active() is true and test_idle_cores(target) is false
(meaning we have SMT but no idle core), then has_idle_core is false
and preferred_core is true. We will return immediately if
util_fits_cpu() says the task fits, and we will use the ASYM_IDLE_CORE_*
values otherwise. So I think that we should remove the "CORE_" in the
naming.
ASYM_IDLE_THREAD_* values are only used when we are promised to find
an idle core with SMT.
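For reference, the combinations (derived from the hunk above):
  sched_smt_active()  test_idle_cores(target)  has_idle_core  preferred_core
  false               n/a                      false          true
  true                false                    false          true
  true                true                     true           is_core_idle(cpu)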
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -8046,8 +8101,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> - /* This CPU fits with all requirements */
> - if (fits > 0)
> + /*
> + * Perfect fit: capacity satisfies util + uclamp and the CPU
> + * sits on a fully-idle SMT core (or this is a !SMT system).
Or there is no idle core to find.
> + * Short-circuit the rank-based selection and return
> + * immediately.
> + */
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -8055,9 +8115,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core, but the util
> + * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> + * so the effective range becomes
> + * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> + * ASYM_IDLE_COMPLETE_MISFIT - does not fit
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> + * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = ASYM_IDLE_THREAD_FITS;
>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * If we are on a preferred core, translate the range of fits
> + * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> + * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> + * This ensures that an idle core is always given priority over
> + * (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits += ASYM_IDLE_CORE_BIAS;
> +
> + /*
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
> @@ -8068,6 +8152,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> }
> }
>
> + /*
> + * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
> + * range means the chosen CPU is in a fully idle SMT core. Values above
> + * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
> + *
> + * The asym-capacity wakeup path returns from select_idle_sibling()
> + * after this function and never runs select_idle_cpu(), so the usual
> + * select_idle_cpu() tail that clears idle cores must live here when the
> + * idle-core preference did not win.
> + */
> + if (has_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
> + set_idle_cores(target, false);
> +
> return best_cpu;
> }
>
> @@ -8076,12 +8173,22 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> + *
> + * Note: gating on is_core_idle() also makes the early-bailout
> + * candidates in select_idle_sibling() (target, prev,
> + * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
> + * NO_ASYM path does not do.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> --
> 2.54.0
>
* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-05-11 13:07 ` Vincent Guittot
@ 2026-05-11 13:45 ` Andrea Righi
2026-05-11 14:25 ` [PATCH v2 " Andrea Righi
1 sibling, 0 replies; 13+ messages in thread
From: Andrea Righi @ 2026-05-11 13:45 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
Hi Vincent,
On Mon, May 11, 2026 at 03:07:50PM +0200, Vincent Guittot wrote:
> On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
...
> > +/*
> > + * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
> > + * where lower values indicate a better fit - see select_idle_capacity().
> > + *
> > + * A CPU that both fits the task and sits on a fully-idle SMT core is returned
> > + * immediately and is never assigned one of these ranks. On !SMT every CPU is
> > + * its own "core", so the early return covers all fits-and-idle cases and the
> > + * core-tier ranks below become unreachable.
> > + *
> > + * Rank Val Tier Meaning
> > + * ------------------------------ --- ------ ---------------------------
> > + * ASYM_IDLE_CORE_UCLAMP_MISFIT -4 core Idle core; capacity fits
> > + * util but uclamp_min misses.
> > + * ASYM_IDLE_CORE_COMPLETE_MISFIT -3 core Idle core; capacity does
> > + * not fit. Still beats every
> > + * thread-tier rank: a busy
> > + * sibling cuts effective
> > + * capacity more than a
> > + * misfit hurts a quiet core.
> > + * ASYM_IDLE_THREAD_FITS -2 thread Busy SMT sibling; capacity
> > + * fits util + uclamp.
> > + * ASYM_IDLE_THREAD_UCLAMP_MISFIT -1 thread Busy SMT sibling; capacity
> > + * fits but uclamp_min misses
> > + * (native util_fits_cpu()
> > + * return value).
> > + * ASYM_IDLE_COMPLETE_MISFIT 0 thread Busy SMT sibling; capacity
> > + * does not fit.
> > + *
> > + * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
> > + * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
> > + *
> > + * ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
> > + * ASYM_IDLE_COMPLETE_MISFIT (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
> > + *
> > + * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
> > + * candidate early-returns from select_idle_capacity().
> > + */
> > +enum asym_fits_state {
> > + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
>
> ASYM_IDLE_UCLAMP_MISFIT
> See why in comments for select_idle_capacity()
>
> > + ASYM_IDLE_CORE_COMPLETE_MISFIT,
>
> ASYM_IDLE_COMPLETE_MISFIT,
>
> > + ASYM_IDLE_THREAD_FITS,
> > + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > + ASYM_IDLE_COMPLETE_MISFIT,
>
> ASYM_IDLE_THREAD_MISFIT,
>
> > +
> > + /* util_fits_cpu() bias for idle core */
> > + ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > + /*
> > + * On !SMT systems, has_idle_core is always false and preferred_core
> > + * is always true (CPU == core), so the SMT preference logic below
> > + * collapses to the plain capacity scan.
> > + */
> > + bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> > unsigned long task_util, util_min, util_max, best_cap = 0;
> > - int fits, best_fits = 0;
> > + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> > int cpu, best_cpu = -1;
> > struct cpumask *cpus;
> >
> > @@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > + bool preferred_core = !has_idle_core || is_core_idle(cpu);
>
> If sched_smt_active() is true and test_idle_cores(target) is false
> (meaning we have SMT but no idle core), then has_idle_core is false
> and preferred_core is true. We will returns immediatly if
> util_fits_cpu and we will use the ASYM_IDLE_CORE_* values otherwise.
> So I think that we should remove the "CORE_" in the naming
>
> ASYM_IDLE_THREAD_* values are only used when we are promised to find
> an idle core with SMT
Yes, I agree, the CORE_ prefix is just misleading: those ranks can also be
assigned when sched_smt_active() && !test_idle_cores(target). I'll send an
updated patch with your naming scheme.
Thanks,
-Andrea
* [PATCH v2 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-05-11 13:07 ` Vincent Guittot
2026-05-11 13:45 ` Andrea Righi
@ 2026-05-11 14:25 ` Andrea Righi
1 sibling, 0 replies; 13+ messages in thread
From: Andrea Righi @ 2026-05-11 14:25 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.
Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.
Prioritizing fully-idle SMT cores yields better task placement: the
effective capacity of partially-idle SMT cores is reduced, so always
preferring fully-idle cores when available leads to a more accurate use of
the available capacity at task wakeup.
On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
SMT-aware idle selection has been shown to improve throughput by around
15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
CPU-bound workloads (NVBLAS) running a number of tasks equal to the
number of SMT cores.
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Changes in v2:
- Drop the misleading "CORE_" prefix from ASYM_IDLE_* ranks (Vincent Guittot)
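As an illustration of the preference order the renamed ranks encode, here is
a small standalone snippet (made-up candidates, not part of the patch): an
idle core whose capacity does not fit still beats a fitting CPU whose SMT
sibling is busy.

#include <stdio.h>

/* Rank values as introduced below (lower means more preferred). */
enum asym_fits_state {
	ASYM_IDLE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_COMPLETE_MISFIT,		/* -3 */
	ASYM_IDLE_THREAD_FITS,			/* -2 */
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,		/* -1 */
	ASYM_IDLE_THREAD_MISFIT,		/*  0 */
	ASYM_IDLE_CORE_BIAS = -3,		/* offset applied on an idle core */
};

int main(void)
{
	/*
	 * Hypothetical candidates (not taken from any real topology):
	 *  - CPU A: capacity fits, but its SMT sibling is busy -> thread tier.
	 *  - CPU B: fully idle core, capacity does not fit     -> core tier.
	 */
	int rank_a = ASYM_IDLE_THREAD_FITS;				/* -2 */
	int rank_b = ASYM_IDLE_THREAD_MISFIT + ASYM_IDLE_CORE_BIAS;	/* -3 */

	printf("preferred: %s\n", rank_b < rank_a ? "CPU B (idle core)" : "CPU A");
	return 0;
}

Ties within a tier are then broken by capacity, as the selection loop in the
patch does with best_cap.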
kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 114 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 960a1a9696b98..ffe3af10e5602 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return idle_cpu;
}
+/*
+ * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
+ * where lower values indicate a better fit - see select_idle_capacity().
+ *
+ * A CPU that both fits the task and sits on a fully-idle SMT core is returned
+ * immediately and is never assigned one of these ranks. On !SMT every CPU is
+ * its own "core", so the early return covers all fits-and-idle cases and the
+ * core-tier ranks below become unreachable.
+ *
+ * Rank Val Tier Meaning
+ * ------------------------------ --- ------ ---------------------------
+ * ASYM_IDLE_UCLAMP_MISFIT -4 core Idle core; capacity fits
+ * util but uclamp_min misses.
+ * ASYM_IDLE_COMPLETE_MISFIT -3 core Idle core; capacity does
+ * not fit. Still beats every
+ * thread-tier rank: a busy
+ * sibling cuts effective
+ * capacity more than a
+ * misfit hurts a quiet core.
+ * ASYM_IDLE_THREAD_FITS -2 thread Busy SMT sibling; capacity
+ * fits util + uclamp.
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT -1 thread Busy SMT sibling; capacity
+ * fits but uclamp_min misses
+ * (native util_fits_cpu()
+ * return value).
+ * ASYM_IDLE_THREAD_MISFIT 0 thread Busy SMT sibling; capacity
+ * does not fit.
+ *
+ * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
+ * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
+ *
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> ASYM_IDLE_UCLAMP_MISFIT (-4)
+ * ASYM_IDLE_THREAD_MISFIT (0) + BIAS -> ASYM_IDLE_COMPLETE_MISFIT (-3)
+ *
+ * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
+ * candidate early-returns from select_idle_capacity().
+ */
+enum asym_fits_state {
+ ASYM_IDLE_UCLAMP_MISFIT = -4,
+ ASYM_IDLE_COMPLETE_MISFIT,
+ ASYM_IDLE_THREAD_FITS,
+ ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+ ASYM_IDLE_THREAD_MISFIT,
+
+ /* util_fits_cpu() bias for idle core */
+ ASYM_IDLE_CORE_BIAS = -3,
+};
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ /*
+ * On !SMT systems, has_idle_core is always false and preferred_core
+ * is always true (CPU == core), so the SMT preference logic below
+ * collapses to the plain capacity scan.
+ */
+ bool has_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
+ int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
int cpu, best_cpu = -1;
struct cpumask *cpus;
@@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !has_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -8046,8 +8101,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
- /* This CPU fits with all requirements */
- if (fits > 0)
+ /*
+ * Perfect fit: capacity satisfies util + uclamp and the CPU
+ * sits on a fully-idle SMT core, this is a !SMT system, or
+ * there is no idle core to find.
+ * Short-circuit the rank-based selection and return
+ * immediately.
+ */
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8055,9 +8116,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * fits > 0 implies we are not on a preferred core, but the util
+ * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+ * so the effective range becomes
+ * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_THREAD_MISFIT], where:
+ * ASYM_IDLE_THREAD_MISFIT - does not fit
+ * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+ * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+ */
+ else if (fits > 0)
+ fits = ASYM_IDLE_THREAD_FITS;
/*
- * First, select CPU which fits better (-1 being better than 0).
+ * If we are on a preferred core, translate the range of fits
+ * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_THREAD_MISFIT] to
+ * [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT].
+ * This ensures that an idle core is always given priority over
+ * a (partially) busy core.
+ *
+ * A fully fitting idle core would have returned early and hence
+ * fits > 0 for preferred_core need not be dealt with.
+ */
+ if (preferred_core)
+ fits += ASYM_IDLE_CORE_BIAS;
+
+ /*
+ * First, select CPU which fits better (lower is more preferred).
* Then, select the one with best capacity at same level.
*/
if ((fits < best_fits) ||
@@ -8068,6 +8153,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
}
+ /*
+ * A value in the [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT]
+ * range means the chosen CPU is in a fully idle SMT core. Values above
+ * ASYM_IDLE_COMPLETE_MISFIT mean we never ranked such a CPU best.
+ *
+ * The asym-capacity wakeup path returns from select_idle_sibling()
+ * after this function and never runs select_idle_cpu(), so the usual
+ * select_idle_cpu() tail that clears idle cores must live here when the
+ * idle-core preference did not win.
+ */
+ if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
+ set_idle_cores(target, false);
+
return best_cpu;
}
@@ -8076,12 +8174,22 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
+ *
+ * Note: gating on is_core_idle() also makes the early-bailout
+ * candidates in select_idle_sibling() (target, prev,
+ * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
+ * NO_ASYM path does not do.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (2 preceding siblings ...)
2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
2026-05-11 13:07 ` Vincent Guittot
2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
4 siblings, 1 reply; 13+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.
If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
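Purely for illustration (made-up capacities and an assumed SMT penalty,
neither of which the kernel models directly), the scenario the gate avoids
looks like this:

#include <stdio.h>
#include <stdbool.h>

/*
 * Hypothetical numbers only: the nominal capacities and the 50% penalty for
 * a busy sibling are made up for illustration; the kernel does not model the
 * penalty, which is why the patch gates on is_core_idle() instead.
 */
static bool worth_pulling(int src_cap, int dst_cap, bool dst_core_idle)
{
	int dst_effective = dst_core_idle ? dst_cap : dst_cap / 2; /* assumed penalty */

	return dst_effective > src_cap;
}

int main(void)
{
	/* Misfit running on a 768-capacity CPU; candidate dst has nominal 1024. */
	printf("%d\n", worth_pulling(768, 1024, true));		/* 1: real upgrade */
	printf("%d\n", worth_pulling(768, 1024, false));	/* 0: sibling busy, no upgrade */
	return 0;
}

Since the effective penalty cannot be read from capacity_of(), the change
below instead requires the destination core to be fully idle before trusting
the nominal capacity for a misfit pull.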
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f0835c15ee11..2ddba8bd27e59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9693,6 +9693,7 @@ struct lb_env {
int dst_cpu;
struct rq *dst_rq;
+ bool dst_core_idle;
struct cpumask *dst_grpmask;
int new_dst_cpu;
@@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* We can use max_capacity here as reduction in capacity on some
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
+ *
+ * When SMT is active, only pull a misfit to dst_cpu if it is on a
+ * fully idle core; otherwise the effective capacity of the core is
+ * reduced and we may not actually provide more capacity than the
+ * source.
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ (!env->dst_core_idle ||
+ !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
@@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
unsigned long sum_util = 0;
bool sg_overloaded = 0, sg_overutilized = 0;
+ env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
+
do {
struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-05-11 13:07 ` Vincent Guittot
0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:07 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
> capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
> busy: the core does not deliver its full nominal capacity.
>
> If SMT is active and dst_cpu is not on a fully idle core, skip this
> destination so we do not migrate a misfit expecting a capacity upgrade we
> cannot actually provide.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f0835c15ee11..2ddba8bd27e59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9693,6 +9693,7 @@ struct lb_env {
>
> int dst_cpu;
> struct rq *dst_rq;
> + bool dst_core_idle;
>
> struct cpumask *dst_grpmask;
> int new_dst_cpu;
> @@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * We can use max_capacity here as reduction in capacity on some
> * CPUs in the group should either be possible to resolve
> * internally or be covered by avg_load imbalance (eventually).
> + *
> + * When SMT is active, only pull a misfit to dst_cpu if it is on a
> + * fully idle core; otherwise the effective capacity of the core is
> + * reduced and we may not actually provide more capacity than the
> + * source.
> */
> if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> (sgs->group_type == group_misfit_task) &&
> - (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> + (!env->dst_core_idle ||
> + !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> @@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> unsigned long sum_util = 0;
> bool sg_overloaded = 0, sg_overutilized = 0;
>
> + env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
> +
> do {
> struct sg_lb_stats *sgs = &tmp_sgs;
> int local_group;
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (3 preceding siblings ...)
2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
2026-05-11 13:08 ` Vincent Guittot
4 siblings, 1 reply; 13+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
Shrikanth Hegde, linux-kernel
From: K Prateek Nayak <kprateek.nayak@amd.com>
Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the LLC domain has sched_domain_shared data, derive the
per-attempt scan limit from sd->shared->nr_idle_scan.
That bounds the walk on large LLCs: once nr_idle_scan is exhausted,
return the best CPU seen so far. The early exit is gated on
!has_idle_core so an active idle-core search (SMT with idle cores
reported by test_idle_cores()) isn't cut short before it gets a chance
to find one.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
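A standalone sketch of how the bound behaves (hypothetical scan loop, not
kernel code; the actual change below additionally skips the bound while an
idle-core search is in progress):

#include <stdio.h>

/*
 * Hypothetical model of the SIS_UTIL bound: at most nr_idle_scan candidates
 * are visited before giving up, and an overloaded domain (nr_idle_scan == 0)
 * is not scanned at all.
 */
static int bounded_scan(int nr_idle_scan, int ncpus)
{
	int nr = nr_idle_scan + 1;	/* mirrors READ_ONCE(...) + 1 below */
	int cpu, visited = 0;

	if (nr == 1)			/* overloaded: unlikely to find idle CPUs */
		return -1;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (--nr <= 0)
			break;
		visited++;		/* stand-in for the per-CPU fit checks */
	}
	return visited;
}

int main(void)
{
	printf("%d\n", bounded_scan(0, 8));	/* -1: overloaded, bail out */
	printf("%d\n", bounded_scan(4, 8));	/*  4: scan cut short */
	printf("%d\n", bounded_scan(16, 8));	/*  8: whole mask visited */
	return 0;
}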
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ddba8bd27e59..494149f14d98f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8084,6 +8084,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
int cpu, best_cpu = -1;
struct cpumask *cpus;
+ int nr = INT_MAX;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -8092,10 +8093,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_min = uclamp_eff_value(p, UCLAMP_MIN);
util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ if (sched_feat(SIS_UTIL) && sd->shared) {
+ /*
+ * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
+ * the scan when not preferring an idle core.
+ */
+ nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+ /* overloaded domain is unlikely to have idle cpu/core */
+ if (nr == 1)
+ return -1;
+ }
+
for_each_cpu_wrap(cpu, cpus, target) {
bool preferred_core = !has_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
+ /*
+ * Stop when the nr_idle_scan is exhausted (mirrors
+ * select_idle_cpu() logic).
+ */
+ if (!has_idle_core && --nr <= 0)
+ return best_cpu;
+
if (!choose_idle_cpu(cpu, p))
continue;
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-05-11 13:08 ` Vincent Guittot
0 siblings, 0 replies; 13+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:08 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
linux-kernel
On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> is enabled and the LLC domain has sched_domain_shared data, derive the
> per-attempt scan limit from sd->shared->nr_idle_scan.
>
> That bounds the walk on large LLCs: once nr_idle_scan is exhausted,
> return the best CPU seen so far. The early exit is gated on
> !has_idle_core so an active idle-core search (SMT with idle cores
> reported by test_idle_cores()) isn't cut short before it gets a chance
> to find one.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2ddba8bd27e59..494149f14d98f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8084,6 +8084,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
> + int nr = INT_MAX;
>
> cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -8092,10 +8093,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_min = uclamp_eff_value(p, UCLAMP_MIN);
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> + if (sched_feat(SIS_UTIL) && sd->shared) {
> + /*
> + * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> + * the scan when not preferring an idle core.
> + */
> + nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> + /* overloaded domain is unlikely to have idle cpu/core */
> + if (nr == 1)
> + return -1;
> + }
> +
> for_each_cpu_wrap(cpu, cpus, target) {
> bool preferred_core = !has_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> + /*
> + * Stop when the nr_idle_scan is exhausted (mirrors
> + * select_idle_cpu() logic).
> + */
> + if (!has_idle_core && --nr <= 0)
> + return best_cpu;
> +
> if (!choose_idle_cpu(cpu, p))
> continue;
>
> --
> 2.54.0
>
^ permalink raw reply [flat|nested] 13+ messages in thread