[PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity

* [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-05-09 18:07 Andrea Righi
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
                   ` (4 more replies)
  0 siblings, 5 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

[ Re-sending due to missing subject in the previous email. ]

This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness.

= Problem =

Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the siblings share the physical core's resources and
the idle thread cannot deliver its full nominal capacity. As a result, several
asym-capacity paths may pick idle high-capacity CPUs that are not actually good
destinations.

= Solution =

This patch set aligns those paths with a simple rule already used elsewhere:
when SMT is active, prefer fully idle cores and avoid treating partially idle
SMT siblings as full-capacity targets where that would mislead load balance.
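
The rule above can be illustrated with a minimal userspace sketch. It is not
code from this series: struct toy_cpu, its fields and toy_pick_idle() are
hypothetical names, and the kernel applies the rule through different data
structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy CPU: names and layout are illustrative, not kernel types. */
struct toy_cpu {
	bool idle;		/* this logical CPU is idle */
	bool sibling_idle;	/* its SMT sibling is idle as well */
	unsigned long cap;	/* nominal capacity */
};

/*
 * Return the index of the preferred idle CPU, or -1 if no CPU is idle.
 * A CPU on a fully idle core always beats a CPU with a busy sibling;
 * capacity only breaks ties within the same class.
 */
int toy_pick_idle(const struct toy_cpu *cpus, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (best < 0) {
			best = (int)i;
			continue;
		}

		bool cur_full = cpus[i].sibling_idle;
		bool best_full = cpus[best].sibling_idle;

		if (cur_full != best_full) {
			if (cur_full)
				best = (int)i;
			continue;
		}
		if (cpus[i].cap > cpus[best].cap)
			best = (int)i;
	}
	return best;
}
```

With an idle CPU of capacity 1024 whose sibling is busy and an idle CPU of
capacity 980 on a fully idle core, the sketch picks the latter even though its
nominal capacity is lower.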

Patch set summary:
 - Drop the redundant RCU read lock in the NOHZ kick path (preliminary
   cleanup).
 - Attach sched_domain_shared to sd_asym_cpucapacity on SD_ASYM_CPUCAPACITY
   systems, so that the has_idle_cores hint is used consistently in the wakeup
   idle scan, and rename sd_llc_shared -> sd_balance_shared.
 - Prefer fully-idle SMT cores in asym-capacity idle selection: in the wakeup
   fast path, extend select_idle_capacity() / asym_fits_cpu() so idle
   selection can prefer CPUs on fully idle cores.
 - Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
 - Add SIS_UTIL support to select_idle_capacity(), reusing the
   SIS_UTIL-controlled idle-scan mechanism already used by select_idle_cpu().
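
As a rough illustration of a budget-limited idle scan of the kind SIS_UTIL
controls, here is a userspace sketch. toy_scan_idle() and its parameters are
hypothetical; how the kernel derives the budget is not modeled, and only the
"!--nr" stop condition mirrors the comment in select_idle_cpu().

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the first idle CPU index, or -1 if none is found within the
 * scan budget. Hypothetical helper, not a kernel function.
 */
int toy_scan_idle(const bool *idle, size_t n, int budget)
{
	/* Incremented because !--nr is the condition to stop the scan. */
	int nr = budget + 1;

	for (size_t i = 0; i < n; i++) {
		if (!--nr)
			return -1;	/* budget exhausted */
		if (idle[i])
			return (int)i;
	}
	return -1;
}
```

With only the third CPU idle, a budget of 2 gives up before reaching it, while
a budget of 3 finds it.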

This patch set has been tested on the new NVIDIA Vera Rubin platform, where SMT
is enabled and the firmware exposes small frequency variations (+/-~5%) as
differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.

Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.

Alternative approaches have been evaluated, such as equalizing CPU capacities,
either by exposing uniform values via firmware or normalizing them in the kernel
by grouping CPUs within a small capacity window (+/-5%).
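
For illustration only, the capacity-window idea could look like the sketch
below. toy_cap_same_window() is a hypothetical helper and does not reflect how
the evaluated prototype was actually written; it only shows one way to treat
capacities within a few percent of each other as equal.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: treat two capacities as equal when they differ by
 * at most pct percent of the larger one.
 */
bool toy_cap_same_window(unsigned long a, unsigned long b, unsigned int pct)
{
	unsigned long hi = a > b ? a : b;
	unsigned long lo = a > b ? b : a;

	/* Integer form of (hi - lo) / hi <= pct / 100. */
	return (hi - lo) * 100 <= hi * pct;
}
```

With a 5% window, capacities 1024 and 980 fall into the same group, while 1024
and 900 do not.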

However, the SMT-aware SD_ASYM_CPUCAPACITY approach has shown better results so
far. Improving this policy also seems worthwhile in general, as future platforms
may enable SMT with asymmetric CPU topologies.

Performance results on Vera Rubin with SD_ASYM_CPUCAPACITY (mainline) vs
SD_ASYM_CPUCAPACITY + SMT:

 - NVBLAS benchblas (one task / SMT core):

 +---------------------------------+--------+
 | Configuration                   | gflops |
 +---------------------------------+--------+
 | ASYM (mainline) + SIS_UTIL      |  5478  |
 | ASYM (mainline) + NO_SIS_UTIL   |  5491  |
 |                                 |        |
 | NO ASYM + SIS_UTIL              |  8912  |
 | NO ASYM + NO_SIS_UTIL           |  8978  |
 |                                 |        |
 | ASYM + SMT + SIS_UTIL           |  9259  |
 | ASYM + SMT + NO_SIS_UTIL        |  9291  |
 +---------------------------------+--------+

 - DCPerf MediaWiki (all CPUs):

 +---------------------------------+--------+--------+--------+--------+
 | Configuration                   |   rps  |  p50   |  p95   |  p99   |
 +---------------------------------+--------+--------+--------+--------+
 | ASYM (mainline) + SIS_UTIL      |  7994  |  0.052 |  0.223 |  0.246 |
 | ASYM (mainline) + NO_SIS_UTIL   |  7993  |  0.052 |  0.221 |  0.245 |
 |                                 |        |        |        |        |
 | NO ASYM + SIS_UTIL              |  8113  |  0.067 |  0.184 |  0.225 |
 | NO ASYM + NO_SIS_UTIL           |  8093  |  0.068 |  0.184 |  0.223 |
 |                                 |        |        |        |        |
 | ASYM + SMT + SIS_UTIL           |  8129  |  0.076 |  0.149 |  0.188 |
 | ASYM + SMT + NO_SIS_UTIL        |  8138  |  0.076 |  0.148 |  0.186 |
 +---------------------------------+--------+--------+--------+--------+

In the MediaWiki case SMT awareness has less impact, because all CPUs are in
use for the majority of the run, but it still seems to reduce tail latency.
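
The relative differences implied by the tables can be recomputed with a small
helper. rel_change_pct() is a hypothetical name, and the inputs are only the
figures reported above.

```c
/* Percentage change of val relative to base (positive means val is larger). */
double rel_change_pct(double base, double val)
{
	return (val - base) / base * 100.0;
}
```

Using the SIS_UTIL rows, rel_change_pct(5478, 9259) is roughly +69% (benchblas,
ASYM + SMT vs ASYM mainline), rel_change_pct(8912, 9259) is roughly +3.9%
(benchblas, ASYM + SMT vs NO ASYM), and rel_change_pct(0.246, 0.188) is roughly
-23.6% (MediaWiki p99, ASYM + SMT vs ASYM mainline).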

Tests have also been conducted on NVIDIA Grace (which does not support SMT) to
ensure that SIS_UTIL support in select_idle_capacity() does not introduce
regressions; results show slight improvements under the same workloads.

See also:
 - https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
 - https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com

Changes in v6:
 - Simplify the SIS_UTIL early-exit in select_idle_capacity(): drop the
   best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT guard and exit once the scan
   budget is exhausted, matching select_idle_cpu() (Dietmar Eggemann)
 - Move the sd_llc_shared -> sd_balance_shared rename in nohz_balancer_kick()
   from the NOHZ RCU cleanup patch into the sd_asym_cpucapacity attach patch,
   where it logically belongs (Dietmar Eggemann)
 - Rename prefers_idle_core to has_idle_core in select_idle_capacity()
   (Dietmar Eggemann)
 - Use ASYM_IDLE_CORE_COMPLETE_MISFIT instead of ASYM_IDLE_CORE_BIAS in the
   select_idle_capacity() comments (Vincent Guittot)
 - Expand the asym_fits_state docstring with a per-rank table and an
   explanation of ASYM_IDLE_CORE_BIAS as an offset rather than a state
 - Small code comment adjustments based on previous reviews
 - Link to v5: https://lore.kernel.org/all/20260428144352.3575863-1-arighi@nvidia.com

Changes in v5:
 - Drop redundant RCU protection in nohz_balancer_kick() (Prateek Nayak)
 - Do not remove CPU capacity asymmetry / SMT warning (Prateek Nayak)
 - Link to v4: https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

Changes in v4:
 - Rename sd_llc_shared -> sd_balance_shared
 - Add preliminary cleanup patch to use guard(rcu)() for sched_domain RCU
   (Prateek Nayak)
 - Apply SIS_UTIL scan cap only with !prefers_idle_core, matching
   select_idle_cpu() / has_idle_core logic (Vincent Guittot)
 - Cache env->dst_cpu idle state to reduce is_core_idle() calls (Prateek Nayak)
 - Remove warning about CPU capacity asymmetry not supporting SMT
 - Link to v3: https://lore.kernel.org/all/20260423074135.380390-1-arighi@nvidia.com

Changes in v3:
 - Add SIS_UTIL support to select_idle_capacity() (K Prateek Nayak)
 - Attach sched_domain_shared to sd_asym_cpucapacity (K Prateek Nayak)
 - Add enum for the different fit state (K Prateek Nayak)
 - Update has_idle_cores hint (Vincent Guittot)
 - Link to v2: https://lore.kernel.org/all/20260403053654.1559142-1-arighi@nvidia.com

Changes in v2:
 - Rework SMT awareness logic in select_idle_capacity() (K Prateek Nayak)
 - Drop EAS and find_new_ilb() changes for now
 - Link to v1: https://lore.kernel.org/all/20260326151211.1862600-1-arighi@nvidia.com

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git sched-asym-smt-v6

Andrea Righi (3):
      sched/fair: Drop redundant RCU read lock in NOHZ kick path
      sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
      sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity

K Prateek Nayak (2):
      sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
      sched/fair: Add SIS_UTIL support to select_idle_capacity()

 kernel/sched/fair.c     | 206 ++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h    |   2 +-
 kernel/sched/topology.c |  95 +++++++++++++++++++---
 3 files changed, 248 insertions(+), 55 deletions(-)

* [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
                     ` (3 more replies)
  2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.

The same applies to set_cpu_sd_state_idle(), called from the idle entry
path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
teardown, which runs under cpus_write_lock(), so it cannot race with
sched-domain rebuilds). In both cases the rcu_dereference_all()
validation is sufficient.

No functional change intended.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..6b059ee80b631 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
 	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
@@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*
-- 
2.54.0


* [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
                     ` (2 more replies)
  2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

From: K Prateek Nayak <kprateek.nayak@amd.com>

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint, however, lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain. nr_busy_cpus lives in
the same shared sched_domain data, but it is never used in the asym
CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.
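
The attach-or-fall-back decision described above can be modeled in userspace as
follows. This is a sketch, not the patch itself: struct toy_domain, the TOY_SD_*
bit values and toy_asym_owner() are hypothetical stand-ins for the kernel's
sched_domain, SD_* flags and the helper added in topology.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real SD_* values differ. */
#define TOY_SD_ASYM_FULL	0x1
#define TOY_SD_NUMA		0x2
#define TOY_SD_SHARE_LLC	0x4

struct toy_domain {
	unsigned int flags;
	struct toy_domain *parent;
};

/*
 * Walk up from the base domain to the innermost ASYM_FULL ancestor.
 * Return it if it may own the shared object, or NULL when the LLC
 * domain must own it instead (no such ancestor, or NUMA-built).
 */
struct toy_domain *toy_asym_owner(struct toy_domain *sd)
{
	while (sd && !(sd->flags & TOY_SD_ASYM_FULL))
		sd = sd->parent;

	if (!sd || (sd->flags & TOY_SD_NUMA))
		return NULL;

	return sd;
}
```

A NULL result corresponds to the three fallback cases: a symmetric system or a
symmetric cpuset island has no matching ancestor, and a NUMA-built asym domain
is rejected explicitly.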

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c     | 19 ++++++---
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
 3 files changed, 95 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b059ee80b631..960a1a9696b98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7819,7 +7819,7 @@ static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		WRITE_ONCE(sds->has_idle_cores, val);
 }
@@ -7828,7 +7828,7 @@ static inline bool test_idle_cores(int cpu)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		return READ_ONCE(sds->has_idle_cores);
 
@@ -7837,7 +7837,7 @@ static inline bool test_idle_cores(int cpu)
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_balance_shared->has_idle_cores.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -7954,7 +7954,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && sd->shared) {
 		/*
 		 * Increment because !--nr is the condition to stop scan.
 		 *
@@ -12834,7 +12834,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds) {
 		/*
 		 * If there is an imbalance between LLC domains (IOW we could
@@ -12862,7 +12862,11 @@ static void set_cpu_sd_state_busy(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || !sd->nohz_idle)
+	/*
+	 * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
+	 * domain has no shared object there is nothing to clear or account.
+	 */
+	if (!sd || !sd->shared || !sd->nohz_idle)
 		return;
 	sd->nohz_idle = 0;
 
@@ -12887,7 +12891,8 @@ static void set_cpu_sd_state_idle(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || sd->nohz_idle)
+	/* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
+	if (!sd || !sd->shared || sd->nohz_idle)
 		return;
 	sd->nohz_idle = 1;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..330f5893c4561 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(int, sd_share_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d552..9bc4d11dd6a98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+	/*
+	 * The shared object is attached to sd_asym_cpucapacity only when the
+	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+	 * On overlapping (NUMA) asym domains we fall back to letting the
+	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+	 * here.
+	 */
+	if (sd && sd->shared)
+		sds = sd->shared;
+
+	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
 	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 
-		/* If sd_llc exists, sd_llc_shared should exist too. */
-		WARN_ON_ONCE(!sd->shared);
-		sds = sd->shared;
+		/*
+		 * If sd_asym_cpucapacity didn't claim the shared object,
+		 * sd_llc must have one linked.
+		 */
+		if (!sds) {
+			WARN_ON_ONCE(!sd->shared);
+			sds = sd->shared;
+		}
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
-	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
@@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
-	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
-	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
 }
 
 /*
@@ -2650,6 +2665,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+	int sd_id = cpumask_first(sched_domain_span(sd));
+
+	sd->shared = *per_cpu_ptr(d->sds, sd_id);
+	/*
+	 * nr_busy_cpus is consumed only by the NOHZ kick path via
+	 * sd_balance_shared; on the asym-capacity path it is initialized but
+	 * never read.
+	 */
+	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+	atomic_inc(&sd->shared->ref);
+}
+
+/*
+ * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
+ * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
+ * not an overlapping NUMA-built domain (then LLC should claim shared).
+ *
+ * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
+ * then LLC must claim shared instead.
+ *
+ * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
+ * are present in the domain span, so the asym domain we attach to cannot
+ * degenerate into a single-capacity group. The relevant edge cases are instead
+ * covered by the caveats above.
+ *
+ * Return true if this CPU's asym path claimed sd->shared, false otherwise.
+ */
+static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
+	struct sched_domain *sd_asym;
+
+	if (!sd)
+		return false;
+
+	sd_asym = sd;
+	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+		sd_asym = sd_asym->parent;
+
+	if (!sd_asym || (sd_asym->flags & SD_NUMA))
+		return false;
+
+	init_sched_domain_shared(d, sd_asym);
+	return true;
+}
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -2708,20 +2771,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 
 	for_each_cpu(i, cpu_map) {
+		bool asym_claimed = false;
+
 		sd = *per_cpu_ptr(d.sd, i);
 		if (!sd)
 			continue;
 
+		if (has_asym)
+			asym_claimed = claim_asym_sched_domain_shared(&d, i);
+
 		/* First, find the topmost SD_SHARE_LLC domain */
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			int sd_id = cpumask_first(sched_domain_span(sd));
-
-			sd->shared = *per_cpu_ptr(d.sds, sd_id);
-			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
-			atomic_inc(&sd->shared->ref);
+			/*
+			 * Initialize the sd->shared for SD_SHARE_LLC unless
+			 * the asym path above already claimed it.
+			 */
+			if (!asym_claimed)
+				init_sched_domain_shared(&d, sd);
 
 			/*
 			 * In presence of higher domains, adjust the
-- 
2.54.0


* [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:07   ` Vincent Guittot
  2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
  2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
  4 siblings, 1 reply; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate
capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
SMT-aware idle selection has been shown to improve throughput by around
15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
CPU-bound workloads (NVBLAS) running as many tasks as there are SMT
cores.
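
The resulting preference order can be sketched in userspace as below. The enum
values are the ones introduced by this patch; toy_rank() and TOY_PERFECT_FIT
are hypothetical, since the real code returns the CPU immediately on a perfect
fit instead of assigning a rank.

```c
#include <stdbool.h>

/* Same values as the enum added by this patch; lower means preferred. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_COMPLETE_MISFIT,

	/* Offset applied to CPUs on a preferred (fully idle) core. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Hypothetical sentinel standing in for the early "return cpu". */
#define TOY_PERFECT_FIT	(-100)

/*
 * fits follows util_fits_cpu(): > 0 fits, -1 only uclamp_min misses,
 * 0 does not fit. preferred_core is true on a fully idle core.
 */
int toy_rank(int fits, bool preferred_core)
{
	if (fits > 0 && preferred_core)
		return TOY_PERFECT_FIT;
	if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;
	return fits;
}
```

For example, a complete misfit on a fully idle core ranks at -3 and therefore
still wins over a fitting CPU whose sibling is busy, which ranks at -2.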

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 119 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 113 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 960a1a9696b98..6f0835c15ee11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	return idle_cpu;
 }
 
+/*
+ * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
+ * where lower values indicate a better fit - see select_idle_capacity().
+ *
+ * A CPU that both fits the task and sits on a fully-idle SMT core is returned
+ * immediately and is never assigned one of these ranks. On !SMT every CPU is
+ * its own "core", so the early return covers all fits-and-idle cases and the
+ * core-tier ranks below become unreachable.
+ *
+ *   Rank                            Val  Tier    Meaning
+ *   ------------------------------  ---  ------  ---------------------------
+ *   ASYM_IDLE_CORE_UCLAMP_MISFIT    -4   core    Idle core; capacity fits
+ *                                                util but uclamp_min misses.
+ *   ASYM_IDLE_CORE_COMPLETE_MISFIT  -3   core    Idle core; capacity does
+ *                                                not fit. Still beats every
+ *                                                thread-tier rank: a busy
+ *                                                sibling cuts effective
+ *                                                capacity more than a
+ *                                                misfit hurts a quiet core.
+ *   ASYM_IDLE_THREAD_FITS           -2   thread  Busy SMT sibling; capacity
+ *                                                fits util + uclamp.
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT  -1   thread  Busy SMT sibling; capacity
+ *                                                fits but uclamp_min misses
+ *                                                (native util_fits_cpu()
+ *                                                return value).
+ *   ASYM_IDLE_COMPLETE_MISFIT        0   thread  Busy SMT sibling; capacity
+ *                                                does not fit.
+ *
+ * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
+ * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
+ *
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
+ *   ASYM_IDLE_COMPLETE_MISFIT       (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
+ *
+ * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
+ * candidate early-returns from select_idle_capacity().
+ */
+enum asym_fits_state {
+	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
+	ASYM_IDLE_CORE_COMPLETE_MISFIT,
+	ASYM_IDLE_THREAD_FITS,
+	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+	ASYM_IDLE_COMPLETE_MISFIT,
+
+	/* util_fits_cpu() bias for idle core */
+	ASYM_IDLE_CORE_BIAS = -3,
+};
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	/*
+	 * On !SMT systems, has_idle_core is always false and preferred_core
+	 * is always true (CPU == core), so the SMT preference logic below
+	 * collapses to the plain capacity scan.
+	 */
+	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
 	unsigned long task_util, util_min, util_max, best_cap = 0;
-	int fits, best_fits = 0;
+	int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
+		bool preferred_core = !has_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
 		if (!choose_idle_cpu(cpu, p))
@@ -8046,8 +8101,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
 
-		/* This CPU fits with all requirements */
-		if (fits > 0)
+		/*
+		 * Perfect fit: capacity satisfies util + uclamp and the CPU
+		 * sits on a fully-idle SMT core (or this is a !SMT system).
+		 * Short-circuit the rank-based selection and return
+		 * immediately.
+		 */
+		if (fits > 0 && preferred_core)
 			return cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8055,9 +8115,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 */
 		else if (fits < 0)
 			cpu_cap = get_actual_cpu_capacity(cpu);
+		/*
+		 * fits > 0 implies we are not on a preferred core, but the util
+		 * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+		 * so the effective range becomes
+		 * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
+		 *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
+		 *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+		 *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+		 */
+		else if (fits > 0)
+			fits = ASYM_IDLE_THREAD_FITS;
 
 		/*
-		 * First, select CPU which fits better (-1 being better than 0).
+		 * If we are on a preferred core, translate the range of fits
+		 * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
+		 * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
+		 * This ensures that an idle core is always given priority over
+		 * a (partially) busy core.
+		 *
+		 * A fully fitting idle core would have returned early and hence
+		 * fits > 0 for preferred_core need not be dealt with.
+		 */
+		if (preferred_core)
+			fits += ASYM_IDLE_CORE_BIAS;
+
+		/*
+		 * First, select CPU which fits better (lower is more preferred).
 		 * Then, select the one with best capacity at same level.
 		 */
 		if ((fits < best_fits) ||
@@ -8068,6 +8152,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		}
 	}
 
+	/*
+	 * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
+	 * range means the chosen CPU is in a fully idle SMT core. Values above
+	 * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
+	 *
+	 * The asym-capacity wakeup path returns from select_idle_sibling()
+	 * after this function and never runs select_idle_cpu(), so the usual
+	 * select_idle_cpu() tail that clears idle cores must live here when the
+	 * idle-core preference did not win.
+	 */
+	if (has_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
+		set_idle_cores(target, false);
+
 	return best_cpu;
 }
 
@@ -8076,12 +8173,22 @@ static inline bool asym_fits_cpu(unsigned long util,
 				 unsigned long util_max,
 				 int cpu)
 {
-	if (sched_asym_cpucap_active())
+	if (sched_asym_cpucap_active()) {
 		/*
 		 * Return true only if the cpu fully fits the task requirements
 		 * which include the utilization and the performance hints.
+		 *
+		 * When SMT is active, also require that the core has no busy
+		 * siblings.
+		 *
+		 * Note: gating on is_core_idle() also makes the early-bailout
+		 * candidates in select_idle_sibling() (target, prev,
+		 * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
+		 * NO_ASYM path does not do.
 		 */
-		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+		return (!sched_smt_active() || is_core_idle(cpu)) &&
+		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+	}
 
 	return true;
 }
-- 
2.54.0



* [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
                   ` (2 preceding siblings ...)
  2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:07   ` Vincent Guittot
                     ` (2 more replies)
  2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
  4 siblings, 3 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
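
The resulting check in update_sd_pick_busiest() can be sketched as a pure
boolean function. This is only an illustrative model: misfit_pull_rejected(),
misfit_pull_examples() and the parameter names are hypothetical, and each
parameter stands in for the kernel expression named in its comment.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the misfit check (not kernel code). Returns true
 * when the candidate group is rejected as "busiest".
 *
 * @sd_asym:         env->sd->flags & SD_ASYM_CPUCAPACITY
 * @group_misfit:    sgs->group_type == group_misfit_task
 * @dst_core_idle:   !sched_smt_active() || is_core_idle(env->dst_cpu)
 * @dst_cap_greater: capacity_greater(capacity_of(dst_cpu), max_capacity)
 * @local_has_spare: sds->local_stat.group_type == group_has_spare
 */
bool misfit_pull_rejected(bool sd_asym, bool group_misfit, bool dst_core_idle,
			  bool dst_cap_greater, bool local_has_spare)
{
	return sd_asym && group_misfit &&
	       (!dst_core_idle || !dst_cap_greater || !local_has_spare);
}

/* A few illustrative cases for the model above. */
void misfit_pull_examples(void)
{
	/* Bigger dst CPU, but its SMT sibling is busy: reject the pull. */
	assert(misfit_pull_rejected(true, true, false, true, true));

	/* Bigger dst CPU on a fully idle core with spare capacity: allow. */
	assert(!misfit_pull_rejected(true, true, true, true, true));

	/* Not an asym-capacity domain: this check does not apply. */
	assert(!misfit_pull_rejected(false, true, false, true, true));
}
```

On !SMT systems dst_core_idle is always true in this model, so the check
reduces to the pre-patch condition.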

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f0835c15ee11..2ddba8bd27e59 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9693,6 +9693,7 @@ struct lb_env {
 
 	int			dst_cpu;
 	struct rq		*dst_rq;
+	bool			dst_core_idle;
 
 	struct cpumask		*dst_grpmask;
 	int			new_dst_cpu;
@@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * We can use max_capacity here as reduction in capacity on some
 	 * CPUs in the group should either be possible to resolve
 	 * internally or be covered by avg_load imbalance (eventually).
+	 *
+	 * When SMT is active, only pull a misfit to dst_cpu if it is on a
+	 * fully idle core; otherwise the effective capacity of the core is
+	 * reduced and we may not actually provide more capacity than the
+	 * source.
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type == group_misfit_task) &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+	    (!env->dst_core_idle ||
+	     !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	unsigned long sum_util = 0;
 	bool sg_overloaded = 0, sg_overutilized = 0;
 
+	env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
+
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
-- 
2.54.0



* [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
                   ` (3 preceding siblings ...)
  2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:08   ` Vincent Guittot
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  4 siblings, 2 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

From: K Prateek Nayak <kprateek.nayak@amd.com>

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the scanned domain has sched_domain_shared data
(sd->shared is non-NULL), derive the per-attempt scan limit from
sd->shared->nr_idle_scan.

That bounds the walk on large LLCs: once nr_idle_scan is exhausted,
return the best CPU seen so far. The early exit is gated on
!has_idle_core so an active idle-core search (SMT with idle cores
reported by test_idle_cores()) isn't cut short before it gets a chance
to find one.
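
The effect of the scan budget can be sketched with a small counting model.
This is an assumption-laden illustration, not the kernel loop:
cpus_visited_model() and its parameters are hypothetical, and the model
ignores the perfect-fit early return and the choose_idle_cpu() filter so that
it only shows how nr and has_idle_core interact.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical model (not kernel code): count how many CPUs get past the
 * budget check in the scan loop.
 *
 * @sis_util:      sched_feat(SIS_UTIL) && sd->shared
 * @nr_idle_scan:  value read from sd->shared->nr_idle_scan
 * @has_idle_core: SMT active and test_idle_cores(target) reported a core
 * @ncpus:         number of CPUs in the scan mask
 */
int cpus_visited_model(bool sis_util, int nr_idle_scan, bool has_idle_core,
		       int ncpus)
{
	int nr = INT_MAX, visited = 0;

	assert(nr_idle_scan >= 0 && ncpus >= 0);

	if (sis_util) {
		nr = nr_idle_scan + 1;
		/* Overloaded domain: bail out before scanning anything. */
		if (nr == 1)
			return 0;
	}

	for (int cpu = 0; cpu < ncpus; cpu++) {
		/* The budget only applies when no idle core is being sought. */
		if (!has_idle_core && --nr <= 0)
			break;
		visited++;
	}

	return visited;
}
```

With nr_idle_scan = 2 and eight candidate CPUs, the model visits two CPUs
when has_idle_core is false and all eight when it is true. Note that the
nr == 1 bail-out happens before the loop and is not gated on has_idle_core.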

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ddba8bd27e59..494149f14d98f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8084,6 +8084,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
+	int nr = INT_MAX;
 
 	cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -8092,10 +8093,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_min = uclamp_eff_value(p, UCLAMP_MIN);
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
+	if (sched_feat(SIS_UTIL) && sd->shared) {
+		/*
+		 * Same nr_idle_scan hint as select_idle_cpu(); nr only limits
+		 * the scan when not preferring an idle core.
+		 */
+		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+		/* overloaded domain is unlikely to have idle cpu/core */
+		if (nr == 1)
+			return -1;
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target) {
 		bool preferred_core = !has_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
+		/*
+		 * Stop when the nr_idle_scan is exhausted (mirrors
+		 * select_idle_cpu() logic).
+		 */
+		if (!has_idle_core && --nr <= 0)
+			return best_cpu;
+
 		if (!choose_idle_cpu(cpu, p))
 			continue;
 
-- 
2.54.0



* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-05-11 13:04   ` Vincent Guittot
  2026-05-15  6:49   ` Shrikanth Hegde
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
>
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
>
> No functional change intended.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 38 +++++++++++---------------------------
>  1 file changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..6b059ee80b631 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
>                 goto out;
>         }
>
> -       rcu_read_lock();
> -
>         sd = rcu_dereference_all(rq->sd);
>         if (sd) {
>                 /*
> @@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * capacity, kick the ILB to see if there's a better CPU to run on:
>                  */
>                 if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +                       goto out;
>                 }
>         }
>
> @@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  */
>                 for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>                         if (sched_asym(sd, i, cpu)) {
> -                               flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                               goto unlock;
> +                               flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +                               goto out;
>                         }
>                 }
>         }
> @@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>                  * to run the misfit task on.
>                  */
> -               if (check_misfit_status(rq)) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> -               }
> +               if (check_misfit_status(rq))
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>
>                 /*
>                  * For asymmetric systems, we do not want to nicely balance
> @@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
>                  *
>                  * Skip the LLC logic because it's not relevant in that case.
>                  */
> -               goto unlock;
> +               goto out;
>         }
>
>         sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> @@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * like this LLC domain has tasks we could move.
>                  */
>                 nr_busy = atomic_read(&sds->nr_busy_cpus);
> -               if (nr_busy > 1) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> -               }
> +               if (nr_busy > 1)
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>         }
> -unlock:
> -       rcu_read_unlock();
>  out:
>         if (READ_ONCE(nohz.needs_update))
>                 flags |= NOHZ_NEXT_KICK;
> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>  static void set_cpu_sd_state_busy(int cpu)
>  {
>         struct sched_domain *sd;
> -
> -       rcu_read_lock();
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
>         if (!sd || !sd->nohz_idle)
> -               goto unlock;
> +               return;
>         sd->nohz_idle = 0;
>
>         atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -       rcu_read_unlock();
>  }
>
>  void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>  static void set_cpu_sd_state_idle(int cpu)
>  {
>         struct sched_domain *sd;
> -
> -       rcu_read_lock();
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
>         if (!sd || sd->nohz_idle)
> -               goto unlock;
> +               return;
>         sd->nohz_idle = 1;
>
>         atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -       rcu_read_unlock();
>  }
>
>  /*
> --
> 2.54.0
>


* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-05-11 13:04   ` Vincent Guittot
  2026-05-15 10:05   ` Shrikanth Hegde
  2026-07-03 10:22   ` kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE Breno Leitao
  2 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
>
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
>
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
>
> Fall back to attaching the shared object to sd_llc in three cases:
>
>   1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
>
>   2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>      island: has_asym is system-wide but those CPUs have no
>      SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>      the symmetric LLC path in select_idle_sibling();
>
>   3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>      SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>      blob off cpumask_first(span), which on overlapping NUMA domains
>      would alias unrelated spans onto the same blob. Keep the shared
>      object on the LLC there; select_idle_capacity() gracefully skips
>      the has_idle_cores preference when sd->shared is NULL.
>
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Acked-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  kernel/sched/fair.c     | 19 ++++++---
>  kernel/sched/sched.h    |  2 +-
>  kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
>  3 files changed, 95 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b059ee80b631..960a1a9696b98 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7819,7 +7819,7 @@ static inline void set_idle_cores(int cpu, int val)
>  {
>         struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +       sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>         if (sds)
>                 WRITE_ONCE(sds->has_idle_cores, val);
>  }
> @@ -7828,7 +7828,7 @@ static inline bool test_idle_cores(int cpu)
>  {
>         struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +       sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>         if (sds)
>                 return READ_ONCE(sds->has_idle_cores);
>
> @@ -7837,7 +7837,7 @@ static inline bool test_idle_cores(int cpu)
>
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
> - * information in sd_llc_shared->has_idle_cores.
> + * information in sd_balance_shared->has_idle_cores.
>   *
>   * Since SMT siblings share all cache levels, inspecting this limited remote
>   * state should be fairly cheap.
> @@ -7954,7 +7954,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> -       if (sched_feat(SIS_UTIL)) {
> +       if (sched_feat(SIS_UTIL) && sd->shared) {
>                 /*
>                  * Increment because !--nr is the condition to stop scan.
>                  *
> @@ -12834,7 +12834,7 @@ static void nohz_balancer_kick(struct rq *rq)
>                 goto out;
>         }
>
> -       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +       sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>         if (sds) {
>                 /*
>                  * If there is an imbalance between LLC domains (IOW we could
> @@ -12862,7 +12862,11 @@ static void set_cpu_sd_state_busy(int cpu)
>         struct sched_domain *sd;
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -       if (!sd || !sd->nohz_idle)
> +       /*
> +        * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
> +        * domain has no shared object there is nothing to clear or account.
> +        */
> +       if (!sd || !sd->shared || !sd->nohz_idle)
>                 return;
>         sd->nohz_idle = 0;
>
> @@ -12887,7 +12891,8 @@ static void set_cpu_sd_state_idle(int cpu)
>         struct sched_domain *sd;
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -       if (!sd || sd->nohz_idle)
> +       /* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
> +       if (!sd || !sd->shared || sd->nohz_idle)
>                 return;
>         sd->nohz_idle = 1;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d1..330f5893c4561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
>  DECLARE_PER_CPU(int, sd_share_id);
> -DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d552..9bc4d11dd6a98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(int, sd_share_id);
> -DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
>         int id = cpu;
>         int size = 1;
>
> +       sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> +       /*
> +        * The shared object is attached to sd_asym_cpucapacity only when the
> +        * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> +        * On overlapping (NUMA) asym domains we fall back to letting the
> +        * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> +        * here.
> +        */
> +       if (sd && sd->shared)
> +               sds = sd->shared;
> +
> +       rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
>         sd = highest_flag_domain(cpu, SD_SHARE_LLC);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>
> -               /* If sd_llc exists, sd_llc_shared should exist too. */
> -               WARN_ON_ONCE(!sd->shared);
> -               sds = sd->shared;
> +               /*
> +                * If sd_asym_cpucapacity didn't claim the shared object,
> +                * sd_llc must have one linked.
> +                */
> +               if (!sds) {
> +                       WARN_ON_ONCE(!sd->shared);
> +                       sds = sd->shared;
> +               }
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>         per_cpu(sd_llc_id, cpu) = id;
> -       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> +       rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
>
>         sd = lowest_flag_domain(cpu, SD_CLUSTER);
>         if (sd)
> @@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
>
>         sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>         rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> -       sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> -       rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
>  }
>
>  /*
> @@ -2650,6 +2665,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>         }
>  }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> +       int sd_id = cpumask_first(sched_domain_span(sd));
> +
> +       sd->shared = *per_cpu_ptr(d->sds, sd_id);
> +       /*
> +        * nr_busy_cpus is consumed only by the NOHZ kick path via
> +        * sd_balance_shared; on the asym-capacity path it is initialized but
> +        * never read.
> +        */
> +       atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> +       atomic_inc(&sd->shared->ref);
> +}
> +
> +/*
> + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
> + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
> + * not an overlapping NUMA-built domain (then LLC should claim shared).
> + *
> + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island);
> + * in that case LLC must claim shared instead.
> + *
> + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
> + * are present in the domain span, so the asym domain we attach to cannot
> + * degenerate into a single-capacity group. The relevant edge cases are instead
> + * covered by the caveats above.
> + *
> + * Return true if this CPU's asym path claimed sd->shared, false otherwise.
> + */
> +static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
> +{
> +       struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
> +       struct sched_domain *sd_asym;
> +
> +       if (!sd)
> +               return false;
> +
> +       sd_asym = sd;
> +       while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> +               sd_asym = sd_asym->parent;
> +
> +       if (!sd_asym || (sd_asym->flags & SD_NUMA))
> +               return false;
> +
> +       init_sched_domain_shared(d, sd_asym);
> +       return true;
> +}
> +
>  /*
>   * Build sched domains for a given set of CPUs and attach the sched domains
>   * to the individual CPUs
> @@ -2708,20 +2771,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>         }
>
>         for_each_cpu(i, cpu_map) {
> +               bool asym_claimed = false;
> +
>                 sd = *per_cpu_ptr(d.sd, i);
>                 if (!sd)
>                         continue;
>
> +               if (has_asym)
> +                       asym_claimed = claim_asym_sched_domain_shared(&d, i);
> +
>                 /* First, find the topmost SD_SHARE_LLC domain */
>                 while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
>                         sd = sd->parent;
>
>                 if (sd->flags & SD_SHARE_LLC) {
> -                       int sd_id = cpumask_first(sched_domain_span(sd));
> -
> -                       sd->shared = *per_cpu_ptr(d.sds, sd_id);
> -                       atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> -                       atomic_inc(&sd->shared->ref);
> +                       /*
> +                        * Initialize the sd->shared for SD_SHARE_LLC unless
> +                        * the asym path above already claimed it.
> +                        */
> +                       if (!asym_claimed)
> +                               init_sched_domain_shared(&d, sd);
>
>                         /*
>                          * In presence of higher domains, adjust the
> --
> 2.54.0
>


* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-05-11 13:07   ` Vincent Guittot
  2026-05-11 13:45     ` Andrea Righi
  2026-05-11 14:25     ` [PATCH v2 " Andrea Righi
  0 siblings, 2 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring fully-idle cores when available leads to more accurate
> capacity usage on task wakeup.
>
> On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
> SMT-aware idle selection has been shown to improve throughput by around
> 15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
> CPU-bound workloads (NVBLAS) running a number of tasks equal to the
> number of SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

I still have comments about the description and naming below, but
overall the patch looks good to me.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 119 +++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 113 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 960a1a9696b98..6f0835c15ee11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         return idle_cpu;
>  }
>
> +/*
> + * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
> + * where lower values indicate a better fit - see select_idle_capacity().
> + *
> + * A CPU that both fits the task and sits on a fully-idle SMT core is returned
> + * immediately and is never assigned one of these ranks. On !SMT every CPU is
> + * its own "core", so the early return covers all fits-and-idle cases and the
> + * core-tier ranks below become unreachable.
> + *
> + *   Rank                            Val  Tier    Meaning
> + *   ------------------------------  ---  ------  ---------------------------
> + *   ASYM_IDLE_CORE_UCLAMP_MISFIT    -4   core    Idle core; capacity fits
> + *                                                util but uclamp_min misses.
> + *   ASYM_IDLE_CORE_COMPLETE_MISFIT  -3   core    Idle core; capacity does
> + *                                                not fit. Still beats every
> + *                                                thread-tier rank: a busy
> + *                                                sibling cuts effective
> + *                                                capacity more than a
> + *                                                misfit hurts a quiet core.
> + *   ASYM_IDLE_THREAD_FITS           -2   thread  Busy SMT sibling; capacity
> + *                                                fits util + uclamp.
> + *   ASYM_IDLE_THREAD_UCLAMP_MISFIT  -1   thread  Busy SMT sibling; capacity
> + *                                                fits but uclamp_min misses
> + *                                                (native util_fits_cpu()
> + *                                                return value).
> + *   ASYM_IDLE_COMPLETE_MISFIT        0   thread  Busy SMT sibling; capacity
> + *                                                does not fit.
> + *
> + * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
> + * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
> + *
> + *   ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
> + *   ASYM_IDLE_COMPLETE_MISFIT       (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
> + *
> + * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
> + * candidate early-returns from select_idle_capacity().
> + */
> +enum asym_fits_state {
> +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,

ASYM_IDLE_UCLAMP_MISFIT
See why in comments for select_idle_capacity()

> +       ASYM_IDLE_CORE_COMPLETE_MISFIT,

ASYM_IDLE_COMPLETE_MISFIT,

> +       ASYM_IDLE_THREAD_FITS,
> +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> +       ASYM_IDLE_COMPLETE_MISFIT,

ASYM_IDLE_THREAD_MISFIT,

> +
> +       /* util_fits_cpu() bias for idle core */
> +       ASYM_IDLE_CORE_BIAS = -3,
> +};
> +
>  /*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> +       /*
> +        * On !SMT systems, has_idle_core is always false and preferred_core
> +        * is always true (CPU == core), so the SMT preference logic below
> +        * collapses to the plain capacity scan.
> +        */
> +       bool has_idle_core = sched_smt_active() && test_idle_cores(target);
>         unsigned long task_util, util_min, util_max, best_cap = 0;
> -       int fits, best_fits = 0;
> +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
>         int cpu, best_cpu = -1;
>         struct cpumask *cpus;
>
> @@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
>         for_each_cpu_wrap(cpu, cpus, target) {
> +               bool preferred_core = !has_idle_core || is_core_idle(cpu);

If sched_smt_active() is true and test_idle_cores(target) is false
(meaning we have SMT but no idle core), then has_idle_core is false
and preferred_core is true. We will return immediately if
util_fits_cpu() reports a fit, and we will use the ASYM_IDLE_CORE_*
values otherwise. So I think that we should remove the "CORE_" from
the naming.

ASYM_IDLE_THREAD_* values are only used when SMT is active and the
idle-cores hint promises that there is an idle core to find.
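
To make the cases above concrete, here is a minimal userspace sketch
(not kernel code). The three booleans are stand-ins for
sched_smt_active(), test_idle_cores(target) and is_core_idle(cpu); the
model_* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the two booleans computed in select_idle_capacity().
 * The parameters replace the kernel helpers, which are not available here.
 */
bool model_preferred_core(bool smt_active, bool idle_cores_hint, bool core_idle)
{
	bool has_idle_core = smt_active && idle_cores_hint;

	return !has_idle_core || core_idle;
}

/* Walks through the scenarios discussed in the review comment. */
void model_preferred_core_cases(void)
{
	/* !SMT: every CPU counts as preferred. */
	assert(model_preferred_core(false, false, false));

	/* SMT, hint reports no idle core: preferred even with a busy sibling. */
	assert(model_preferred_core(true, false, false));

	/* SMT, hint promises an idle core: only CPUs on idle cores qualify. */
	assert(!model_preferred_core(true, true, false));
	assert(model_preferred_core(true, true, true));
}
```

The second case is the one that motivates the rename: the biased ranks
are handed out even though the CPU is not on an idle core.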

>                 unsigned long cpu_cap = capacity_of(cpu);
>
>                 if (!choose_idle_cpu(cpu, p))
> @@ -8046,8 +8101,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>
>                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> -               /* This CPU fits with all requirements */
> -               if (fits > 0)
> +               /*
> +                * Perfect fit: capacity satisfies util + uclamp and the CPU
> +                * sits on a fully-idle SMT core (or this is a !SMT system).

Or there is no idle core to find.


> +                * Short-circuit the rank-based selection and return
> +                * immediately.
> +                */
> +               if (fits > 0 && preferred_core)
>                         return cpu;
>                 /*
>                  * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -8055,9 +8115,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                  */
>                 else if (fits < 0)
>                         cpu_cap = get_actual_cpu_capacity(cpu);
> +               /*
> +                * fits > 0 implies we are not on a preferred core, but the util
> +                * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> +                * so the effective range becomes
> +                * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> +                *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
> +                *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> +                *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> +                */
> +               else if (fits > 0)
> +                       fits = ASYM_IDLE_THREAD_FITS;
>
>                 /*
> -                * First, select CPU which fits better (-1 being better than 0).
> +                * If we are on a preferred core, translate the range of fits
> +                * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> +                * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> +                * This ensures that an idle core is always given priority over
> +                * (partially) busy core.
> +                *
> +                * A fully fitting idle core would have returned early and hence
> +                * fits > 0 for preferred_core need not be dealt with.
> +                */
> +               if (preferred_core)
> +                       fits += ASYM_IDLE_CORE_BIAS;
> +
> +               /*
> +                * First, select CPU which fits better (lower is more preferred).
>                  * Then, select the one with best capacity at same level.
>                  */
>                 if ((fits < best_fits) ||
> @@ -8068,6 +8152,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                 }
>         }
>
> +       /*
> +        * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
> +        * range means the chosen CPU is in a fully idle SMT core. Values above
> +        * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
> +        *
> +        * The asym-capacity wakeup path returns from select_idle_sibling()
> +        * after this function and never runs select_idle_cpu(), so the usual
> +        * select_idle_cpu() tail that clears idle cores must live here when the
> +        * idle-core preference did not win.
> +        */
> +       if (has_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
> +               set_idle_cores(target, false);
> +
>         return best_cpu;
>  }
>
> @@ -8076,12 +8173,22 @@ static inline bool asym_fits_cpu(unsigned long util,
>                                  unsigned long util_max,
>                                  int cpu)
>  {
> -       if (sched_asym_cpucap_active())
> +       if (sched_asym_cpucap_active()) {
>                 /*
>                  * Return true only if the cpu fully fits the task requirements
>                  * which include the utilization and the performance hints.
> +                *
> +                * When SMT is active, also require that the core has no busy
> +                * siblings.
> +                *
> +                * Note: gating on is_core_idle() also makes the early-bailout
> +                * candidates in select_idle_sibling() (target, prev,
> +                * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
> +                * NO_ASYM path does not do.
>                  */
> -               return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +               return (!sched_smt_active() || is_core_idle(cpu)) &&
> +                      (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +       }
>
>         return true;
>  }
> --
> 2.54.0
>

* Re: [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-05-11 13:07   ` Vincent Guittot
  2026-05-15 10:09   ` Shrikanth Hegde
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  2 siblings, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
> capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
> busy: the core does not deliver its full nominal capacity.
>
> If SMT is active and dst_cpu is not on a fully idle core, skip this
> destination so we do not migrate a misfit expecting a capacity upgrade we
> cannot actually provide.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
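
For readers following along, the extended condition can be modelled in
userspace roughly as below. This is only a sketch under assumptions: it
presumes the SD_ASYM_CPUCAPACITY and group_misfit_task preconditions
already hold, CAPACITY_GREATER is assumed to mirror the ~5% margin of
the kernel's capacity_greater() macro, and model_reject_misfit_pull()
is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the kernel's capacity_greater() margin (~5%). */
#define CAPACITY_GREATER(cap1, cap2) ((cap1) * 1024UL > (cap2) * 1078UL)

/*
 * Returns true when the misfit group should NOT be picked as busiest,
 * i.e. when the pull towards dst_cpu is rejected.
 *
 * core_idle stands in for is_core_idle(env->dst_cpu), local_has_spare for
 * sds->local_stat.group_type == group_has_spare.
 */
bool model_reject_misfit_pull(bool smt_active, bool core_idle,
			      unsigned long dst_cap, unsigned long src_max_cap,
			      bool local_has_spare)
{
	/* Same expression the patch stores in env->dst_core_idle. */
	bool dst_core_idle = !smt_active || core_idle;

	assert(dst_cap > 0 && src_max_cap > 0);

	return !dst_core_idle ||
	       !CAPACITY_GREATER(dst_cap, src_max_cap) ||
	       !local_has_spare;
}
```

With dst_cap = 1024 and src_max_cap = 512, the pull is accepted on an
idle core and rejected as soon as the SMT sibling is busy; without SMT
the new term has no effect.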


> ---
>  kernel/sched/fair.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f0835c15ee11..2ddba8bd27e59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9693,6 +9693,7 @@ struct lb_env {
>
>         int                     dst_cpu;
>         struct rq               *dst_rq;
> +       bool                    dst_core_idle;
>
>         struct cpumask          *dst_grpmask;
>         int                     new_dst_cpu;
> @@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * We can use max_capacity here as reduction in capacity on some
>          * CPUs in the group should either be possible to resolve
>          * internally or be covered by avg_load imbalance (eventually).
> +        *
> +        * When SMT is active, only pull a misfit to dst_cpu if it is on a
> +        * fully idle core; otherwise the effective capacity of the core is
> +        * reduced and we may not actually provide more capacity than the
> +        * source.
>          */
>         if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
>             (sgs->group_type == group_misfit_task) &&
> -           (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> +           (!env->dst_core_idle ||
> +            !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>              sds->local_stat.group_type != group_has_spare))
>                 return false;
>
> @@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>         unsigned long sum_util = 0;
>         bool sg_overloaded = 0, sg_overutilized = 0;
>
> +       env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
> +
>         do {
>                 struct sg_lb_stats *sgs = &tmp_sgs;
>                 int local_group;
> --
> 2.54.0
>

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-05-11 13:08   ` Vincent Guittot
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  1 sibling, 0 replies; 47+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:08 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> is enabled and the LLC domain has sched_domain_shared data, derive the
> per-attempt scan limit from sd->shared->nr_idle_scan.
>
> That bounds the walk on large LLCs: once nr_idle_scan is exhausted,
> return the best CPU seen so far. The early exit is gated on
> !has_idle_core so an active idle-core search (SMT with idle cores
> reported by test_idle_cores()) isn't cut short before it gets a chance
> to find one.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
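
The scan budget described in the commit message can be sketched in
userspace as follows. This is an illustration, not the kernel code: it
assumes SIS_UTIL is enabled and sd->shared exists, it returns the first
idle CPU instead of ranking candidates (so "best CPU seen so far" is
simply -1 here), and model_bounded_scan() and its parameters are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * idle[cpu] says whether a CPU is idle, nr_idle_scan plays the role of
 * sd->shared->nr_idle_scan, *visited counts the CPUs actually inspected.
 */
int model_bounded_scan(const bool *idle, int ncpus, int nr_idle_scan,
		       bool has_idle_core, int *visited)
{
	int nr = nr_idle_scan + 1;
	int cpu;

	assert(ncpus >= 0 && nr_idle_scan >= 0);
	*visited = 0;

	/* nr_idle_scan == 0: the domain looks overloaded, do not scan. */
	if (nr == 1)
		return -1;

	for (cpu = 0; cpu < ncpus; cpu++) {
		/* The budget is only consumed when no idle core is expected. */
		if (!has_idle_core && --nr <= 0)
			return -1;

		(*visited)++;
		if (idle[cpu])
			return cpu;
	}

	return -1;
}
```

With nr_idle_scan = 2 and has_idle_core false, the loop inspects two
CPUs and gives up; with has_idle_core true the budget is never
decremented and the whole span is walked.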


> ---
>  kernel/sched/fair.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2ddba8bd27e59..494149f14d98f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8084,6 +8084,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
>         int cpu, best_cpu = -1;
>         struct cpumask *cpus;
> +       int nr = INT_MAX;
>
>         cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -8092,10 +8093,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> +               /*
> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> +                * the scan when not preferring an idle core.
> +                */
> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> +               /* overloaded domain is unlikely to have idle cpu/core */
> +               if (nr == 1)
> +                       return -1;
> +       }
> +
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 bool preferred_core = !has_idle_core || is_core_idle(cpu);
>                 unsigned long cpu_cap = capacity_of(cpu);
>
> +               /*
> +                * Stop when the nr_idle_scan is exhausted (mirrors
> +                * select_idle_cpu() logic).
> +                */
> +               if (!has_idle_core && --nr <= 0)
> +                       return best_cpu;
> +
>                 if (!choose_idle_cpu(cpu, p))
>                         continue;
>
> --
> 2.54.0
>

* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-11 13:07   ` Vincent Guittot
@ 2026-05-11 13:45     ` Andrea Righi
  2026-05-11 14:25     ` [PATCH v2 " Andrea Righi
  1 sibling, 0 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-11 13:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

Hi Vincent,

On Mon, May 11, 2026 at 03:07:50PM +0200, Vincent Guittot wrote:
> On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
...
> > +/*
> > + * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
> > + * where lower values indicate a better fit - see select_idle_capacity().
> > + *
> > + * A CPU that both fits the task and sits on a fully-idle SMT core is returned
> > + * immediately and is never assigned one of these ranks. On !SMT every CPU is
> > + * its own "core", so the early return covers all fits-and-idle cases and the
> > + * core-tier ranks below become unreachable.
> > + *
> > + *   Rank                            Val  Tier    Meaning
> > + *   ------------------------------  ---  ------  ---------------------------
> > + *   ASYM_IDLE_CORE_UCLAMP_MISFIT    -4   core    Idle core; capacity fits
> > + *                                                util but uclamp_min misses.
> > + *   ASYM_IDLE_CORE_COMPLETE_MISFIT  -3   core    Idle core; capacity does
> > + *                                                not fit. Still beats every
> > + *                                                thread-tier rank: a busy
> > + *                                                sibling cuts effective
> > + *                                                capacity more than a
> > + *                                                misfit hurts a quiet core.
> > + *   ASYM_IDLE_THREAD_FITS           -2   thread  Busy SMT sibling; capacity
> > + *                                                fits util + uclamp.
> > + *   ASYM_IDLE_THREAD_UCLAMP_MISFIT  -1   thread  Busy SMT sibling; capacity
> > + *                                                fits but uclamp_min misses
> > + *                                                (native util_fits_cpu()
> > + *                                                return value).
> > + *   ASYM_IDLE_COMPLETE_MISFIT        0   thread  Busy SMT sibling; capacity
> > + *                                                does not fit.
> > + *
> > + * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
> > + * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
> > + *
> > + *   ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
> > + *   ASYM_IDLE_COMPLETE_MISFIT       (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
> > + *
> > + * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
> > + * candidate early-returns from select_idle_capacity().
> > + */
> > +enum asym_fits_state {
> > +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> 
> ASYM_IDLE_UCLAMP_MISFIT
> See why in comments for select_idle_capacity()
> 
> > +       ASYM_IDLE_CORE_COMPLETE_MISFIT,
> 
> ASYM_IDLE_COMPLETE_MISFIT,
> 
> > +       ASYM_IDLE_THREAD_FITS,
> > +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > +       ASYM_IDLE_COMPLETE_MISFIT,
> 
> ASYM_IDLE_THREAD_MISFIT,
> 
> > +
> > +       /* util_fits_cpu() bias for idle core */
> > +       ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> > +       /*
> > +        * On !SMT systems, has_idle_core is always false and preferred_core
> > +        * is always true (CPU == core), so the SMT preference logic below
> > +        * collapses to the plain capacity scan.
> > +        */
> > +       bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> >         unsigned long task_util, util_min, util_max, best_cap = 0;
> > -       int fits, best_fits = 0;
> > +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> >         int cpu, best_cpu = -1;
> >         struct cpumask *cpus;
> >
> > @@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> >         for_each_cpu_wrap(cpu, cpus, target) {
> > +               bool preferred_core = !has_idle_core || is_core_idle(cpu);
> 
> If sched_smt_active() is true and test_idle_cores(target) is false
> (meaning we have SMT but no idle core), then has_idle_core is false
> and preferred_core is true. We will return immediately if
> util_fits_cpu() reports a fit, and we will use the ASYM_IDLE_CORE_*
> values otherwise. So I think that we should remove the "CORE_" from
> the naming.
> 
> ASYM_IDLE_THREAD_* values are only used when SMT is active and the
> idle-cores hint promises that there is an idle core to find.

Yes, I agree, the CORE_ prefix is misleading: those ranks can also be
assigned when sched_smt_active() && !test_idle_cores(target). I'll send
an updated patch with your naming scheme.

Thanks,
-Andrea

* [PATCH v2 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-11 13:07   ` Vincent Guittot
  2026-05-11 13:45     ` Andrea Righi
@ 2026-05-11 14:25     ` Andrea Righi
  2026-05-20  8:34       ` [tip: sched/core] " tip-bot2 for Andrea Righi
  1 sibling, 1 reply; 47+ messages in thread
From: Andrea Righi @ 2026-05-11 14:25 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate
capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
SMT-aware idle selection has been shown to improve throughput by around
15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
CPU-bound workloads (NVBLAS) running a number of tasks equal to the
number of SMT cores.

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Changes in v2:
 - Drop the misleading "CORE_" prefix from ASYM_IDLE_* ranks (Vincent Guittot)
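
For reference, the renamed ranks can be modelled in userspace as shown
here. This is only a sketch, not the kernel code: the enum is copied
from the patch, raw_fits imitates the util_fits_cpu() return convention
(1 fits, -1 only uclamp_min misses, 0 does not fit), and
MODEL_EARLY_RETURN plus the model_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum asym_fits_state {
	ASYM_IDLE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_COMPLETE_MISFIT,	/* -3 */
	ASYM_IDLE_THREAD_FITS,		/* -2 */
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,	/* -1 */
	ASYM_IDLE_THREAD_MISFIT,	/*  0 */

	/* Offset applied on a preferred core, not a state. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Sentinel for "return this CPU immediately" (not part of the patch). */
#define MODEL_EARLY_RETURN 1

/* Rank a candidate the way the loop body of select_idle_capacity() does. */
int model_rank(int raw_fits, bool preferred_core)
{
	int fits = raw_fits;

	assert(raw_fits >= -1 && raw_fits <= 1);

	if (fits > 0 && preferred_core)
		return MODEL_EARLY_RETURN;
	else if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;

	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;

	return fits;
}

/* Tail check: clear the idle-cores hint when no biased rank won. */
bool model_should_clear_idle_cores(bool has_idle_core, int best_fits)
{
	return has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT;
}
```

Lower is better, so a complete misfit on a preferred core (-3) still
ranks ahead of a fitting CPU whose sibling is busy (-2).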

 kernel/sched/fair.c | 120 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 114 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 960a1a9696b98..ffe3af10e5602 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	return idle_cpu;
 }
 
+/*
+ * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
+ * where lower values indicate a better fit - see select_idle_capacity().
+ *
+ * A CPU that both fits the task and sits on a fully-idle SMT core is returned
+ * immediately and is never assigned one of these ranks. On !SMT every CPU is
+ * its own "core", so the early return covers all fits-and-idle cases and the
+ * thread-tier ranks below become unreachable.
+ *
+ *   Rank                            Val  Tier    Meaning
+ *   ------------------------------  ---  ------  ---------------------------
+ *   ASYM_IDLE_UCLAMP_MISFIT         -4   core    Idle core; capacity fits
+ *                                                util but uclamp_min misses.
+ *   ASYM_IDLE_COMPLETE_MISFIT       -3   core    Idle core; capacity does
+ *                                                not fit. Still beats every
+ *                                                thread-tier rank: a busy
+ *                                                sibling cuts effective
+ *                                                capacity more than a
+ *                                                misfit hurts a quiet core.
+ *   ASYM_IDLE_THREAD_FITS           -2   thread  Busy SMT sibling; capacity
+ *                                                fits util + uclamp.
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT  -1   thread  Busy SMT sibling; capacity
+ *                                                fits but uclamp_min misses
+ *                                                (native util_fits_cpu()
+ *                                                return value).
+ *   ASYM_IDLE_THREAD_MISFIT          0   thread  Busy SMT sibling; capacity
+ *                                                does not fit.
+ *
+ * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
+ * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
+ *
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> ASYM_IDLE_UCLAMP_MISFIT   (-4)
+ *   ASYM_IDLE_THREAD_MISFIT         (0) + BIAS -> ASYM_IDLE_COMPLETE_MISFIT (-3)
+ *
+ * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
+ * candidate early-returns from select_idle_capacity().
+ */
+enum asym_fits_state {
+	ASYM_IDLE_UCLAMP_MISFIT = -4,
+	ASYM_IDLE_COMPLETE_MISFIT,
+	ASYM_IDLE_THREAD_FITS,
+	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+	ASYM_IDLE_THREAD_MISFIT,
+
+	/* util_fits_cpu() bias for idle core */
+	ASYM_IDLE_CORE_BIAS = -3,
+};
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	/*
+	 * On !SMT systems, has_idle_core is always false and preferred_core
+	 * is always true (CPU == core), so the SMT preference logic below
+	 * collapses to the plain capacity scan.
+	 */
+	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
 	unsigned long task_util, util_min, util_max, best_cap = 0;
-	int fits, best_fits = 0;
+	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
+		bool preferred_core = !has_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
 		if (!choose_idle_cpu(cpu, p))
@@ -8046,8 +8101,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
 
-		/* This CPU fits with all requirements */
-		if (fits > 0)
+		/*
+		 * Perfect fit: capacity satisfies util + uclamp and the CPU
+		 * sits on a fully-idle SMT core, this is a !SMT system, or
+		 * there is no idle core to find.
+		 * Short-circuit the rank-based selection and return
+		 * immediately.
+		 */
+		if (fits > 0 && preferred_core)
 			return cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8055,9 +8116,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 */
 		else if (fits < 0)
 			cpu_cap = get_actual_cpu_capacity(cpu);
+		/*
+		 * fits > 0 implies we are not on a preferred core, but the util
+		 * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+		 * so the effective range becomes
+		 * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_THREAD_MISFIT], where:
+		 *    ASYM_IDLE_THREAD_MISFIT - does not fit
+		 *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+		 *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+		 */
+		else if (fits > 0)
+			fits = ASYM_IDLE_THREAD_FITS;
 
 		/*
-		 * First, select CPU which fits better (-1 being better than 0).
+		 * If we are on a preferred core, translate the range of fits
+		 * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_THREAD_MISFIT] to
+		 * [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT].
+		 * This ensures that an idle core is always given priority over
+		 * (partially) busy core.
+		 *
+		 * A fully fitting idle core would have returned early and hence
+		 * fits > 0 for preferred_core need not be dealt with.
+		 */
+		if (preferred_core)
+			fits += ASYM_IDLE_CORE_BIAS;
+
+		/*
+		 * First, select CPU which fits better (lower is more preferred).
 		 * Then, select the one with best capacity at same level.
 		 */
 		if ((fits < best_fits) ||
@@ -8068,6 +8153,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		}
 	}
 
+	/*
+	 * A value in the [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT]
+	 * range means the chosen CPU is in a fully idle SMT core. Values above
+	 * ASYM_IDLE_COMPLETE_MISFIT mean we never ranked such a CPU best.
+	 *
+	 * The asym-capacity wakeup path returns from select_idle_sibling()
+	 * after this function and never runs select_idle_cpu(), so the usual
+	 * select_idle_cpu() tail that clears idle cores must live here when the
+	 * idle-core preference did not win.
+	 */
+	if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
+		set_idle_cores(target, false);
+
 	return best_cpu;
 }
 
@@ -8076,12 +8174,22 @@ static inline bool asym_fits_cpu(unsigned long util,
 				 unsigned long util_max,
 				 int cpu)
 {
-	if (sched_asym_cpucap_active())
+	if (sched_asym_cpucap_active()) {
 		/*
 		 * Return true only if the cpu fully fits the task requirements
 		 * which include the utilization and the performance hints.
+		 *
+		 * When SMT is active, also require that the core has no busy
+		 * siblings.
+		 *
+		 * Note: gating on is_core_idle() also makes the early-bailout
+		 * candidates in select_idle_sibling() (target, prev,
+		 * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
+		 * NO_ASYM path does not do.
 		 */
-		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+		return (!sched_smt_active() || is_core_idle(cpu)) &&
+		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+	}
 
 	return true;
 }
-- 
2.54.0


* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
@ 2026-05-15  6:49   ` Shrikanth Hegde
  2026-05-16  5:45     ` Andrea Righi
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  2026-05-21 19:47   ` [PATCH 1/5] " Marek Szyprowski
  3 siblings, 1 reply; 47+ messages in thread
From: Shrikanth Hegde @ 2026-05-15  6:49 UTC (permalink / raw)
  To: Andrea Righi, K Prateek Nayak, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, linux-kernel



On 5/9/26 11:37 PM, Andrea Righi wrote:
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
> 
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
> 
> No functional change intended.
> 

For this patch, a few more comments below.

Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>

> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>


> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>   static void set_cpu_sd_state_busy(int cpu)
>   {
>   	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>   
>   	if (!sd || !sd->nohz_idle)
> -		goto unlock;
> +		return;
>   	sd->nohz_idle = 0;
>   
>   	atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>   }
>   
>   void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>   static void set_cpu_sd_state_idle(int cpu)
>   {
>   	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>   
>   	if (!sd || sd->nohz_idle)
> -		goto unlock;
> +		return;
>   	sd->nohz_idle = 1;
>   
>   	atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>   }
>   
>   /*

I was looking at other users of sd_llc_shared, i.e. test_idle_cores() and
set_idle_cores(). They use rcu_dereference_all(), so callers need not take
rcu_read_lock()/rcu_read_unlock() if IRQs or preemption are disabled.

One more place would be update_idle_core(). I think it is called with
interrupts disabled in the __schedule() path.

And in sched_ext there is scx_idle_update_selcpu_topology(). It seems to be
tied to CPU hotplug, so by the same logic (cpus_write_lock() held) one could
remove the redundant rcu_read_lock() there as well.

No?


* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
@ 2026-05-15 10:05   ` Shrikanth Hegde
  2026-05-16  5:58     ` [PATCH v2 " Andrea Righi
  2026-07-03 10:22   ` kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE Breno Leitao
  2 siblings, 1 reply; 47+ messages in thread
From: Shrikanth Hegde @ 2026-05-15 10:05 UTC (permalink / raw)
  To: Andrea Righi, K Prateek Nayak
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot



On 5/9/26 11:37 PM, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
> 
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
> 
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
> 
> Fall back to attaching the shared object to sd_llc in three cases:
> 
>    1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
> 
>    2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>       island: has_asym is system-wide but those CPUs have no
>       SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>       the symmetric LLC path in select_idle_sibling();
> 
>    3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>       SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>       blob off cpumask_first(span), which on overlapping NUMA domains
>       would alias unrelated spans onto the same blob. Keep the shared
>       object on the LLC there; select_idle_capacity() gracefully skips
>       the has_idle_cores preference when sd->shared is NULL.
> 
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c     | 19 ++++++---
>   kernel/sched/sched.h    |  2 +-
>   kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
>   3 files changed, 95 insertions(+), 21 deletions(-)
> 
nit: There is still a comment in fair.c that references the old name. Please
fix that as well:
"sd_llc->shared->has_idle_cores and enabled through update_idle_core() above."


* Re: [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
  2026-05-11 13:07   ` Vincent Guittot
@ 2026-05-15 10:09   ` Shrikanth Hegde
  2026-05-16  9:04     ` Andrea Righi
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
  2 siblings, 1 reply; 47+ messages in thread
From: Shrikanth Hegde @ 2026-05-15 10:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot



On 5/9/26 11:37 PM, Andrea Righi wrote:
> When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
> capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
> busy: the core does not deliver its full nominal capacity.
> 
> If SMT is active and dst_cpu is not on a fully idle core, skip this
> destination so we do not migrate a misfit expecting a capacity upgrade we
> cannot actually provide.
> 
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>   kernel/sched/fair.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6f0835c15ee11..2ddba8bd27e59 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9693,6 +9693,7 @@ struct lb_env {
>   
>   	int			dst_cpu;
>   	struct rq		*dst_rq;
> +	bool			dst_core_idle;
>   
>   	struct cpumask		*dst_grpmask;
>   	int			new_dst_cpu;
> @@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	 * We can use max_capacity here as reduction in capacity on some
>   	 * CPUs in the group should either be possible to resolve
>   	 * internally or be covered by avg_load imbalance (eventually).
> +	 *
> +	 * When SMT is active, only pull a misfit to dst_cpu if it is on a
> +	 * fully idle core; otherwise the effective capacity of the core is
> +	 * reduced and we may not actually provide more capacity than the
> +	 * source.
>   	 */
>   	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
>   	    (sgs->group_type == group_misfit_task) &&
> -	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> +	    (!env->dst_core_idle ||
> +	     !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
>   	     sds->local_stat.group_type != group_has_spare))
>   		return false;
>   
> @@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>   	unsigned long sum_util = 0;
>   	bool sg_overloaded = 0, sg_overutilized = 0;
>   
> +	env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
> +
>   	do {
>   		struct sg_lb_stats *sgs = &tmp_sgs;
>   		int local_group;


This is kind of similar to what ASYM_PACKING would have done at the MC domain
with equal CPU capacities, i.e. pull the load if the core is idle.

In the table in your cover letter, does "NO ASYM + SIS_UTIL + ASYM_PACKING (at MC)"
come close to "ASYM + SMT + SIS_UTIL"?


* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-15  6:49   ` Shrikanth Hegde
@ 2026-05-16  5:45     ` Andrea Righi
  2026-05-16 17:15       ` Shrikanth Hegde
  0 siblings, 1 reply; 47+ messages in thread
From: Andrea Righi @ 2026-05-16  5:45 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel

Hi Shrikanth,

On Fri, May 15, 2026 at 12:19:16PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 5/9/26 11:37 PM, Andrea Righi wrote:
> > nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> > called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> > additional rcu_read_lock/unlock() used around sched_domain accesses in
> > this path is redundant. Rely on the existing IRQ-disabled context (and
> > the rcu_dereference_all() checking) instead.
> > 
> > The same applies to set_cpu_sd_state_idle(), called from the idle entry
> > path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> > nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> > disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> > teardown, which runs under cpus_write_lock(), so it cannot race with
> > sched-domain rebuilds). In both cases the rcu_dereference_all()
> > validation is sufficient.
> > 
> > No functional change intended.
> > 
> 
> For this patch, a few more comments below.
> 
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> 
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> 
> 
> > @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
> >   static void set_cpu_sd_state_busy(int cpu)
> >   {
> >   	struct sched_domain *sd;
> > -
> > -	rcu_read_lock();
> >   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
> >   	if (!sd || !sd->nohz_idle)
> > -		goto unlock;
> > +		return;
> >   	sd->nohz_idle = 0;
> >   	atomic_inc(&sd->shared->nr_busy_cpus);
> > -unlock:
> > -	rcu_read_unlock();
> >   }
> >   void nohz_balance_exit_idle(struct rq *rq)
> > @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
> >   static void set_cpu_sd_state_idle(int cpu)
> >   {
> >   	struct sched_domain *sd;
> > -
> > -	rcu_read_lock();
> >   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
> >   	if (!sd || sd->nohz_idle)
> > -		goto unlock;
> > +		return;
> >   	sd->nohz_idle = 1;
> >   	atomic_dec(&sd->shared->nr_busy_cpus);
> > -unlock:
> > -	rcu_read_unlock();
> >   }
> >   /*
> 
> I was looking at other users of sd_llc_shared, i.e. test_idle_cores() and
> set_idle_cores(). They use rcu_dereference_all(), so callers need not take
> rcu_read_lock()/rcu_read_unlock() if IRQs or preemption are disabled.
> 
> One more place would be update_idle_core(). I think it is called with
> interrupts disabled in the __schedule() path.

Good point, __update_idle_core() is reached from set_next_task_idle() via
pick_next_task() in __schedule(), and __schedule() disables IRQs before that
path.

Since set_idle_cores()/test_idle_cores() use rcu_dereference_all(), the
rcu_read_lock()/rcu_read_unlock() pair in __update_idle_core() is indeed
redundant. I can send a follow-up patch for this.

> 
> And in sched_ext there is scx_idle_update_selcpu_topology(). It seems to be
> tied to CPU hotplug, so by the same logic (cpus_write_lock() held) one could
> remove the redundant rcu_read_lock() there as well.
> 
> No?

For scx_idle_update_selcpu_topology() it's a bit more nuanced, unless I'm
missing something:
 - the helpers it uses (llc_weight/llc_span/numa_weight/numa_span) use plain
   rcu_dereference(), so simply dropping rcu_read_lock() in the caller would
   trip the lockdep check. They'd need to be converted to rcu_dereference_all()
   first;
 - the two call sites have different protection:
    - handle_hotplug() runs from a CPU hotplug callback, so cpus_write_lock()
      is held, which serializes against sched-domain rebuilds;
    - scx_enable() only holds cpus_read_lock(), which doesn't on its own
      prevent cpuset sched-domain rebuilds (those run under cpus_read_lock()
      too).

I think this one needs a separate, more careful patch. Maybe we should keep this
series scoped to the NOHZ kick path and address those as follow-ups?
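
To make the rcu_dereference() vs. rcu_dereference_all() difference concrete,
here is a rough userspace model of the two lockdep conditions. This is only a
sketch and not kernel code: struct ctx_model and the model_*() helpers are
made-up names, and the conditions are simplified to what is discussed in this
thread (an explicit rcu_read_lock(), or an IRQ/preempt-disabled context).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the calling context; not a kernel structure. */
struct ctx_model {
	bool rcu_read_locked;	/* inside rcu_read_lock()/rcu_read_unlock() */
	bool irqs_disabled;	/* e.g. sched_tick() or the __schedule() path */
	bool preempt_disabled;
};

/* Simplified model of the check behind plain rcu_dereference(). */
static bool model_rcu_dereference_ok(const struct ctx_model *c)
{
	assert(c);
	return c->rcu_read_locked;
}

/*
 * Simplified model of the check behind rcu_dereference_all() as relied
 * upon in this series: an explicit read lock, or a context with IRQs or
 * preemption disabled, is accepted.
 */
static bool model_rcu_dereference_all_ok(const struct ctx_model *c)
{
	assert(c);
	return c->rcu_read_locked || c->irqs_disabled || c->preempt_disabled;
}
```

In this model a caller with only IRQs disabled passes the
rcu_dereference_all() check but not the rcu_dereference() one, which is why
the sched_ext helpers would have to be converted before the lock can go.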

Thanks,
-Andrea


* [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-15 10:05   ` Shrikanth Hegde
@ 2026-05-16  5:58     ` Andrea Righi
  2026-05-16 17:19       ` Shrikanth Hegde
                         ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-16  5:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

From: K Prateek Nayak <kprateek.nayak@amd.com>

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint however lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain; nr_busy_cpus also lives
in the same shared sched_domain data, but it's never used in the asym
CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.

Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changes in v2:
 - update the comment referencing the old sd_llc->shared->has_idle_cores
   (Shrikanth Hegde)
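
For reviewers, the placement rule from the changelog can be summarized with a
small userspace model. It is only an illustration under simplifying
assumptions: struct topo_model, enum shared_owner and model_shared_owner() are
hypothetical names that do not exist in the kernel, and the whole domain
hierarchy is reduced to two booleans per CPU.

```c
#include <assert.h>
#include <stdbool.h>

enum shared_owner { SHARED_ON_LLC, SHARED_ON_ASYM };

/* Hypothetical per-CPU topology summary; not a kernel structure. */
struct topo_model {
	/* Some ancestor domain carries SD_ASYM_CPUCAPACITY_FULL. */
	bool has_asym_full_ancestor;
	/* That ancestor is an overlapping domain built from SD_NUMA. */
	bool asym_ancestor_is_numa;
};

/*
 * Decide which domain owns sched_domain_shared for a CPU. Symmetric
 * systems and symmetric cpuset islands have no FULL ancestor, and a
 * NUMA-built asym domain is skipped, so all three fall back to the LLC.
 */
static enum shared_owner model_shared_owner(const struct topo_model *t)
{
	assert(t);
	if (t->has_asym_full_ancestor && !t->asym_ancestor_is_numa)
		return SHARED_ON_ASYM;

	return SHARED_ON_LLC;
}
```

Only the non-overlapping asym case returns SHARED_ON_ASYM; every other
combination keeps the shared object on the LLC domain.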

 kernel/sched/fair.c     | 22 ++++++----
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 95 +++++++++++++++++++++++++++++++++++------
 3 files changed, 97 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b059ee80b631..4ef028605cecf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7819,7 +7819,7 @@ static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		WRITE_ONCE(sds->has_idle_cores, val);
 }
@@ -7828,7 +7828,7 @@ static inline bool test_idle_cores(int cpu)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		return READ_ONCE(sds->has_idle_cores);
 
@@ -7837,7 +7837,7 @@ static inline bool test_idle_cores(int cpu)
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_balance_shared->has_idle_cores.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -7867,7 +7867,8 @@ void __update_idle_core(struct rq *rq)
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * sd_balance_shared->has_idle_cores and enabled through update_idle_core()
+ * above.
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
@@ -7954,7 +7955,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && sd->shared) {
 		/*
 		 * Increment because !--nr is the condition to stop scan.
 		 *
@@ -12834,7 +12835,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds) {
 		/*
 		 * If there is an imbalance between LLC domains (IOW we could
@@ -12862,7 +12863,11 @@ static void set_cpu_sd_state_busy(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || !sd->nohz_idle)
+	/*
+	 * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
+	 * domain has no shared object there is nothing to clear or account.
+	 */
+	if (!sd || !sd->shared || !sd->nohz_idle)
 		return;
 	sd->nohz_idle = 0;
 
@@ -12887,7 +12892,8 @@ static void set_cpu_sd_state_idle(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || sd->nohz_idle)
+	/* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
+	if (!sd || !sd->shared || sd->nohz_idle)
 		return;
 	sd->nohz_idle = 1;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..330f5893c4561 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(int, sd_share_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d552..9bc4d11dd6a98 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+	/*
+	 * The shared object is attached to sd_asym_cpucapacity only when the
+	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+	 * On overlapping (NUMA) asym domains we fall back to letting the
+	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+	 * here.
+	 */
+	if (sd && sd->shared)
+		sds = sd->shared;
+
+	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
 	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 
-		/* If sd_llc exists, sd_llc_shared should exist too. */
-		WARN_ON_ONCE(!sd->shared);
-		sds = sd->shared;
+		/*
+		 * If sd_asym_cpucapacity didn't claim the shared object,
+		 * sd_llc must have one linked.
+		 */
+		if (!sds) {
+			WARN_ON_ONCE(!sd->shared);
+			sds = sd->shared;
+		}
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
-	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
@@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
-	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
-	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
 }
 
 /*
@@ -2650,6 +2665,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+	int sd_id = cpumask_first(sched_domain_span(sd));
+
+	sd->shared = *per_cpu_ptr(d->sds, sd_id);
+	/*
+	 * nr_busy_cpus is consumed only by the NOHZ kick path via
+	 * sd_balance_shared; on the asym-capacity path it is initialized but
+	 * never read.
+	 */
+	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+	atomic_inc(&sd->shared->ref);
+}
+
+/*
+ * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
+ * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
+ * not an overlapping NUMA-built domain (then LLC should claim shared).
+ *
+ * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
+ * then LLC must claim shared instead.
+ *
+ * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
+ * are present in the domain span, so the asym domain we attach to cannot
+ * degenerate into a single-capacity group. The relevant edge cases are instead
+ * covered by the caveats above.
+ *
+ * Return true if this CPU's asym path claimed sd->shared, false otherwise.
+ */
+static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
+	struct sched_domain *sd_asym;
+
+	if (!sd)
+		return false;
+
+	sd_asym = sd;
+	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+		sd_asym = sd_asym->parent;
+
+	if (!sd_asym || (sd_asym->flags & SD_NUMA))
+		return false;
+
+	init_sched_domain_shared(d, sd_asym);
+	return true;
+}
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -2708,20 +2771,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 
 	for_each_cpu(i, cpu_map) {
+		bool asym_claimed = false;
+
 		sd = *per_cpu_ptr(d.sd, i);
 		if (!sd)
 			continue;
 
+		if (has_asym)
+			asym_claimed = claim_asym_sched_domain_shared(&d, i);
+
 		/* First, find the topmost SD_SHARE_LLC domain */
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			int sd_id = cpumask_first(sched_domain_span(sd));
-
-			sd->shared = *per_cpu_ptr(d.sds, sd_id);
-			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
-			atomic_inc(&sd->shared->ref);
+			/*
+			 * Initialize the sd->shared for SD_SHARE_LLC unless
+			 * the asym path above already claimed it.
+			 */
+			if (!asym_claimed)
+				init_sched_domain_shared(&d, sd);
 
 			/*
 			 * In presence of higher domains, adjust the
-- 
2.54.0



* Re: [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-05-15 10:09   ` Shrikanth Hegde
@ 2026-05-16  9:04     ` Andrea Righi
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-16  9:04 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot

Hi Shrikanth,

On Fri, May 15, 2026 at 03:39:55PM +0530, Shrikanth Hegde wrote:
> On 5/9/26 11:37 PM, Andrea Righi wrote:
> > When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
> > capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
> > busy: the core does not deliver its full nominal capacity.
> > 
> > If SMT is active and dst_cpu is not on a fully idle core, skip this
> > destination so we do not migrate a misfit expecting a capacity upgrade we
> > cannot actually provide.
> > 
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >   kernel/sched/fair.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6f0835c15ee11..2ddba8bd27e59 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9693,6 +9693,7 @@ struct lb_env {
> >   	int			dst_cpu;
> >   	struct rq		*dst_rq;
> > +	bool			dst_core_idle;
> >   	struct cpumask		*dst_grpmask;
> >   	int			new_dst_cpu;
> > @@ -10918,10 +10919,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >   	 * We can use max_capacity here as reduction in capacity on some
> >   	 * CPUs in the group should either be possible to resolve
> >   	 * internally or be covered by avg_load imbalance (eventually).
> > +	 *
> > +	 * When SMT is active, only pull a misfit to dst_cpu if it is on a
> > +	 * fully idle core; otherwise the effective capacity of the core is
> > +	 * reduced and we may not actually provide more capacity than the
> > +	 * source.
> >   	 */
> >   	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> >   	    (sgs->group_type == group_misfit_task) &&
> > -	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> > +	    (!env->dst_core_idle ||
> > +	     !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
> >   	     sds->local_stat.group_type != group_has_spare))
> >   		return false;
> > @@ -11485,6 +11492,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >   	unsigned long sum_util = 0;
> >   	bool sg_overloaded = 0, sg_overutilized = 0;
> > +	env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
> > +
> >   	do {
> >   		struct sg_lb_stats *sgs = &tmp_sgs;
> >   		int local_group;
> 
> 
> This is kind of similar to what ASYM_PACKING would have done at the MC domain
> with equal CPU capacities, i.e. pull the load if the core is idle.

I think that's right: semantically, "only pull a misfit to dst_cpu if its core
is idle" is essentially the same heuristic that SD_ASYM_PACKING ends up
applying at MC: prefer destinations on cores that can actually deliver their
nominal capacity. With equal per-CPU priorities the asym_packing path collapses
to "prefer the idle core", which is essentially what this patch enforces for
the misfit case.
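
As a rough illustration, the rejection test in update_sd_pick_busiest() after
this patch can be modeled in userspace as below. This is a sketch, not the
kernel code: struct pull_model and the model_*() names are hypothetical, and
the roughly 5% margin in model_capacity_greater() is my assumption about what
capacity_greater() does.

```c
#include <assert.h>
#include <stdbool.h>

enum model_group_type { MODEL_HAS_SPARE, MODEL_MISFIT_TASK, MODEL_OVERLOADED };

/* Hypothetical inputs to the misfit pull decision; not a kernel structure. */
struct pull_model {
	bool sd_asym;			/* domain has SD_ASYM_CPUCAPACITY */
	bool smt_active;
	bool dst_siblings_idle;		/* all SMT siblings of dst_cpu idle */
	unsigned long dst_capacity;
	unsigned long src_max_capacity;
	enum model_group_type src_type;
	enum model_group_type local_type;
};

/* Assumed margin of about 5%, modeled on capacity_greater(). */
static bool model_capacity_greater(unsigned long a, unsigned long b)
{
	return a * 1024 > b * 1078;
}

/* Return true when the misfit group must not be picked as busiest. */
static bool model_reject_misfit_pull(const struct pull_model *m)
{
	bool dst_core_idle;

	assert(m);
	/* Without SMT every destination counts as a fully idle core. */
	dst_core_idle = !m->smt_active || m->dst_siblings_idle;

	if (!m->sd_asym || m->src_type != MODEL_MISFIT_TASK)
		return false;

	return !dst_core_idle ||
	       !model_capacity_greater(m->dst_capacity, m->src_max_capacity) ||
	       m->local_type != MODEL_HAS_SPARE;
}
```

With SMT active and a busy sibling on dst_cpu, the pull is rejected even when
dst_capacity is nominally much larger than the source group's max capacity.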

> 
> In the table in your cover letter, does "NO ASYM + SIS_UTIL + ASYM_PACKING (at MC)"
> come close to "ASYM + SMT + SIS_UTIL"?

Christian already explored the "NO ASYM_CPUCAPACITY + SD_ASYM_PACKING" idea
(https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com).

I gave it a spin on Vera at the time. Summarizing the numbers I reported on that
thread (all vs. baseline = default SD_ASYM_CPUCAPACITY, no SMT awareness, on my
CPU-bound workload):
 - SD_ASYM_PACKING at MC (Christian's RFC):    ~1.5x speedup
 - equalize capacities within +/-5% (NO_ASYM): ~1.6x speedup
 - SMT-aware SD_ASYM_CPUCAPACITY (PATCH 3/5):  ~1.7x speedup

So SD_ASYM_PACKING seems to help, but not as much as the NO_ASYM variant (even
if it's pretty close) or this series.

I think the structural reason is that ASYM_PACKING at MC only fixes destination
selection in load balance; it doesn't change select_idle_capacity() /
asym_fits_cpu() on the wakeup path, which is where I think most of the
placement decisions actually happen in this case.
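
For reference, the wakeup-side gate added by this series has roughly the
following shape. This is a userspace sketch only: struct cpu_model and
model_asym_fits_cpu() are hypothetical names, and the util_fits flag stands in
for the result of util_fits_cpu(util, util_min, util_max, cpu) > 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU view for illustration; not a kernel structure. */
struct cpu_model {
	bool core_idle;		/* every SMT sibling of this CPU is idle */
	bool util_fits;		/* stands in for util_fits_cpu(...) > 0 */
};

/* Mirrors the shape of the patched asym_fits_cpu() check. */
static bool model_asym_fits_cpu(bool asym_active, bool smt_active,
				const struct cpu_model *cpu)
{
	assert(cpu);
	/* On symmetric systems every CPU is considered a fit. */
	if (!asym_active)
		return true;

	/* With SMT active, a busy sibling disqualifies the CPU. */
	return (!smt_active || cpu->core_idle) && cpu->util_fits;
}
```

A CPU whose utilization fits but whose sibling is busy is rejected only in the
ASYM + SMT combination, which is the case load-balance-only approaches such as
ASYM_PACKING at MC do not cover on the wakeup path.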

Thanks,
-Andrea


* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-16  5:45     ` Andrea Righi
@ 2026-05-16 17:15       ` Shrikanth Hegde
  0 siblings, 0 replies; 47+ messages in thread
From: Shrikanth Hegde @ 2026-05-16 17:15 UTC (permalink / raw)
  To: Andrea Righi
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel



On 5/16/26 11:15 AM, Andrea Righi wrote:
> Hi Shrikanth,
> 
> On Fri, May 15, 2026 at 12:19:16PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 5/9/26 11:37 PM, Andrea Righi wrote:
>>> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
>>> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
>>> additional rcu_read_lock/unlock() used around sched_domain accesses in
>>> this path is redundant. Rely on the existing IRQ-disabled context (and
>>> the rcu_dereference_all() checking) instead.
>>>
>>> The same applies to set_cpu_sd_state_idle(), called from the idle entry
>>> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
>>> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
>>> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
>>> teardown, which runs under cpus_write_lock(), so it cannot race with
>>> sched-domain rebuilds). In both cases the rcu_dereference_all()
>>> validation is sufficient.
>>>
>>> No functional change intended.
>>>
>>
>> For this patch, few more comments below.
>>
>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Signed-off-by: Andrea Righi <arighi@nvidia.com>
>>
>>
>>> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>>>    static void set_cpu_sd_state_busy(int cpu)
>>>    {
>>>    	struct sched_domain *sd;
>>> -
>>> -	rcu_read_lock();
>>>    	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>>>    	if (!sd || !sd->nohz_idle)
>>> -		goto unlock;
>>> +		return;
>>>    	sd->nohz_idle = 0;
>>>    	atomic_inc(&sd->shared->nr_busy_cpus);
>>> -unlock:
>>> -	rcu_read_unlock();
>>>    }
>>>    void nohz_balance_exit_idle(struct rq *rq)
>>> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>>>    static void set_cpu_sd_state_idle(int cpu)
>>>    {
>>>    	struct sched_domain *sd;
>>> -
>>> -	rcu_read_lock();
>>>    	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>>>    	if (!sd || sd->nohz_idle)
>>> -		goto unlock;
>>> +		return;
>>>    	sd->nohz_idle = 1;
>>>    	atomic_dec(&sd->shared->nr_busy_cpus);
>>> -unlock:
>>> -	rcu_read_unlock();
>>>    }
>>>    /*
>>
>> I was looking at other users of sd_llc, i.e. test_idle_cores() and set_idle_cores().
>> They use rcu_dereference_all(), so callers need not call rcu_read_lock/unlock()
>> if IRQs or preemption are disabled.
>>
>> One more place would be update_idle_core(). I think it is called with interrupts
>> disabled in the __schedule() path.
> 
> Good point, __update_idle_core() is reached from set_next_task_idle() via
> pick_next_task() in __schedule(), and __schedule() disables IRQs before that
> path.
> 
> Since set_idle_cores()/test_idle_cores() use rcu_dereference_all(), the
> rcu_read_lock/unlock() pair in __update_idle_core() is indeed redundant. I can
> send a follow-up patch for this.
> 

Thanks.

>>
>> And in sched_ext, scx_idle_update_selcpu_topology() seems to be tied to CPU hotplug, and
>> by the same logic of cpus_write_lock() being held, one could remove the redundant
>> rcu_read_lock() there as well.
>>
>> No?
> 
> For scx_idle_update_selcpu_topology() it's a bit more nuanced, if I'm not
> missing anything:
>   - the helpers it uses (llc_weight/llc_span/numa_weight/numa_span) use plain
>     rcu_dereference(), so simply dropping rcu_read_lock() in the caller would
>     trip the lockdep check. They'd need to be converted to rcu_dereference_all()
>     first;
>   - the two call sites have different protection:
>      - handle_hotplug() runs from a CPU hotplug callback, so cpus_write_lock()
>        is held, which serializes against sched-domain rebuilds;
>      - scx_enable() only holds cpus_read_lock(), which doesn't on
>        its own prevent cpuset sched-domain rebuilds (those run under
>        cpus_read_lock() too).
> 
> I think this one needs a separate, more careful patch. Maybe we should keep this
> series scoped to the NOHZ kick path and address those as follow-ups?
> 
> Thanks,
> -Andrea

Yes. That makes sense.


* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-16  5:58     ` [PATCH v2 " Andrea Righi
@ 2026-05-16 17:19       ` Shrikanth Hegde
  2026-05-18 20:58       ` Peter Zijlstra
  2026-05-20  8:34       ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2 siblings, 0 replies; 47+ messages in thread
From: Shrikanth Hegde @ 2026-05-16 17:19 UTC (permalink / raw)
  To: Andrea Righi, K Prateek Nayak
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, linux-kernel,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot



On 5/16/26 11:28 AM, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
> 
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
> 
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
> 
> Fall back to attaching the shared object to sd_llc in three cases:
> 
>    1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
> 
>    2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>       island: has_asym is system-wide but those CPUs have no
>       SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>       the symmetric LLC path in select_idle_sibling();
> 
>    3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>       SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>       blob off cpumask_first(span), which on overlapping NUMA domains
>       would alias unrelated spans onto the same blob. Keep the shared
>       object on the LLC there; select_idle_capacity() gracefully skips
>       the has_idle_cores preference when sd->shared is NULL.
> 
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
> 
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changes in v2:
>   - update comment referencing to the old sd_llc->shared->has_idle_cores
>     (Shrikanth Hegde)
> 

Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>


* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-16  5:58     ` [PATCH v2 " Andrea Righi
  2026-05-16 17:19       ` Shrikanth Hegde
@ 2026-05-18 20:58       ` Peter Zijlstra
  2026-05-18 21:31         ` Andrea Righi
  2026-05-19  5:52         ` K Prateek Nayak
  2026-05-20  8:34       ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
  2 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2026-05-18 20:58 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

On Sat, May 16, 2026 at 07:58:50AM +0200, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
> 
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
> 
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
> 
> Fall back to attaching the shared object to sd_llc in three cases:
> 
>   1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
> 
>   2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>      island: has_asym is system-wide but those CPUs have no
>      SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>      the symmetric LLC path in select_idle_sibling();
> 
>   3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>      SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>      blob off cpumask_first(span), which on overlapping NUMA domains
>      would alias unrelated spans onto the same blob. Keep the shared
>      object on the LLC there; select_idle_capacity() gracefully skips
>      the has_idle_cores preference when sd->shared is NULL.
> 
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
> 
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changes in v2:
>  - update comment referencing to the old sd_llc->shared->has_idle_cores
>    (Shrikanth Hegde)

Right, so I just merged a branch that has this series with a branch that
has the cache aware load balancing stuff on, and the result ain't
pretty.

That cache aware thing really wants sd_llc_shared. Now, I imagine that
for now the intersection between ASYM and SCHED_CACHE is not that
interesting, but at the same time, I'm fairly sure that is something
people will end up looking at.

For now, I've stomped on things and the merge holds the below. It
builds, not tested much beyond that.

I've pushed out the whole pile into queue/sched/core.

diff --cc kernel/sched/topology.c
index f96d50131495,e47a3f72eb72..000000000000
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@@ -663,9 -670,9 +670,10 @@@ static void destroy_sched_domains(struc
   */
  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
  DEFINE_PER_CPU(int, sd_llc_size);
- DEFINE_PER_CPU(int, sd_llc_id);
+ DEFINE_PER_CPU(int, sd_llc_id) = -1;
  DEFINE_PER_CPU(int, sd_share_id);
+ DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 +DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@@ -729,6 -717,9 +718,20 @@@ static void update_top_cache_domain(in
  
  	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
  	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
+ 
+ 	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
++	/*
++	 * The shared object is attached to sd_asym_cpucapacity only when the
++	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
++	 * On overlapping (NUMA) asym domains we fall back to letting the
++	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
++	 * here.
++	 */
++	if (sd && sd->shared)
++		sds = sd->shared;
++
+ 	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
++	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
  }
  
  /*
@@@ -2663,54 -2906,61 +2916,109 @@@ static void adjust_numa_imbalance(struc
  	}
  }
  
 +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
 +{
 +	int sd_id = cpumask_first(sched_domain_span(sd));
 +
 +	sd->shared = *per_cpu_ptr(d->sds, sd_id);
 +	/*
 +	 * nr_busy_cpus is consumed only by the NOHZ kick path via
 +	 * sd_balance_shared; on the asym-capacity path it is initialized but
 +	 * never read.
 +	 */
 +	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
 +	atomic_inc(&sd->shared->ref);
 +}
 +
 +/*
 + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
 + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
 + * not an overlapping NUMA-built domain (then LLC should claim shared).
 + *
 + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
 + * then LLC must claim shared instead.
 + *
 + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
 + * are present in the domain span, so the asym domain we attach to cannot
 + * degenerate into a single-capacity group. The relevant edge cases are instead
 + * covered by the caveats above.
 + *
 + * Return true if this CPU's asym path claimed sd->shared, false otherwise.
 + */
 +static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
 +{
 +	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
 +	struct sched_domain *sd_asym;
 +
 +	if (!sd)
 +		return false;
 +
 +	sd_asym = sd;
 +	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
 +		sd_asym = sd_asym->parent;
 +
 +	if (!sd_asym || (sd_asym->flags & SD_NUMA))
 +		return false;
 +
 +	init_sched_domain_shared(d, sd_asym);
 +	return true;
 +}
 +
+ static int __sched_domains_alloc_llc_id(void)
+ {
+ 	int lid, max;
+ 
+ 	lockdep_assert_held(&sched_domains_mutex);
+ 
+ 	lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
+ 	/*
+ 	 * llc_id space should never grow larger than the
+ 	 * possible number of CPUs in the system.
+ 	 */
+ 	if (lid >= nr_cpu_ids)
+ 		return -1;
+ 
+ 	__cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
+ 	max = cpumask_last(sched_domains_llc_id_allocmask);
+ 	if (max > max_lid)
+ 		max_lid = max;
+ 
+ 	return lid;
+ }
+ 
+ static void __sched_domains_free_llc_id(int cpu)
+ {
+ 	int i, lid, max;
+ 
+ 	lockdep_assert_held(&sched_domains_mutex);
+ 
+ 	lid = per_cpu(sd_llc_id, cpu);
+ 	if (lid == -1 || lid >= nr_cpu_ids)
+ 		return;
+ 
+ 	per_cpu(sd_llc_id, cpu) = -1;
+ 
+ 	for_each_cpu(i, llc_mask(cpu)) {
+ 		/* An online CPU owns the llc_id. */
+ 		if (per_cpu(sd_llc_id, i) == lid)
+ 			return;
+ 	}
+ 
+ 	__cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
+ 
+ 	max = cpumask_last(sched_domains_llc_id_allocmask);
+ 	/* shrink max lid to save memory */
+ 	if (max < max_lid)
+ 		max_lid = max;
+ }
+ 
+ void sched_domains_free_llc_id(int cpu)
+ {
+ 	sched_domains_mutex_lock();
+ 	__sched_domains_free_llc_id(cpu);
+ 	sched_domains_mutex_unlock();
+ }
+ 
  /*
   * Build sched domains for a given set of CPUs and attach the sched domains
   * to the individual CPUs
@@@ -2775,20 -3049,16 +3107,15 @@@ build_sched_domains(const struct cpumas
  		if (!sd)
  			continue;
  
 +		if (has_asym)
- 			asym_claimed = claim_asym_sched_domain_shared(&d, i);
++			claim_asym_sched_domain_shared(&d, i);
 +
  		/* First, find the topmost SD_SHARE_LLC domain */
  		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
  			sd = sd->parent;
  
  		if (sd->flags & SD_SHARE_LLC) {
- 			/*
- 			 * Initialize the sd->shared for SD_SHARE_LLC unless
- 			 * the asym path above already claimed it.
- 			 */
- 			if (!asym_claimed)
- 				init_sched_domain_shared(&d, sd);
 -			int sd_id = cpumask_first(sched_domain_span(sd));
 -
 -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
 -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
 -			atomic_inc(&sd->shared->ref);
++			init_sched_domain_shared(&d, sd);
  
  			/*
  			 * In presence of higher domains, adjust the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9dd4a94801c9..300320b0248a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2191,6 +2191,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(int, sd_share_id);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);


* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-18 20:58       ` Peter Zijlstra
@ 2026-05-18 21:31         ` Andrea Righi
  2026-05-19  5:52         ` K Prateek Nayak
  1 sibling, 0 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-18 21:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

Hi Peter,

On Mon, May 18, 2026 at 10:58:59PM +0200, Peter Zijlstra wrote:
> On Sat, May 16, 2026 at 07:58:50AM +0200, Andrea Righi wrote:
...
> Right, so I just merged a branch that has this series with a branch that
> has the cache aware load balancing stuff on, and the result ain't
> pretty.
> 
> That cache aware thing really wants sd_llc_shared. Now, I imagine that
> for now the intersection between ASYM and SCHED_CACHE is not that
> interesting, but at the same time, I'm fairly sure that is something
> people will end up looking at.
> 
> For now, I've stomped on things and the merge holds the below. It
> builds, not tested much beyond that.
> 
> I've pushed out the whole pile into queue/sched/core.

Conceptually makes sense to me. IIUC cache-aware code necessarily needs per-LLC
util_avg/capacity, while the asym path needs has_idle_cores at asym span, so you
basically restored sd_llc_shared alongside sd_balance_shared.
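
As a rough user-space sketch of how I read that split (not the kernel code):
one pointer always stays LLC-scoped, while the balance pointer uses the asym
domain's shared object only when one was attached. The types and
update_shared_pointers() are hypothetical; that the balance pointer starts out
from the LLC object is my assumption, since the merge hunk only shows the asym
override.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sched_domain_shared / sched_domain. */
struct shared_model { int has_idle_cores; };
struct domain_model { struct shared_model *shared; };

/* Hypothetical per-CPU view of the two pointers discussed above. */
struct percpu_model {
	struct shared_model *llc_shared;	/* always LLC-scoped */
	struct shared_model *balance_shared;	/* asym-scoped when available */
};

/*
 * Model of the pointer selection: start from the LLC object (assumption),
 * then override with the asym domain's object if it has one. An asym domain
 * without a shared object (e.g. the overlapping NUMA case) keeps the LLC one.
 */
static void update_shared_pointers(struct percpu_model *pc,
				   const struct domain_model *llc,
				   const struct domain_model *asym)
{
	struct shared_model *sds = llc ? llc->shared : NULL;

	assert(pc != NULL);

	pc->llc_shared = sds;
	if (asym && asym->shared)
		sds = asym->shared;
	pc->balance_shared = sds;
}
```

In this model, cache-aware users would read llc_shared and the wakeup idle
scan would read balance_shared, so neither consumer changes the other's scope.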

I'll re-run my tests with your sched/core branch and report back.

Thanks!
-Andrea



* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-18 20:58       ` Peter Zijlstra
  2026-05-18 21:31         ` Andrea Righi
@ 2026-05-19  5:52         ` K Prateek Nayak
  2026-05-19  6:43           ` Andrea Righi
  2026-05-19  8:46           ` Peter Zijlstra
  1 sibling, 2 replies; 47+ messages in thread
From: K Prateek Nayak @ 2026-05-19  5:52 UTC (permalink / raw)
  To: Peter Zijlstra, Andrea Righi
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Christian Loehle, Phil Auld, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel,
	tim.c.chen, yu.c.chen

Hello Peter, Andrea,

On 5/19/2026 2:28 AM, Peter Zijlstra wrote:
> @@@ -2775,20 -3049,16 +3107,15 @@@ build_sched_domains(const struct cpumas
>   		if (!sd)
>   			continue;
>   
>  +		if (has_asym)
> - 			asym_claimed = claim_asym_sched_domain_shared(&d, i);
> ++			claim_asym_sched_domain_shared(&d, i);
>  +
>   		/* First, find the topmost SD_SHARE_LLC domain */
>   		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
>   			sd = sd->parent;
>   
>   		if (sd->flags & SD_SHARE_LLC) {
> - 			/*
> - 			 * Initialize the sd->shared for SD_SHARE_LLC unless
> - 			 * the asym path above already claimed it.
> - 			 */
> - 			if (!asym_claimed)
> - 				init_sched_domain_shared(&d, sd);
>  -			int sd_id = cpumask_first(sched_domain_span(sd));
>  -
>  -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
>  -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>  -			atomic_inc(&sd->shared->ref);
> ++			init_sched_domain_shared(&d, sd);

This will run into a small problem with "nr_idle_scan" if
cpumask_first(sched_domain_span(sd)) is the same for both sd_asym and
sd_llc.

Load balancer at different domains will populate "nr_idle_scan" with
different values and they alias to same ->shared if one isn't
degenerated and I believe there is at least one way to hit the WARN_ON()
from cpu_attach_domain() if the SD_ASYM_CPUCAPACITY_FULL comes before
the last SD_SHARE_LLC domain and the latter is degenerated.

How about this:

  (On top of queue:sched/core; Lightly tested on !ASYM_CPUCAPACITY system)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fe09d3268bc9..1d2c98dca211 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -67,7 +67,15 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-	int		nr_idle_scan;
+	union {
+		int	nr_idle_scan;
+		/*
+		 * Used during allocation to claim the
+		 * sched_domain_shared object at
+		 * multiple levels.
+		 */
+		int	alloc_flags;
+	};
 #ifdef CONFIG_SCHED_CACHE
 	unsigned long	util_avg;
 	unsigned long	capacity;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dbfd9657f897..9ebd14652e9d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -623,6 +623,12 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
 	} while (sg != first);
 }
 
+static void free_sched_domain_shared(struct sched_domain_shared *sds)
+{
+	if (sds && atomic_dec_and_test(&sds->ref))
+		kfree(sds);
+}
+
 static void destroy_sched_domain(struct sched_domain *sd)
 {
 	/*
@@ -631,9 +637,7 @@ static void destroy_sched_domain(struct sched_domain *sd)
 	 * dropping group/capacity references, freeing where none remain.
 	 */
 	free_sched_groups(sd->groups, 1);
-
-	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
-		kfree(sd->shared);
+	free_sched_domain_shared(sd->shared);
 
 #ifdef CONFIG_SCHED_CACHE
 	/* only the bottom sd has llc_counts array */
@@ -755,7 +759,14 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 
 			/* Pick reference to parent->shared. */
 			if (parent->shared) {
-				WARN_ON_ONCE(tmp->shared);
+				/*
+				 * It is safe to free a sd->shared that
+				 * has not been published yet. If a
+				 * sd->shared was published, the refcount
+				 * will end up being non-zero and it will
+				 * not be freed here.
+				 */
+				free_sched_domain_shared(tmp->shared);
 				tmp->shared = parent->shared;
 				parent->shared = NULL;
 			}
@@ -2916,11 +2927,40 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
-static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+static void
+init_sched_domain_shared(struct s_data *d, struct sched_domain *sd, int flags)
 {
-	int sd_id = cpumask_first(sched_domain_span(sd));
+	int cpu;
+
+	/*
+	 * Multiple domains can try to claim a shared object like
+	 * SD_ASYM_CPUCAPACITY and SD_SHARE_LLC which can alias to
+	 * same cpumask_first(sched_domain_span(sd)) CPU and can
+	 * cause "nr_idle_scan" to be populated incorrectly during
+	 * load balancing.
+	 *
+	 * Find the first CPU in sched_domain_span(sd) with an
+	 * unclaimed domain (!alloc_flags) or where the alloc_flag
+	 * matches the requested flag (SD_* flag)
+	 */
+	for_each_cpu(cpu, sched_domain_span(sd)) {
+		struct sched_domain_shared *sds = *per_cpu_ptr(d->sds, cpu);
+
+		/*
+		 * If the domain only has single CPU, allow temporary overlap
+		 * in allocation since the domains will be degenerated anyways.
+		 */
+		if (!sds->alloc_flags ||
+		    sd->span_weight == 1 ||
+		    sds->alloc_flags == flags) {
+			sds->alloc_flags = flags;
+			sd->shared = sds;
+			break;
+		}
+	}
+
+	BUG_ON(!sd->shared);
 
-	sd->shared = *per_cpu_ptr(d->sds, sd_id);
 	/*
 	 * nr_busy_cpus is consumed only by the NOHZ kick path via
 	 * sd_balance_shared; on the asym-capacity path it is initialized but
@@ -2960,7 +3000,7 @@ static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
 	if (!sd_asym || (sd_asym->flags & SD_NUMA))
 		return false;
 
-	init_sched_domain_shared(d, sd_asym);
+	init_sched_domain_shared(d, sd_asym, SD_ASYM_CPUCAPACITY);
 	return true;
 }
 
@@ -3115,7 +3155,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			init_sched_domain_shared(&d, sd);
+			init_sched_domain_shared(&d, sd, SD_SHARE_LLC);
 
 			/*
 			 * In presence of higher domains, adjust the
-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19  5:52         ` K Prateek Nayak
@ 2026-05-19  6:43           ` Andrea Righi
  2026-05-19  7:47             ` K Prateek Nayak
  2026-05-19  8:46           ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Andrea Righi @ 2026-05-19  6:43 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

Hi Prateek,

On Tue, May 19, 2026 at 11:22:32AM +0530, K Prateek Nayak wrote:
> Hello Peter, Andrea,
> 
> On 5/19/2026 2:28 AM, Peter Zijlstra wrote:
> > @@@ -2775,20 -3049,16 +3107,15 @@@ build_sched_domains(const struct cpumas
> >   		if (!sd)
> >   			continue;
> >   
> >  +		if (has_asym)
> > - 			asym_claimed = claim_asym_sched_domain_shared(&d, i);
> > ++			claim_asym_sched_domain_shared(&d, i);
> >  +
> >   		/* First, find the topmost SD_SHARE_LLC domain */
> >   		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> >   			sd = sd->parent;
> >   
> >   		if (sd->flags & SD_SHARE_LLC) {
> > - 			/*
> > - 			 * Initialize the sd->shared for SD_SHARE_LLC unless
> > - 			 * the asym path above already claimed it.
> > - 			 */
> > - 			if (!asym_claimed)
> > - 				init_sched_domain_shared(&d, sd);
> >  -			int sd_id = cpumask_first(sched_domain_span(sd));
> >  -
> >  -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
> >  -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> >  -			atomic_inc(&sd->shared->ref);
> > ++			init_sched_domain_shared(&d, sd);
> 
> This will run into a small problem with "nr_idle_scan" if
> cpumask_first(sched_domain_span(sd)) is the same for both sd_asym and
> sd_llc.

Ah, good catch! When cpumask_first(asym_span) == cpumask_first(llc_span)
(big.LITTLE typical case), both sd_asym->shared and sd_llc->shared would alias
to d->sds[0].
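
To make the aliasing concrete, here is a minimal user-space sketch, not
kernel code. It assumes spans are plain bitmasks and models the old
"take the first CPU's object" rule; shared_model, domain_model,
span_first(), assign_first_cpu() and spans_alias() are hypothetical names
used only for illustration.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for struct sched_domain_shared. */
struct shared_model {
	int nr_idle_scan;
};

/* Hypothetical stand-in for a sched_domain: span is a bitmask of CPUs. */
struct domain_model {
	unsigned long span;
	struct shared_model *shared;
};

/* Index of the lowest set bit, modelling cpumask_first(); -1 when empty. */
static int span_first(unsigned long span)
{
	for (int cpu = 0; cpu < (int)(8 * sizeof(span)); cpu++)
		if (span & (1UL << cpu))
			return cpu;
	return -1;
}

/* Old scheme: always take the object of the first CPU in the span. */
static void assign_first_cpu(struct shared_model *sds, struct domain_model *sd)
{
	assert(sd->span != 0);	/* an empty span has no first CPU */
	sd->shared = &sds[span_first(sd->span)];
}

/* Returns 1 when two domains with these spans end up on the same object. */
static int spans_alias(unsigned long llc_span, unsigned long asym_span)
{
	struct shared_model sds[8 * sizeof(unsigned long)] = { { 0 } };
	struct domain_model llc = { .span = llc_span };
	struct domain_model asym = { .span = asym_span };

	assign_first_cpu(sds, &llc);
	assign_first_cpu(sds, &asym);
	return llc.shared == asym.shared;
}
```

In this model an LLC span of CPUs 0-3 (0x0f) and an asym span of CPUs 0-7
(0xff) both start at CPU 0, so spans_alias() reports that they share one
object. An LLC span of CPUs 4-7 (0xf0) does not alias with the same asym
span.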

> 
> Load balancer at different domains will populate "nr_idle_scan" with
> different values and they alias to same ->shared if one isn't
> degenerated and I believe there is at least one way to hit the WARN_ON()
> from cpu_attach_domain() if the SD_ASYM_CPUCAPACITY_FULL comes before
> the last SD_SHARE_LLC domain and the latter is degenerated.
> 
> How about this:
> 
>   (On top of queue:sched/core; Lightly tested on !ASYM_CPUCAPACITY system)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index fe09d3268bc9..1d2c98dca211 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -67,7 +67,15 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> -	int		nr_idle_scan;
> +	union {
> +		int	nr_idle_scan;
> +		/*
> +		 * Used during allocation to claim the
> +		 * sched_domain_shared object at
> +		 * multiple levels.

I think between build and the first LB tick, readers of nr_idle_scan may observe
leftover SD_* flags in nr_idle_scan. This shouldn't be a problem and should
self-heal soon, but maybe it's worth a comment? Something like:

  * Note: between build and the first periodic LB tick, which
  * rewrites the union via update_idle_cpu_scan(), readers of
  * nr_idle_scan may observe the transient SD_* flag value as
  * the scan bound. The flag bits are small positive integers,
  * so the effect is just a slightly relaxed scan bound for one
  * window and self-heals on the first tick.
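
A small user-space sketch of that window, assuming nothing beyond the
union layout proposed in the diff. The MODEL_SD_* values are made up (the
real SD_* constants are generated from the kernel's sd_flags.h), and
shared_model, model_claim(), model_scan_bound() and model_lb_tick() are
hypothetical names.

```c
#include <assert.h>

/* Hypothetical flag values; not the kernel's real SD_* constants. */
#define MODEL_SD_ASYM_CPUCAPACITY	(1 << 0)
#define MODEL_SD_SHARE_LLC		(1 << 1)

/* Hypothetical stand-in for struct sched_domain_shared. */
struct shared_model {
	int has_idle_cores;
	union {
		int nr_idle_scan;	/* runtime scan bound */
		int alloc_flags;	/* build-time claim marker */
	};
};

/* Build time: claim the object by storing the requesting SD_* flag. */
static void model_claim(struct shared_model *sds, int flags)
{
	assert(flags != 0);	/* 0 is reserved for "unclaimed" */
	sds->alloc_flags = flags;
}

/* Wakeup path: the scan bound as a reader of nr_idle_scan would see it. */
static int model_scan_bound(const struct shared_model *sds)
{
	return sds->nr_idle_scan;
}

/* First periodic load balance: overwrite the union with a real bound. */
static void model_lb_tick(struct shared_model *sds, int nr_idle_scan)
{
	sds->nr_idle_scan = nr_idle_scan;
}
```

After model_claim() and before model_lb_tick(), model_scan_bound()
returns the flag value; once the tick has run it returns the value the
load balancer stored.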

> +		 */
> +		int	alloc_flags;
> +	};
>  #ifdef CONFIG_SCHED_CACHE
>  	unsigned long	util_avg;
>  	unsigned long	capacity;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index dbfd9657f897..9ebd14652e9d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -623,6 +623,12 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
>  	} while (sg != first);
>  }
>  
> +static void free_sched_domain_shared(struct sched_domain_shared *sds)
> +{
> +	if (sds && atomic_dec_and_test(&sds->ref))
> +		kfree(sds);
> +}
> +
>  static void destroy_sched_domain(struct sched_domain *sd)
>  {
>  	/*
> @@ -631,9 +637,7 @@ static void destroy_sched_domain(struct sched_domain *sd)
>  	 * dropping group/capacity references, freeing where none remain.
>  	 */
>  	free_sched_groups(sd->groups, 1);
> -
> -	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
> -		kfree(sd->shared);
> +	free_sched_domain_shared(sd->shared);
>  
>  #ifdef CONFIG_SCHED_CACHE
>  	/* only the bottom sd has llc_counts array */
> @@ -755,7 +759,14 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
>  
>  			/* Pick reference to parent->shared. */
>  			if (parent->shared) {
> -				WARN_ON_ONCE(tmp->shared);
> +				/*
> +				 * It is safe to free a sd->shared that
> +				 * has not been published yet. If a
> +				 * sd->shared was published, the refcount
> +				 * will end up being non-zero and it will
> +				 * not be freed here.
> +				 */
> +				free_sched_domain_shared(tmp->shared);
>  				tmp->shared = parent->shared;
>  				parent->shared = NULL;
>  			}
> @@ -2916,11 +2927,40 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>  	}
>  }
>  
> -static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +static void
> +init_sched_domain_shared(struct s_data *d, struct sched_domain *sd, int flags)
>  {
> -	int sd_id = cpumask_first(sched_domain_span(sd));
> +	int cpu;
> +
> +	/*
> +	 * Multiple domains can try to claim a shared object like
> +	 * SD_ASYM_CPUCAPACITY and SD_SHARE_LLC which can alias to
> +	 * same cpumask_first(sched_domain_span(sd)) CPU and can
> +	 * cause "nr_idle_scan" to be populated incorrectly during
> +	 * load balncing.

nit: s/balncing/balancing/

> +	 *
> +	 * Find the first CPU in sched_domain_span(sd) with an
> +	 * unclaimed domain (!alloc_flags) or where the alloc_flag
> +	 * matches the requested flag (SD_* flag)
> +	 */
> +	for_each_cpu(cpu, sched_domain_span(sd)) {
> +		struct sched_domain_shared *sds = *per_cpu_ptr(d->sds, cpu);
> +
> +		/*
> +		 * If the domain only has single CPU, allow temporary overlap
> +		 * in allocation since the domains will be degenerated anyways.
> +		 */
> +		if (!sds->alloc_flags ||
> +		    sd->span_weight == 1 ||
> +		    sds->alloc_flags == flags) {
> +			sds->alloc_flags = flags;
> +			sd->shared = sds;
> +			break;
> +		}
> +	}
> +
> +	BUG_ON(!sd->shared);

Unreachable in practice, but should we have a WARN_ON_ONCE() +
bail/early-return? In this way we'd fall back to using LLC's shared for
sd_balance_shared, which seems nicer than a BUG_ON().

>  
> -	sd->shared = *per_cpu_ptr(d->sds, sd_id);
>  	/*
>  	 * nr_busy_cpus is consumed only by the NOHZ kick path via
>  	 * sd_balance_shared; on the asym-capacity path it is initialized but
> @@ -2960,7 +3000,7 @@ static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
>  	if (!sd_asym || (sd_asym->flags & SD_NUMA))
>  		return false;
>  
> -	init_sched_domain_shared(d, sd_asym);
> +	init_sched_domain_shared(d, sd_asym, SD_ASYM_CPUCAPACITY);
>  	return true;
>  }
>  
> @@ -3115,7 +3155,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  			sd = sd->parent;
>  
>  		if (sd->flags & SD_SHARE_LLC) {
> -			init_sched_domain_shared(&d, sd);
> +			init_sched_domain_shared(&d, sd, SD_SHARE_LLC);
>  
>  			/*
>  			 * In presence of higher domains, adjust the
> -- 
> Thanks and Regards,
> Prateek
> 

Thanks,
-Andrea


* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19  6:43           ` Andrea Righi
@ 2026-05-19  7:47             ` K Prateek Nayak
  2026-05-19  7:54               ` Andrea Righi
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-05-19  7:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

Hello Andrea,

Thank you for taking a look at the diff!

On 5/19/2026 12:13 PM, Andrea Righi wrote:
> Hi Prateek,
> 
> On Tue, May 19, 2026 at 11:22:32AM +0530, K Prateek Nayak wrote:
>> Hello Peter, Andrea,
>>
>> On 5/19/2026 2:28 AM, Peter Zijlstra wrote:
>>> @@@ -2775,20 -3049,16 +3107,15 @@@ build_sched_domains(const struct cpumas
>>>   		if (!sd)
>>>   			continue;
>>>   
>>>  +		if (has_asym)
>>> - 			asym_claimed = claim_asym_sched_domain_shared(&d, i);
>>> ++			claim_asym_sched_domain_shared(&d, i);
>>>  +
>>>   		/* First, find the topmost SD_SHARE_LLC domain */
>>>   		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
>>>   			sd = sd->parent;
>>>   
>>>   		if (sd->flags & SD_SHARE_LLC) {
>>> - 			/*
>>> - 			 * Initialize the sd->shared for SD_SHARE_LLC unless
>>> - 			 * the asym path above already claimed it.
>>> - 			 */
>>> - 			if (!asym_claimed)
>>> - 				init_sched_domain_shared(&d, sd);
>>>  -			int sd_id = cpumask_first(sched_domain_span(sd));
>>>  -
>>>  -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
>>>  -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
>>>  -			atomic_inc(&sd->shared->ref);
>>> ++			init_sched_domain_shared(&d, sd);
>>
>> This will run into a small problem with "nr_idle_scan" if
>> cpumask_first(sched_domain_span(sd)) is the same for both sd_asym and
>> sd_llc.
> 
> Ah, good catch! When cpumask_first(asym_span) == cpumask_first(llc_span)
> (big.LITTLE typical case), both sd_asym->shared and sd_llc->shared would alias
> to d->sds[0].
> 
>>
>> Load balancer at different domains will populate "nr_idle_scan" with
>> different values and they alias to same ->shared if one isn't
>> degenerated and I believe there is at least one way to hit the WARN_ON()
>> from cpu_attach_domain() if the SD_ASYM_CPUCAPACITY_FULL comes before
>> the last SD_SHARE_LLC domain and the latter is degenerated.
>>
>> How about this:
>>
>>   (On top of queue:sched/core; Lightly tested on !ASYM_CPUCAPACITY system)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index fe09d3268bc9..1d2c98dca211 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -67,7 +67,15 @@ struct sched_domain_shared {
>>  	atomic_t	ref;
>>  	atomic_t	nr_busy_cpus;
>>  	int		has_idle_cores;
>> -	int		nr_idle_scan;
>> +	union {
>> +		int	nr_idle_scan;
>> +		/*
>> +		 * Used during allocation to claim the
>> +		 * sched_domain_shared object at
>> +		 * multiple levels.
> 
> I think between build and the first LB tick, readers of nr_idle_scan may observe
> leftover SD_* flags in nr_idle_scan. This shouldn't be a problem and should
> self-heal soon, but maybe it's worth a comment? Something like:
> 
>   * Note: between build and the first periodic LB tick, which
>   * rewrites the union via update_idle_cpu_scan(), readers of
>   * nr_idle_scan may observe the transient SD_* flag value as
>   * the scan bound. The flag bits are small positive integers,
>   * so the effect is just a slightly relaxed scan bound for one
>   * window and self-heals on the first tick.

Ack! We start with 0 today which isn't representative of the system
state either and depend on the eventual correctness to fix the value
after a hotplug / cpuset.

I can fold in the note and resend it as a formal patch.

Peter, would you prefer a formal patch or would you like to do this
(or something similar) as a part of the conflict resolution itself?

>> +	BUG_ON(!sd->shared);
> 
> Unreachable in practice, but should we have a WARN_ON_ONCE() +
> bail/early-return? In this way we'd fall back to using LLC's shared for
> sd_balance_shared, which seems nicer than a BUG_ON().

Ack! We can just use the last CPU's "sds" if we don't end up finding a
free one as a backup. I just had the BUG_ON() to easily spot my VM
crashing ;-)
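
The claim loop from the diff, together with the backup described above,
can be modelled in user space roughly as follows. This is a sketch under
stated assumptions, not the kernel implementation: spans are bitmasks,
MODEL_NR_CPUS and the MODEL_SD_* values are invented, and shared_model,
domain_model and model_claim_shared() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 8

/* Hypothetical flag values standing in for the kernel's SD_* flags. */
#define MODEL_SD_ASYM_CPUCAPACITY	(1 << 0)
#define MODEL_SD_SHARE_LLC		(1 << 1)

struct shared_model {
	int alloc_flags;		/* 0 means unclaimed */
};

struct domain_model {
	unsigned int span;		/* bit N set: CPU N is in the span */
	int span_weight;		/* number of CPUs in the span */
	struct shared_model *shared;
};

/*
 * Walk the span in CPU order and take the first object that is unclaimed
 * or already claimed with the same flag. Single-CPU domains may overlap.
 * If nothing matches, fall back to the last object visited.
 */
static void model_claim_shared(struct shared_model *sds,
			       struct domain_model *sd, int flags)
{
	struct shared_model *cur = NULL;

	assert(sd->span != 0);	/* an empty span has nothing to claim */
	sd->shared = NULL;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(sd->span & (1u << cpu)))
			continue;

		cur = &sds[cpu];
		if (!cur->alloc_flags || sd->span_weight == 1 ||
		    cur->alloc_flags == flags) {
			cur->alloc_flags = flags;
			sd->shared = cur;
			break;
		}
	}

	if (!sd->shared)
		sd->shared = cur;	/* backup: last CPU's object */
}
```

With an asym domain spanning CPUs 0-7 claimed first and an LLC domain
spanning CPUs 0-3 claimed second, the asym domain gets CPU 0's object and
the LLC domain moves on to CPU 1's object, so the two no longer alias.
Claiming again for the same LLC span with the same flag lands on the same
object.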

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19  7:47             ` K Prateek Nayak
@ 2026-05-19  7:54               ` Andrea Righi
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-19  7:54 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

On Tue, May 19, 2026 at 01:17:20PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
> 
> Thank you for taking a look at the diff!

BTW I just re-ran the NVBLAS benchmark on a Vera Rubin machine using
queue:sched/core + this on top, all good!

Thanks,
-Andrea



* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19  5:52         ` K Prateek Nayak
  2026-05-19  6:43           ` Andrea Righi
@ 2026-05-19  8:46           ` Peter Zijlstra
  2026-05-19 11:27             ` K Prateek Nayak
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2026-05-19  8:46 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

On Tue, May 19, 2026 at 11:22:32AM +0530, K Prateek Nayak wrote:
> Hello Peter, Andrea,
> 
> On 5/19/2026 2:28 AM, Peter Zijlstra wrote:
> > @@@ -2775,20 -3049,16 +3107,15 @@@ build_sched_domains(const struct cpumas
> >   		if (!sd)
> >   			continue;
> >   
> >  +		if (has_asym)
> > - 			asym_claimed = claim_asym_sched_domain_shared(&d, i);
> > ++			claim_asym_sched_domain_shared(&d, i);
> >  +
> >   		/* First, find the topmost SD_SHARE_LLC domain */
> >   		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> >   			sd = sd->parent;
> >   
> >   		if (sd->flags & SD_SHARE_LLC) {
> > - 			/*
> > - 			 * Initialize the sd->shared for SD_SHARE_LLC unless
> > - 			 * the asym path above already claimed it.
> > - 			 */
> > - 			if (!asym_claimed)
> > - 				init_sched_domain_shared(&d, sd);
> >  -			int sd_id = cpumask_first(sched_domain_span(sd));
> >  -
> >  -			sd->shared = *per_cpu_ptr(d.sds, sd_id);
> >  -			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> >  -			atomic_inc(&sd->shared->ref);
> > ++			init_sched_domain_shared(&d, sd);
> 
> This will run into a small problem with "nr_idle_scan" if
> cpumask_first(sched_domain_span(sd)) is the same for both sd_asym and
> sd_llc.
> 
> Load balancer at different domains will populate "nr_idle_scan" with
> different values and they alias to same ->shared if one isn't
> degenerated and I believe there is at least one way to hit the WARN_ON()
> from cpu_attach_domain() if the SD_ASYM_CPUCAPACITY_FULL comes before
> the last SD_SHARE_LLC domain and the latter is degenerated.
> 
> How about this:
> 
>   (On top of queue:sched/core; Lightly tested on !ASYM_CPUCAPACITY system)

Shrikanth noted I had an old version of his SMT ifdeffery patches on, so
I need to rebuild that tree (and the merge) anyway.

Do you want me to munge this in, or keep it on top as a fixie?


* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19  8:46           ` Peter Zijlstra
@ 2026-05-19 11:27             ` K Prateek Nayak
  2026-05-19 11:47               ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-05-19 11:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

Hello Peter,

On 5/19/2026 2:16 PM, Peter Zijlstra wrote:
> Shrikanth noted I had an old version of his SMT ifdeffery patches on, so
> I need to rebuild that tree (and the merge) anyway.
> 
> Do you want me to munge this in, or keep it on top as a fixie?

Feel free to munge it but if you want to retain some context, here is the
full patch with suggestions from Andrea incorporated in:

  (Based on top of 5162728eecc2 ("Merge branch 'sched/cache'"))

---
From b0a8ad4b225820c2369f45242517c1c06bac1826 Mon Sep 17 00:00:00 2001
From: K Prateek Nayak <kprateek.nayak@amd.com>
Date: Tue, 19 May 2026 05:14:23 +0000
Subject: [PATCH] sched/topology: Allow multiple domains to claim
 sched_domain_shared

Recent optimizations of sd->shared assignment moved to allocating a
single instance of per-CPU sched_domain_shared objects per s_data.

Recent optimizations to select_idle_capacity() moved the sd->shared
assignments to "sd_asym" domain when ASYM_CPUCAPACITY is detected but
cache-aware scheduling mandates the presence of "sd_llc_shared" to
compute and cache per-LLC statistics.

Use an "alloc_flags" union in sched_domain_shared to claim a
sched_domain_shared object per sched_domain. Allocation starts searching
for an available / matching sched_domain_shared instance from the first
CPU of sched_domain_span(sd) (sd can be sd_llc, or sd_asym). If the
shared object is claimed by another domain, the instance corresponding
to next CPU in the domain span is explored until a matching / available
instance is found.

In case of a single CPU in sched_domain_span(), the domain will be
degenerated and a temporary overlap of ->shared objects across different
domains is acceptable.

"alloc_flags" forms a union with "nr_idle_scan" and the stale flags are
left as is when the sd->shared is published. The expectation is for the
first load balancing instance to correct the value just like the current
behavior, except the initial value is no longer 0.

Originally-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched/topology.h | 16 ++++++++-
 kernel/sched/topology.c        | 63 +++++++++++++++++++++++++++++-----
 2 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index fe09d3268bc9..b5d9d7c2b8ad 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -67,7 +67,21 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-	int		nr_idle_scan;
+	union {
+		int	nr_idle_scan;
+		/*
+		 * Used during allocation to claim the sched_domain_shared
+		 * object at multiple levels.
+		 *
+		 * Note: between build and the first periodic LB tick, which
+		 * rewrites the union via update_idle_cpu_scan(), readers of
+		 * nr_idle_scan may observe the transient SD_* flag value as
+		 * the scan bound. The flag bits are small positive integers,
+		 * so the effect is just a slightly relaxed scan bound for one
+		 * window and self-heals on the first tick.
+		 */
+		int	alloc_flags;
+	};
 #ifdef CONFIG_SCHED_CACHE
 	unsigned long	util_avg;
 	unsigned long	capacity;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dbfd9657f897..df2ceb54c970 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -623,6 +623,12 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
 	} while (sg != first);
 }
 
+static void free_sched_domain_shared(struct sched_domain_shared *sds)
+{
+	if (sds && atomic_dec_and_test(&sds->ref))
+		kfree(sds);
+}
+
 static void destroy_sched_domain(struct sched_domain *sd)
 {
 	/*
@@ -631,9 +637,7 @@ static void destroy_sched_domain(struct sched_domain *sd)
 	 * dropping group/capacity references, freeing where none remain.
 	 */
 	free_sched_groups(sd->groups, 1);
-
-	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
-		kfree(sd->shared);
+	free_sched_domain_shared(sd->shared);
 
 #ifdef CONFIG_SCHED_CACHE
 	/* only the bottom sd has llc_counts array */
@@ -755,7 +759,14 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 
 			/* Pick reference to parent->shared. */
 			if (parent->shared) {
-				WARN_ON_ONCE(tmp->shared);
+				/*
+				 * It is safe to free a sd->shared that
+				 * has not been published yet. If a
+				 * sd->shared was published, the refcount
+				 * will end up being non-zero and it will
+				 * not be freed here.
+				 */
+				free_sched_domain_shared(tmp->shared);
 				tmp->shared = parent->shared;
 				parent->shared = NULL;
 			}
@@ -2916,11 +2927,45 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
-static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+static void
+init_sched_domain_shared(struct s_data *d, struct sched_domain *sd, int flags)
 {
-	int sd_id = cpumask_first(sched_domain_span(sd));
+	struct sched_domain_shared *sds = NULL;
+	int cpu;
+
+	/*
+	 * Multiple domains can try to claim a shared object like
+	 * SD_ASYM_CPUCAPACITY and SD_SHARE_LLC which can alias to
+	 * same cpumask_first(sched_domain_span(sd)) CPU and can
+	 * cause "nr_idle_scan" to be populated incorrectly during
+	 * load balancing.
+	 *
+	 * Find the first CPU in sched_domain_span(sd) with an
+	 * unclaimed domain (!alloc_flags) or where the alloc_flag
+	 * matches the requested flag (SD_* flag)
+	 *
+	 * If the domain only has single CPU, allow temporary overlap
+	 * in allocation since the domains will be degenerated later.
+	 */
+	for_each_cpu(cpu, sched_domain_span(sd)) {
+		sds = *per_cpu_ptr(d->sds, cpu);
+
+		if (!sds->alloc_flags ||
+		    sd->span_weight == 1 ||
+		    sds->alloc_flags == flags) {
+			sds->alloc_flags = flags;
+			sd->shared = sds;
+			break;
+		}
+	}
+
+	/*
+	 * Use the sd_shared corresponding to the last
+	 * CPU in the span if none are available.
+	 */
+	if (WARN_ON_ONCE(!sd->shared))
+		sd->shared = sds;
 
-	sd->shared = *per_cpu_ptr(d->sds, sd_id);
 	/*
 	 * nr_busy_cpus is consumed only by the NOHZ kick path via
 	 * sd_balance_shared; on the asym-capacity path it is initialized but
@@ -2960,7 +3005,7 @@ static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
 	if (!sd_asym || (sd_asym->flags & SD_NUMA))
 		return false;
 
-	init_sched_domain_shared(d, sd_asym);
+	init_sched_domain_shared(d, sd_asym, SD_ASYM_CPUCAPACITY);
 	return true;
 }
 
@@ -3115,7 +3160,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			init_sched_domain_shared(&d, sd);
+			init_sched_domain_shared(&d, sd, SD_SHARE_LLC);
 
 			/*
 			 * In presence of higher domains, adjust the
-- 
Thanks and Regards,
Prateek



* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19 11:27             ` K Prateek Nayak
@ 2026-05-19 11:47               ` Peter Zijlstra
  2026-05-25  8:30                 ` Chen, Yu C
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2026-05-19 11:47 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, yu.c.chen

On Tue, May 19, 2026 at 04:57:11PM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 5/19/2026 2:16 PM, Peter Zijlstra wrote:
> > Shrikanth noted I had an old version of his SMT ifdeffery patches on, so
> > I need to rebuild that tree (and the merge) anyway.
> > 
> > Do you want me to munge this in, or keep it on top as a fixie?
> 
> Feel free to munge it but if you want to retain some context, here is the
> full patch with suggestions from Andrea incorporated in:
> 
>   (Based on top of 5162728eecc2 ("Merge branch 'sched/cache'"))
> 
> ---
> From b0a8ad4b225820c2369f45242517c1c06bac1826 Mon Sep 17 00:00:00 2001
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> Date: Tue, 19 May 2026 05:14:23 +0000
> Subject: [PATCH] sched/topology: Allow multiple domains to claim
>  sched_domain_shared
> 
> Recent optimizations of sd->shared assignment moved to allocating a
> single instance of per-CPU sched_domain_shared objects per s_data.
> 
> Recent optimizations to select_idle_capacity() moved the sd->shared
> assignments to "sd_asym" domain when ASYM_CPUCAPACITY is detected but
> cache-aware scheduling mandates the presence of "sd_llc_shared" to
> compute and cache per-LLC statistics.
> 
> Use an "alloc_flags" union in sched_domain_shared to claim a
> sched_domain_shared object per sched_domain. Allocation starts searching
> for an available / matching sched_domain_shared instance from the first
> CPU of sched_domain_span(sd) (sd can be sd_llc, or sd_asym). If the
> shared object is claimed by another domain, the instance corresponding
> to next CPU in the domain span is explored until a matching / available
> instance is found.
> 
> In case of a single CPU in sched_domain_span(), the domain will be
> degenerated and a temporary overlap of ->shared objects across different
> domains is acceptable.
> 
> "alloc_flags" forms a union with "nr_idle_scan" and the stale flags are
> left as is when the sd->shared is published. The expectation is for the
> first load balancing instance to correct the value just like the current
> behavior, except the initial value is no longer 0.
> 
> Originally-by: Peter Zijlstra <peterz@infradead.org>
> Tested-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

Since you wrote a nice changelog, I stuck it on top. Pushed out a fresh
queue:sched/core with updated patches from Shrikanth and this on top.

Seems to build and boot in a random vm, so must be good ;-)

I'll push into -tip in a day or so, provided nothing goes boom in the
meantime.


* [tip: sched/core] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
  2026-05-11 13:08   ` Vincent Guittot
@ 2026-05-20  8:34   ` tip-bot2 for K Prateek Nayak
  1 sibling, 0 replies; 47+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-05-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andrea Righi, K Prateek Nayak, Peter Zijlstra (Intel),
	Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     61ea17a63719bac51e1bc50eb39fc637f0fdc06e
Gitweb:        https://git.kernel.org/tip/61ea17a63719bac51e1bc50eb39fc637f0fdc06e
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Sat, 09 May 2026 20:07:29 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:39 +02:00

sched/fair: Add SIS_UTIL support to select_idle_capacity()

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the LLC domain has sched_domain_shared data, derive the
per-attempt scan limit from sd->shared->nr_idle_scan.

That bounds the walk on large LLCs: once nr_idle_scan is exhausted,
return the best CPU seen so far. The early exit is gated on
!has_idle_core so an active idle-core search (SMT with idle cores
reported by test_idle_cores()) isn't cut short before it gets a chance
to find one.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260509180955.1840064-6-arighi@nvidia.com
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f69ee5a..69ba882 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8016,6 +8016,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
+	int nr = INT_MAX;
 
 	cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -8024,10 +8025,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_min = uclamp_eff_value(p, UCLAMP_MIN);
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
+	if (sched_feat(SIS_UTIL) && sd->shared) {
+		/*
+		 * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
+		 * the scan when not preferring an idle core.
+		 */
+		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+		/* overloaded domain is unlikely to have idle cpu/core */
+		if (nr == 1)
+			return -1;
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target) {
 		bool preferred_core = !has_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
+		/*
+		 * Stop when the nr_idle_scan is exhausted (mirrors
+		 * select_idle_cpu() logic).
+		 */
+		if (!has_idle_core && --nr <= 0)
+			return best_cpu;
+
 		if (!choose_idle_cpu(cpu, p))
 			continue;
 


* [tip: sched/core] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
  2026-05-11 13:07   ` Vincent Guittot
  2026-05-15 10:09   ` Shrikanth Hegde
@ 2026-05-20  8:34   ` tip-bot2 for Andrea Righi
  2 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Andrea Righi @ 2026-05-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Felix Abecassis, Andrea Righi, Peter Zijlstra (Intel),
	Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     bf6aa722198d3c06e4236e8c5a480f30a64e1513
Gitweb:        https://git.kernel.org/tip/bf6aa722198d3c06e4236e8c5a480f30a64e1513
Author:        Andrea Righi <arighi@nvidia.com>
AuthorDate:    Sat, 09 May 2026 20:07:28 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:38 +02:00

sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
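
The rejection rule described above can be modelled in userspace roughly
as follows. This is only a sketch: struct pull_ctx and
reject_misfit_pull() are hypothetical names that flatten the lb_env and
group statistics into plain fields, and cap_greater() assumes the
roughly 5% margin used by the kernel's capacity_greater() macro.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the inputs to the misfit check. */
struct pull_ctx {
	bool asym_domain;	/* sd has SD_ASYM_CPUCAPACITY */
	bool group_is_misfit;	/* candidate group is group_misfit_task */
	bool smt_active;	/* models sched_smt_active() */
	bool core_idle;		/* models is_core_idle(dst_cpu) */
	unsigned long dst_cap;	/* models capacity_of(dst_cpu) */
	unsigned long src_max_cap; /* models sg->sgc->max_capacity */
	bool local_has_spare;	/* local group is group_has_spare */
};

/* Assumed margin: a must exceed b by roughly 5%. */
static bool cap_greater(unsigned long a, unsigned long b)
{
	return a * 1024 > b * 1078;
}

/* Returns true when the misfit group must not be picked as busiest. */
static bool reject_misfit_pull(const struct pull_ctx *c)
{
	/* Without SMT every CPU counts as its own fully idle core. */
	bool dst_core_idle = !c->smt_active || c->core_idle;

	return c->asym_domain && c->group_is_misfit &&
	       (!dst_core_idle ||
		!cap_greater(c->dst_cap, c->src_max_cap) ||
		!c->local_has_spare);
}
```

In this model a destination with twice the source capacity is still
rejected while its SMT sibling is busy, and accepted once the whole core
is idle or SMT is not active.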

Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260509180955.1840064-5-arighi@nvidia.com
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8854d4d..f69ee5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9625,6 +9625,7 @@ struct lb_env {
 
 	int			dst_cpu;
 	struct rq		*dst_rq;
+	bool			dst_core_idle;
 
 	struct cpumask		*dst_grpmask;
 	int			new_dst_cpu;
@@ -10850,10 +10851,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * We can use max_capacity here as reduction in capacity on some
 	 * CPUs in the group should either be possible to resolve
 	 * internally or be covered by avg_load imbalance (eventually).
+	 *
+	 * When SMT is active, only pull a misfit to dst_cpu if it is on a
+	 * fully idle core; otherwise the effective capacity of the core is
+	 * reduced and we may not actually provide more capacity than the
+	 * source.
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type == group_misfit_task) &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+	    (!env->dst_core_idle ||
+	     !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -11417,6 +11424,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	unsigned long sum_util = 0;
 	bool sg_overloaded = 0, sg_overutilized = 0;
 
+	env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
+
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;


* [tip: sched/core] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-11 14:25     ` [PATCH v2 " Andrea Righi
@ 2026-05-20  8:34       ` tip-bot2 for Andrea Righi
  0 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Andrea Righi @ 2026-05-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Felix Abecassis, Andrea Righi, Peter Zijlstra (Intel),
	Vincent Guittot, K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     25a32e400a14009601c0a727643057f5515152df
Gitweb:        https://git.kernel.org/tip/25a32e400a14009601c0a727643057f5515152df
Author:        Andrea Righi <arighi@nvidia.com>
AuthorDate:    Mon, 11 May 2026 16:25:02 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:38 +02:00

sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate
capacity usage on task wakeup.
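
The preference order can be illustrated with a small userspace model of
the rank computation that this patch adds to select_idle_capacity(). The
enum values are the ones introduced by the patch; asym_rank() and the
ASYM_RETURN_NOW sentinel are hypothetical helpers for this sketch, which
assumes util_fits_cpu() returns 1 (fits), 0 (does not fit) or -1 (fits,
but uclamp_min is not met).

```c
#include <stdbool.h>

enum asym_fits_state {
	ASYM_IDLE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_THREAD_MISFIT,

	/* Offset applied to candidates on a preferred (idle) core. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Hypothetical sentinel: the scan would return this CPU immediately. */
#define ASYM_RETURN_NOW	1

/*
 * Map a util_fits_cpu() result and the preferred_core flag to a rank.
 * Lower ranks are preferred.
 */
static int asym_rank(int fits, bool preferred_core)
{
	/* Perfect fit on a preferred core: no ranking needed. */
	if (fits > 0 && preferred_core)
		return ASYM_RETURN_NOW;

	/* Fits, but the SMT sibling is busy. */
	if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;

	/* Rebase thread-tier ranks into the core tier. */
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;

	return fits;
}
```

Note that in this model even a complete misfit on an idle core (-3)
ranks ahead of a fitting CPU whose sibling is busy (-2), which is the
ordering the patch documents.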

On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
SMT-aware idle selection has been shown to improve throughput by around
15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
CPU-bound workloads (NVBLAS) running a number of tasks equal to the
number of SMT cores.

Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260511142502.3873984-1-arighi@nvidia.com
---
 kernel/sched/fair.c | 120 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 114 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2637a6f..8854d4d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7951,6 +7951,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
 }
 
 /*
+ * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
+ * where lower values indicate a better fit - see select_idle_capacity().
+ *
+ * A CPU that both fits the task and sits on a fully-idle SMT core is returned
+ * immediately and is never assigned one of these ranks. On !SMT every CPU is
+ * its own "core", so the early return covers all fits-and-idle cases and the
+ * core-tier ranks below become unreachable.
+ *
+ *   Rank                            Val  Tier    Meaning
+ *   ------------------------------  ---  ------  ---------------------------
+ *   ASYM_IDLE_UCLAMP_MISFIT         -4   core    Idle core; capacity fits
+ *                                                util but uclamp_min misses.
+ *   ASYM_IDLE_COMPLETE_MISFIT       -3   core    Idle core; capacity does
+ *                                                not fit. Still beats every
+ *                                                thread-tier rank: a busy
+ *                                                sibling cuts effective
+ *                                                capacity more than a
+ *                                                misfit hurts a quiet core.
+ *   ASYM_IDLE_THREAD_FITS           -2   thread  Busy SMT sibling; capacity
+ *                                                fits util + uclamp.
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT  -1   thread  Busy SMT sibling; capacity
+ *                                                fits but uclamp_min misses
+ *                                                (native util_fits_cpu()
+ *                                                return value).
+ *   ASYM_IDLE_THREAD_MISFIT          0   thread  Busy SMT sibling; capacity
+ *                                                does not fit.
+ *
+ * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
+ * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
+ *
+ *   ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> ASYM_IDLE_UCLAMP_MISFIT   (-4)
+ *   ASYM_IDLE_THREAD_MISFIT         (0) + BIAS -> ASYM_IDLE_COMPLETE_MISFIT (-3)
+ *
+ * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
+ * candidate early-returns from select_idle_capacity().
+ */
+enum asym_fits_state {
+	ASYM_IDLE_UCLAMP_MISFIT = -4,
+	ASYM_IDLE_COMPLETE_MISFIT,
+	ASYM_IDLE_THREAD_FITS,
+	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+	ASYM_IDLE_THREAD_MISFIT,
+
+	/* util_fits_cpu() bias for idle core */
+	ASYM_IDLE_CORE_BIAS = -3,
+};
+
+/*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
  * maximize capacity.
@@ -7958,8 +8006,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
 static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	/*
+	 * On !SMT systems, has_idle_core is always false and preferred_core
+	 * is always true (CPU == core), so the SMT preference logic below
+	 * collapses to the plain capacity scan.
+	 */
+	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
 	unsigned long task_util, util_min, util_max, best_cap = 0;
-	int fits, best_fits = 0;
+	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -7971,6 +8025,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
+		bool preferred_core = !has_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
 		if (!choose_idle_cpu(cpu, p))
@@ -7978,8 +8033,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
 
-		/* This CPU fits with all requirements */
-		if (fits > 0)
+		/*
+		 * Perfect fit: capacity satisfies util + uclamp and the CPU
+		 * sits on a fully-idle SMT core, this is a !SMT system, or
+		 * there is no idle core to find.
+		 * Short-circuit the rank-based selection and return
+		 * immediately.
+		 */
+		if (fits > 0 && preferred_core)
 			return cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -7987,9 +8048,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 */
 		else if (fits < 0)
 			cpu_cap = get_actual_cpu_capacity(cpu);
+		/*
+		 * fits > 0 implies we are not on a preferred core, but the util
+		 * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+		 * so the effective range becomes
+		 * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_THREAD_MISFIT], where:
+		 *    ASYM_IDLE_THREAD_MISFIT - does not fit
+		 *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+		 *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+		 */
+		else if (fits > 0)
+			fits = ASYM_IDLE_THREAD_FITS;
 
 		/*
-		 * First, select CPU which fits better (-1 being better than 0).
+		 * If we are on a preferred core, translate the range of fits
+		 * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_THREAD_MISFIT] to
+		 * [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT].
+		 * This ensures that an idle core is always given priority over
+		 * (partially) busy core.
+		 *
+		 * A fully fitting idle core would have returned early and hence
+		 * fits > 0 for preferred_core need not be dealt with.
+		 */
+		if (preferred_core)
+			fits += ASYM_IDLE_CORE_BIAS;
+
+		/*
+		 * First, select CPU which fits better (lower is more preferred).
 		 * Then, select the one with best capacity at same level.
 		 */
 		if ((fits < best_fits) ||
@@ -8000,6 +8085,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		}
 	}
 
+	/*
+	 * A value in the [ASYM_IDLE_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT]
+	 * range means the chosen CPU is in a fully idle SMT core. Values above
+	 * ASYM_IDLE_COMPLETE_MISFIT mean we never ranked such a CPU best.
+	 *
+	 * The asym-capacity wakeup path returns from select_idle_sibling()
+	 * after this function and never runs select_idle_cpu(), so the usual
+	 * select_idle_cpu() tail that clears idle cores must live here when the
+	 * idle-core preference did not win.
+	 */
+	if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
+		set_idle_cores(target, false);
+
 	return best_cpu;
 }
 
@@ -8008,12 +8106,22 @@ static inline bool asym_fits_cpu(unsigned long util,
 				 unsigned long util_max,
 				 int cpu)
 {
-	if (sched_asym_cpucap_active())
+	if (sched_asym_cpucap_active()) {
 		/*
 		 * Return true only if the cpu fully fits the task requirements
 		 * which include the utilization and the performance hints.
+		 *
+		 * When SMT is active, also require that the core has no busy
+		 * siblings.
+		 *
+		 * Note: gating on is_core_idle() also makes the early-bailout
+		 * candidates in select_idle_sibling() (target, prev,
+		 * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
+		 * NO_ASYM path does not do.
 		 */
-		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+		return (!sched_smt_active() || is_core_idle(cpu)) &&
+		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+	}
 
 	return true;
 }


* [tip: sched/core] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-16  5:58     ` [PATCH v2 " Andrea Righi
  2026-05-16 17:19       ` Shrikanth Hegde
  2026-05-18 20:58       ` Peter Zijlstra
@ 2026-05-20  8:34       ` tip-bot2 for K Prateek Nayak
  2 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2026-05-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Andrea Righi, K Prateek Nayak, Peter Zijlstra (Intel),
	Shrikanth Hegde, Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     fdfe5a8cd8731dd81840f26abfb6527edd27b0cb
Gitweb:        https://git.kernel.org/tip/fdfe5a8cd8731dd81840f26abfb6527edd27b0cb
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Sat, 16 May 2026 07:58:50 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:38 +02:00

sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint, however, lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain. nr_busy_cpus also lives
in the same shared sched_domain data, but it is never used in the asym
CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has an SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.
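
The ownership decision described above can be sketched in userspace as
below. This is a simplified model, not the topology code: struct dom,
the flag bit values and the helper names are hypothetical, and only the
question of which domain ends up owning the shared object is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real values differ. */
#define SD_SHARE_LLC			0x1
#define SD_ASYM_CPUCAPACITY_FULL	0x2
#define SD_NUMA				0x4

/* Hypothetical stand-in for struct sched_domain. */
struct dom {
	unsigned int flags;
	struct dom *parent;
};

/* Innermost FULL ancestor, unless it is a NUMA-built domain. */
static struct dom *asym_shared_owner(struct dom *base)
{
	struct dom *sd = base;

	while (sd && !(sd->flags & SD_ASYM_CPUCAPACITY_FULL))
		sd = sd->parent;

	if (!sd || (sd->flags & SD_NUMA))
		return NULL;

	return sd;
}

/* Topmost SD_SHARE_LLC domain above a non-NULL base, or NULL. */
static struct dom *llc_top(struct dom *base)
{
	struct dom *sd = base;

	while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
		sd = sd->parent;

	return (sd->flags & SD_SHARE_LLC) ? sd : NULL;
}

/* Domain that would own the shared object for this CPU's hierarchy. */
static struct dom *shared_owner(struct dom *base, bool has_asym)
{
	struct dom *sd = NULL;

	if (!base)
		return NULL;

	if (has_asym)
		sd = asym_shared_owner(base);

	/* Fallback cases 1) to 3): the LLC keeps the shared object. */
	return sd ? sd : llc_top(base);
}
```

For an SMT -> MC -> PKG hierarchy where only PKG carries the FULL flag,
the model picks PKG; if PKG is NUMA-built or carries no FULL flag at
all, it falls back to MC, the topmost LLC domain.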

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260516055850.1345932-1-arighi@nvidia.com
---
 kernel/sched/fair.c     | 22 +++++----
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 95 ++++++++++++++++++++++++++++++++++------
 3 files changed, 97 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03f63b0..2637a6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7773,7 +7773,7 @@ static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		WRITE_ONCE(sds->has_idle_cores, val);
 }
@@ -7782,7 +7782,7 @@ static inline bool test_idle_cores(int cpu)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		return READ_ONCE(sds->has_idle_cores);
 
@@ -7791,7 +7791,7 @@ static inline bool test_idle_cores(int cpu)
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_balance_shared->has_idle_cores.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -7821,7 +7821,8 @@ unlock:
 /*
  * Scan the entire LLC domain for idle cores; this dynamically switches off if
  * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * sd_balance_shared->has_idle_cores and enabled through update_idle_core()
+ * above.
  */
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
@@ -7885,7 +7886,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool 
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && sd->shared) {
 		/*
 		 * Increment because !--nr is the condition to stop scan.
 		 *
@@ -12764,7 +12765,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds) {
 		/*
 		 * If there is an imbalance between LLC domains (IOW we could
@@ -12792,7 +12793,11 @@ static void set_cpu_sd_state_busy(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || !sd->nohz_idle)
+	/*
+	 * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
+	 * domain has no shared object there is nothing to clear or account.
+	 */
+	if (!sd || !sd->shared || !sd->nohz_idle)
 		return;
 	sd->nohz_idle = 0;
 
@@ -12817,7 +12822,8 @@ static void set_cpu_sd_state_idle(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || sd->nohz_idle)
+	/* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
+	if (!sd || !sd->shared || sd->nohz_idle)
 		return;
 	sd->nohz_idle = 1;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ffe77b2..bfb4b47 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2164,7 +2164,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(int, sd_share_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a1f46e3..f96d501 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+	/*
+	 * The shared object is attached to sd_asym_cpucapacity only when the
+	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+	 * On overlapping (NUMA) asym domains we fall back to letting the
+	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+	 * here.
+	 */
+	if (sd && sd->shared)
+		sds = sd->shared;
+
+	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
 	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 
-		/* If sd_llc exists, sd_llc_shared should exist too. */
-		WARN_ON_ONCE(!sd->shared);
-		sds = sd->shared;
+		/*
+		 * If sd_asym_cpucapacity didn't claim the shared object,
+		 * sd_llc must have one linked.
+		 */
+		if (!sds) {
+			WARN_ON_ONCE(!sd->shared);
+			sds = sd->shared;
+		}
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
-	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
@@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
-	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
-	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
 }
 
 /*
@@ -2648,6 +2663,54 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+	int sd_id = cpumask_first(sched_domain_span(sd));
+
+	sd->shared = *per_cpu_ptr(d->sds, sd_id);
+	/*
+	 * nr_busy_cpus is consumed only by the NOHZ kick path via
+	 * sd_balance_shared; on the asym-capacity path it is initialized but
+	 * never read.
+	 */
+	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+	atomic_inc(&sd->shared->ref);
+}
+
+/*
+ * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
+ * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
+ * not an overlapping NUMA-built domain (then LLC should claim shared).
+ *
+ * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
+ * then LLC must claim shared instead.
+ *
+ * Note: SD_ASYM_CPUCAPACITY_FULL is only set when all CPU capacity values
+ * are present in the domain span, so the asym domain we attach to cannot
+ * degenerate into a single-capacity group. The relevant edge cases are instead
+ * covered by the caveats above.
+ *
+ * Return true if this CPU's asym path claimed sd->shared, false otherwise.
+ */
+static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
+	struct sched_domain *sd_asym;
+
+	if (!sd)
+		return false;
+
+	sd_asym = sd;
+	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+		sd_asym = sd_asym->parent;
+
+	if (!sd_asym || (sd_asym->flags & SD_NUMA))
+		return false;
+
+	init_sched_domain_shared(d, sd_asym);
+	return true;
+}
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -2706,20 +2769,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 
 	for_each_cpu(i, cpu_map) {
+		bool asym_claimed = false;
+
 		sd = *per_cpu_ptr(d.sd, i);
 		if (!sd)
 			continue;
 
+		if (has_asym)
+			asym_claimed = claim_asym_sched_domain_shared(&d, i);
+
 		/* First, find the topmost SD_SHARE_LLC domain */
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			int sd_id = cpumask_first(sched_domain_span(sd));
-
-			sd->shared = *per_cpu_ptr(d.sds, sd_id);
-			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
-			atomic_inc(&sd->shared->ref);
+			/*
+			 * Initialize the sd->shared for SD_SHARE_LLC unless
+			 * the asym path above already claimed it.
+			 */
+			if (!asym_claimed)
+				init_sched_domain_shared(&d, sd);
 
 			/*
 			 * In presence of higher domains, adjust the

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [tip: sched/core] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
  2026-05-15  6:49   ` Shrikanth Hegde
@ 2026-05-20  8:34   ` tip-bot2 for Andrea Righi
  2026-05-21 19:47   ` [PATCH 1/5] " Marek Szyprowski
  3 siblings, 0 replies; 47+ messages in thread
From: tip-bot2 for Andrea Righi @ 2026-05-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: K Prateek Nayak, Andrea Righi, Peter Zijlstra (Intel),
	Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     c9d93a73ce871ca32caf9308562501290b64b955
Gitweb:        https://git.kernel.org/tip/c9d93a73ce871ca32caf9308562501290b64b955
Author:        Andrea Righi <arighi@nvidia.com>
AuthorDate:    Sat, 09 May 2026 20:07:25 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:37 +02:00

sched/fair: Drop redundant RCU read lock in NOHZ kick path

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.

The same applies to set_cpu_sd_state_idle(), called from the idle entry
path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
teardown, which runs under cpus_write_lock(), so it cannot race with
sched-domain rebuilds). In both cases the rcu_dereference_all()
validation is sufficient.

No functional change intended.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260509180955.1840064-2-arighi@nvidia.com
---
 kernel/sched/fair.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcaaddd..03f63b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12715,8 +12715,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12724,8 +12722,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12741,8 +12739,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12753,10 +12751,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12765,7 +12761,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
 	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
@@ -12780,13 +12776,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12798,17 +12790,13 @@ out:
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12827,17 +12815,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*


* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
                     ` (2 preceding siblings ...)
  2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
@ 2026-05-21 19:47   ` Marek Szyprowski
  2026-05-21 20:13     ` Andrea Righi
  3 siblings, 1 reply; 47+ messages in thread
From: Marek Szyprowski @ 2026-05-21 19:47 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

On 09.05.2026 20:07, Andrea Righi wrote:
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
>
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
>
> No functional change intended.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
This patch landed in today's linux-next as commit c9d93a73ce87 ("sched/fair: Drop
redundant RCU read lock in NOHZ kick path"). In my tests I found that it introduced
the following warning during the CPU hot-plug tests:


root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done

=============================
WARNING: suspicious RCU usage
7.1.0-rc2+ #12775 Not tainted
-----------------------------
kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cpuhp/1/20:
 #0: ffffffff81a16220 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
 #1: ffffffff81a16270 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae

stack backtrace:
CPU: 1 UID: 0 PID: 20 Comm: cpuhp/1 Not tainted 7.1.0-rc2+ #12775 PREEMPTLAZY
Hardware name: StarFive VisionFive 2 v1.2A (DT)
Call Trace:
[<ffffffff8001827c>] dump_backtrace+0x1c/0x24
[<ffffffff800014c0>] show_stack+0x28/0x34
[<ffffffff80010d42>] dump_stack_lvl+0x5e/0x86
[<ffffffff80010d7e>] dump_stack+0x14/0x1c
[<ffffffff800987ec>] lockdep_rcu_suspicious+0x14c/0x1b8
[<ffffffff80079992>] nohz_balance_exit_idle+0xf4/0xf6
[<ffffffff800664e6>] sched_cpu_deactivate+0x6c/0x1c8
[<ffffffff8002a5d0>] cpuhp_invoke_callback+0xf8/0x1ce
[<ffffffff8002a944>] cpuhp_thread_fun+0x150/0x1ae
[<ffffffff8005dc64>] smpboot_thread_fn+0x138/0x2a4
[<ffffffff800554ae>] kthread+0xea/0x10c
[<ffffffff800134c4>] ret_from_fork_kernel+0x22/0x386
[<ffffffff80c278ee>] ret_from_fork_kernel_asm+0x16/0x18
CPU1: off
CPU2: off
CPU3: off

This issue is observed on most of my ARM 32bit, ARM 64bit and RiscV64 based boards.


> ---
>  kernel/sched/fair.c | 38 +++++++++++---------------------------
>  1 file changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..6b059ee80b631 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
>  		goto out;
>  	}
>  
> -	rcu_read_lock();
> -
>  	sd = rcu_dereference_all(rq->sd);
>  	if (sd) {
>  		/*
> @@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * capacity, kick the ILB to see if there's a better CPU to run on:
>  		 */
>  		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +			goto out;
>  		}
>  	}
>  
> @@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 */
>  		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>  			if (sched_asym(sd, i, cpu)) {
> -				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -				goto unlock;
> +				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +				goto out;
>  			}
>  		}
>  	}
> @@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>  		 * to run the misfit task on.
>  		 */
> -		if (check_misfit_status(rq)) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> -		}
> +		if (check_misfit_status(rq))
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>  
>  		/*
>  		 * For asymmetric systems, we do not want to nicely balance
> @@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 *
>  		 * Skip the LLC logic because it's not relevant in that case.
>  		 */
> -		goto unlock;
> +		goto out;
>  	}
>  
>  	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> @@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * like this LLC domain has tasks we could move.
>  		 */
>  		nr_busy = atomic_read(&sds->nr_busy_cpus);
> -		if (nr_busy > 1) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> -		}
> +		if (nr_busy > 1)
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>  	}
> -unlock:
> -	rcu_read_unlock();
>  out:
>  	if (READ_ONCE(nohz.needs_update))
>  		flags |= NOHZ_NEXT_KICK;
> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>  static void set_cpu_sd_state_busy(int cpu)
>  {
>  	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>  
>  	if (!sd || !sd->nohz_idle)
> -		goto unlock;
> +		return;
>  	sd->nohz_idle = 0;
>  
>  	atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>  }
>  
>  void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>  static void set_cpu_sd_state_idle(int cpu)
>  {
>  	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>  
>  	if (!sd || sd->nohz_idle)
> -		goto unlock;
> +		return;
>  	sd->nohz_idle = 1;
>  
>  	atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>  }
>  
>  /*

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-21 19:47   ` [PATCH 1/5] " Marek Szyprowski
@ 2026-05-21 20:13     ` Andrea Righi
  0 siblings, 0 replies; 47+ messages in thread
From: Andrea Righi @ 2026-05-21 20:13 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hi Marek,

On Thu, May 21, 2026 at 09:47:03PM +0200, Marek Szyprowski wrote:
> On 09.05.2026 20:07, Andrea Righi wrote:
> > nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> > called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> > additional rcu_read_lock/unlock() used around sched_domain accesses in
> > this path is redundant. Rely on the existing IRQ-disabled context (and
> > the rcu_dereference_all() checking) instead.
> >
> > The same applies to set_cpu_sd_state_idle(), called from the idle entry
> > path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> > nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> > disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> > teardown, which runs under cpus_write_lock(), so it cannot race with
> > sched-domain rebuilds). In both cases the rcu_dereference_all()
> > validation is sufficient.
> >
> > No functional change intended.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> This patch landed in today's linux-next as commit c9d93a73ce87 ("sched/fair: Drop
> redundant RCU read lock in NOHZ kick path"). In my tests I found that it introduced
> the following warning during the CPU hot-plug tests:
> 
> 
> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
> 
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-rc2+ #12775 Not tainted
> -----------------------------
> kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 2 locks held by cpuhp/1/20:
>  #0: ffffffff81a16220 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
>  #1: ffffffff81a16270 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae
> 
> stack backtrace:
> CPU: 1 UID: 0 PID: 20 Comm: cpuhp/1 Not tainted 7.1.0-rc2+ #12775 PREEMPTLAZY
> Hardware name: StarFive VisionFive 2 v1.2A (DT)
> Call Trace:
> [<ffffffff8001827c>] dump_backtrace+0x1c/0x24
> [<ffffffff800014c0>] show_stack+0x28/0x34
> [<ffffffff80010d42>] dump_stack_lvl+0x5e/0x86
> [<ffffffff80010d7e>] dump_stack+0x14/0x1c
> [<ffffffff800987ec>] lockdep_rcu_suspicious+0x14c/0x1b8
> [<ffffffff80079992>] nohz_balance_exit_idle+0xf4/0xf6
> [<ffffffff800664e6>] sched_cpu_deactivate+0x6c/0x1c8
> [<ffffffff8002a5d0>] cpuhp_invoke_callback+0xf8/0x1ce
> [<ffffffff8002a944>] cpuhp_thread_fun+0x150/0x1ae
> [<ffffffff8005dc64>] smpboot_thread_fn+0x138/0x2a4
> [<ffffffff800554ae>] kthread+0xea/0x10c
> [<ffffffff800134c4>] ret_from_fork_kernel+0x22/0x386
> [<ffffffff80c278ee>] ret_from_fork_kernel_asm+0x16/0x18
> CPU1: off
> CPU2: off
> CPU3: off
> 
> This issue is observed on most of my ARM 32bit, ARM 64bit and RiscV64 based boards.
> 

Ah, yes, makes sense. We missed the CPU hotplug case. When CPUs are taken
offline, set_cpu_sd_state_busy() is invoked via:

    cpuhp/N kthread
      cpuhp_thread_fun()
        cpuhp_invoke_callback()
          sched_cpu_deactivate()
            nohz_balance_exit_idle()
              set_cpu_sd_state_busy()
                rcu_dereference_all(per_cpu(sd_llc, cpu))

The cpuhp kthread holds cpu_hotplug_lock, but runs with preemption and IRQs
enabled. I think we should just restore the RCU read lock in
set_cpu_sd_state_{busy,idle}() to fix this. I'll send a patch soon.
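
For illustration only, the condition behind the splat can be modeled in
userspace. This is a minimal sketch, not kernel code: the mock_*() helpers,
the irqs_disabled / preempt_disabled flags and the rcu_warnings counter are
hypothetical stand-ins, and it assumes the check is satisfied either by an
RCU read-side critical section or by a non-preemptible context. It shows the
unlocked variant passing in the tick path, warning in the hotplug path, and
passing again once the read lock is restored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for kernel context state. */
static int  rcu_depth;        /* nesting level of mock_rcu_read_lock() */
static bool irqs_disabled;    /* true when modeling the sched_tick() path */
static bool preempt_disabled; /* true when modeling a non-preemptible section */
static int  rcu_warnings;     /* number of simulated "suspicious RCU usage" splats */

/* Simplified: the real counter lives in sd->shared and is atomic. */
struct sched_domain {
	int nohz_idle;
	int nr_busy_cpus;
};

static struct sched_domain llc_domain = { .nohz_idle = 1, .nr_busy_cpus = 0 };
static struct sched_domain *sd_llc = &llc_domain;

static void mock_rcu_read_lock(void)   { rcu_depth++; }
static void mock_rcu_read_unlock(void) { rcu_depth--; }

/*
 * Assumed model of the lockdep check: complain when the pointer is read
 * with no read lock held while both IRQs and preemption are enabled.
 */
static struct sched_domain *mock_rcu_dereference_all(struct sched_domain *p)
{
	if (!rcu_depth && !irqs_disabled && !preempt_disabled)
		rcu_warnings++;
	return p;
}

/* Shape of the function after the patch: relies on the caller's context. */
static void set_busy_unlocked(void)
{
	struct sched_domain *sd = mock_rcu_dereference_all(sd_llc);

	if (!sd || !sd->nohz_idle)
		return;
	sd->nohz_idle = 0;
	sd->nr_busy_cpus++;
}

/* Shape of the function with the read lock restored. */
static void set_busy_locked(void)
{
	struct sched_domain *sd;

	mock_rcu_read_lock();
	sd = mock_rcu_dereference_all(sd_llc);
	if (sd && sd->nohz_idle) {
		sd->nohz_idle = 0;
		sd->nr_busy_cpus++;
	}
	mock_rcu_read_unlock();
}
```

Calling set_busy_unlocked() with irqs_disabled set models the tick path and
leaves rcu_warnings untouched; calling it with all flags clear models the
cpuhp kthread and bumps the counter; set_busy_locked() is quiet in both.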

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-19 11:47               ` Peter Zijlstra
@ 2026-05-25  8:30                 ` Chen, Yu C
  0 siblings, 0 replies; 47+ messages in thread
From: Chen, Yu C @ 2026-05-25  8:30 UTC (permalink / raw)
  To: Peter Zijlstra, K Prateek Nayak
  Cc: Andrea Righi, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel, tim.c.chen, ricardo.neri, Sang, Oliver

On 5/19/2026 7:47 PM, Peter Zijlstra wrote:
> On Tue, May 19, 2026 at 04:57:11PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 5/19/2026 2:16 PM, Peter Zijlstra wrote:
>>> Shrikanth noted I had an old version of his SMT ifdeffery patches on, so
>>> I need to rebuild that tree (and the merge) anyway.
>>>
>>> Do you want me to munge this in, or keep it on top as a fixie?
>>
>> Feel free to munge it but if you want to retain some context, here is the
>> full patch with suggestions from Andrea incorporated in:
>>
>>    (Based on top of 5162728eecc2 ("Merge branch 'sched/cache'"))
>>
>> ---
>>  From b0a8ad4b225820c2369f45242517c1c06bac1826 Mon Sep 17 00:00:00 2001
>> From: K Prateek Nayak <kprateek.nayak@amd.com>
>> Date: Tue, 19 May 2026 05:14:23 +0000
>> Subject: [PATCH] sched/topology: Allow multiple domains to claim
>>   sched_domain_shared
>>
>> Recent optimizations of sd->shared assignment moved to allocating a
>> single instance of per-CPU sched_domain_shared objects per s_data.
>>
>> Recent optimizations to select_idle_capacity() moved the sd->shared
>> assignments to "sd_asym" domain when ASYM_CPUCAPACITY is detected but
>> cache-aware scheduling mandates the presence of "sd_llc_shared" to
>> compute and cache per-LLC statistics.
>>
>> Use an "alloc_flags" union in sched_domain_shared to claim a
>> sched_domain_shared object per sched_domain. Allocation starts searching
>> for an available / matching sched_domain_shared instance from the first
>> CPU of sched_domain_span(sd) (sd can be sd_llc, or sd_asym). If the
>> shared object is claimed by another domain, the instance corresponding
>> to next CPU in the domain span is explored until a matching / available
>> instance is found.
>>
>> In case of a single CPU in sched_domain_span(), the domain will be
>> degenerated and a temporary overlap of ->shared objects across different
>> domains is acceptable.
>>
>> "alloc_flags" forms a union with "nr_idle_scan" and the stale flags are
>> left as is when the sd->shared is published. The expectation is for the
>> first load balancing instance to correct the value just like the current
>> behavior, except the initial value is no longer 0.
>>
>> Originally-by: Peter Zijlstra <peterz@infradead.org>
>> Tested-by: Andrea Righi <arighi@nvidia.com>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> Since you wrote a nice changelog, I stuck it on top. Pushed out a fresh
> queue:sched/core with updated patches from Shrikanth and this on top.
> 
> Seems to build and boot in a random vm, so must be good ;-)
> 
> I'll push into -tip in a day or so, provided nothing goes boom in the
> meantime.

Cache-aware scheduling works as expected on top of the latest sched/core.
Performance gains were observed with hackbench. Tests were also run on
Alder Lake (P-cores with SMT enabled, E-cores without SMT) and showed no
notable discrepancies, so only the server data is pasted here. In the
future we may need to apply cache-aware scheduling to hybrid CPUs. For now,
the current version should be OK, unless LKP 0day spots anything unusual
later.

=========================================
Hackbench comparison
BASE: baseline
TEST: sched_cache
=========================================
MODE     G  FD |         BASE(s) |         TEST(s) |   DIFF% | RESULT
------- -- ----+-----------------+-----------------+---------+----------
threads  1  10 |    108.020/0.3% |     65.355/2.1% |  39.50% | IMPROVED
threads  1   2 |     17.246/2.6% |      9.761/1.9% |  43.40% | IMPROVED
threads  1  20 |    253.990/1.7% |    245.875/3.0% |   3.20% | IMPROVED
threads  1   4 |     40.849/5.2% |     24.030/0.0% |  41.17% | IMPROVED
threads  1   6 |     61.659/1.0% |     38.852/3.5% |  36.99% | IMPROVED
threads  1   8 |     86.121/0.5% |     51.900/1.7% |  39.74% | IMPROVED
threads  2  10 |    121.315/0.7% |   115.118/11.1% |   5.11% | IMPROVED
threads  2   2 |     17.647/5.6% |      9.858/0.4% |  44.14% | IMPROVED
threads  2  20 |    333.429/5.4% |    320.477/5.4% |   3.88% | IMPROVED
threads  2   4 |     43.612/1.0% |     26.275/1.9% |  39.75% | IMPROVED
threads  2   6 |     69.413/0.9% |     41.151/0.8% |  40.72% | IMPROVED
threads  2   8 |     95.811/0.1% |     52.816/1.7% |  44.87% | IMPROVED
threads  4  10 |    148.215/1.3% |    146.949/3.2% |   0.85% | IMPROVED
threads  4   2 |     18.987/0.9% |     10.065/1.8% |  46.99% | IMPROVED
threads  4  20 |    832.352/2.0% |    768.323/1.7% |   7.69% | IMPROVED
threads  4   4 |     47.413/0.9% |     24.087/6.2% |  49.20% | IMPROVED
threads  4   6 |     75.117/0.2% |     76.142/0.3% |  -1.36% | REGRESSED
threads  4   8 |    110.599/1.0% |    108.190/2.7% |   2.18% | IMPROVED
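
The DIFF% column appears to be the relative reduction in runtime,
(BASE - TEST) / BASE * 100, so a positive value means TEST finished faster.
That reading is inferred from the rows, not stated by the tool. A small
sketch with a hypothetical helper, diff_pct(), which reproduces the
"threads 1 10" and "threads 4 6" rows when fed their BASE and TEST times:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: relative runtime reduction in percent.
 * Positive means the TEST run took less time than the BASE run.
 */
static double diff_pct(double base_s, double test_s)
{
	return (base_s - test_s) / base_s * 100.0;
}

/* Print one comparison line in a layout similar to the table above. */
static void print_row(const char *label, double base_s, double test_s)
{
	double d = diff_pct(base_s, test_s);

	printf("%-14s | %10.3f | %10.3f | %6.2f%% | %s\n",
	       label, base_s, test_s, d, d >= 0.0 ? "IMPROVED" : "REGRESSED");
}
```

For example, print_row("threads 1 10", 108.020, 65.355) and
print_row("threads 4 6", 75.117, 76.142) cover one improved and the one
regressed configuration.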



^ permalink raw reply	[flat|nested] 47+ messages in thread

* kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
  2026-05-15 10:05   ` Shrikanth Hegde
@ 2026-07-03 10:22   ` Breno Leitao
  2026-07-03 10:35     ` K Prateek Nayak
  2 siblings, 1 reply; 47+ messages in thread
From: Breno Leitao @ 2026-07-03 10:22 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hi,

On current linux-next (CONFIG_SCHED_CACHE=y) we hit a kmemleak on an
arm64 box with asymmetric CPU capacity, triggered by a cpuset-driven
sched-domain rebuild:

kmemleak: unreferenced object 0xffff000100c95e80 (size 32):
  comm "kworker/22:1", pid 407, jiffies 4294669077
  hex dump (first 32 bytes):
    48 00 00 00 48 00 00 00 00 00 00 00 20 00 00 00  H...H....... ...
  backtrace (crc ec5d7053):
    __kmalloc_cache_node_noprof
    build_sched_domains
    partition_sched_domains
    rebuild_sched_domains_locked
    rebuild_sched_domains
    process_scheduled_works
    kthread
    ret_from_fork
  kmemleak: 1 new suspected memory leaks

The leaked object is a struct sched_domain_shared (32 bytes with
CONFIG_SCHED_CACHE) allocated in __sds_alloc(), inlined into
build_sched_domains()

Is this a known issue?

--breno

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-03 10:22   ` kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE Breno Leitao
@ 2026-07-03 10:35     ` K Prateek Nayak
  2026-07-03 16:19       ` Breno Leitao
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-07-03 10:35 UTC (permalink / raw)
  To: Breno Leitao, Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

Hello Breno,

On 7/3/2026 3:52 PM, Breno Leitao wrote:
> On current linux-next (CONFIG_SCHED_CACHE=y) we hit a kmemleak on an
> arm64 box with asymmetric CPU capacity, triggered by a cpuset-driven
> sched-domain rebuild:
> 
> kmemleak: unreferenced object 0xffff000100c95e80 (size 32):
>   comm "kworker/22:1", pid 407, jiffies 4294669077
>   hex dump (first 32 bytes):
>     48 00 00 00 48 00 00 00 00 00 00 00 20 00 00 00  H...H....... ...
>   backtrace (crc ec5d7053):
>     __kmalloc_cache_node_noprof
>     build_sched_domains
>     partition_sched_domains
>     rebuild_sched_domains_locked
>     rebuild_sched_domains
>     process_scheduled_works
>     kthread
>     ret_from_fork
>   kmemleak: 1 new suspected memory leaks
> 
> The leaked object is a struct sched_domain_shared (32 bytes with
> CONFIG_SCHED_CACHE) allocated in __sds_alloc(), inlined into
> build_sched_domains()
> 
> Is this a known issue?

It is theoretically possible but there is a defensive WARN_ON()
in topology code that you should have hit first. Do you see any
other warning?

If it is not too much trouble, could you add "sched_verbose"
to your kernel cmdline (or do
echo Y > /sys/kernel/debug/sched/verbose) and redo this cpuset
that leaks the data and share the dmesg. It should give some clue
what the topology looks like that causes this.

Thank you for reporting the issue.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-03 10:35     ` K Prateek Nayak
@ 2026-07-03 16:19       ` Breno Leitao
  2026-07-04  6:16         ` K Prateek Nayak
  0 siblings, 1 reply; 47+ messages in thread
From: Breno Leitao @ 2026-07-03 16:19 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hello Prateek,

On Fri, Jul 03, 2026 at 04:05:59PM +0530, K Prateek Nayak wrote:
> On 7/3/2026 3:52 PM, Breno Leitao wrote:
> > On current linux-next (CONFIG_SCHED_CACHE=y) we hit a kmemleak on an
> > arm64 box with asymmetric CPU capacity, triggered by a cpuset-driven
> > sched-domain rebuild:
> > 
> > kmemleak: unreferenced object 0xffff000100c95e80 (size 32):
> >   comm "kworker/22:1", pid 407, jiffies 4294669077
> >   hex dump (first 32 bytes):
> >     48 00 00 00 48 00 00 00 00 00 00 00 20 00 00 00  H...H....... ...
> >   backtrace (crc ec5d7053):
> >     __kmalloc_cache_node_noprof
> >     build_sched_domains
> >     partition_sched_domains
> >     rebuild_sched_domains_locked
> >     rebuild_sched_domains
> >     process_scheduled_works
> >     kthread
> >     ret_from_fork
> >   kmemleak: 1 new suspected memory leaks
> > 
> > The leaked object is a struct sched_domain_shared (32 bytes with
> > CONFIG_SCHED_CACHE) allocated in __sds_alloc(), inlined into
> > build_sched_domains()
> > 
> > Is this a known issue?
> 
> It is theoretically possible but there is a defensive WARN_ON()
> in topology code that you should have hit first. Do you see any
> other warning?
> 
> If it is not too much trouble, could you add "sched_verbose"
> to your kernel cmdline (or do
> echo Y > /sys/kernel/debug/sched/verbose) and redo this cpuset
> that leaks the data and share the dmesg. It should give some clue
> what the topology looks like that causes this.

Sure, does this one help:

  bash-5.1# cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff0000c1180820 (size 32):
    comm "swapper/0", pid 1, jiffies 4294667323
    hex dump (first 32 bytes):
      08 00 00 00 08 00 00 00 00 00 00 00 20 00 00 00  ............ ...
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc f4478cb7):
      kmemleak_alloc+0x44/0xd8
      __kmalloc_cache_node_noprof+0x344/0x5d8
      build_sched_domains+0x2f8/0x2110
      sched_init_domains+0xec/0x160
      sched_init_smp+0x48/0x108
      kernel_init_freeable+0x140/0x200
      kernel_init+0x30/0x170
      ret_from_fork+0x10/0x20

  bash-5.1# cd /sys/kernel/debug/sched/domains/

  bash-5.1# grep . -r *
  cpu0/domain0/level:1
  cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu0/domain0/name:MC
  cpu0/domain0/cache_nice_tries:1
  cpu0/domain0/imbalance_pct:117
  cpu0/domain0/busy_factor:16
  cpu0/domain0/max_newidle_lb_cost:2127
  cpu0/domain0/max_interval:16
  cpu0/domain0/min_interval:8
  cpu1/domain0/level:1
  cpu1/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu1/domain0/name:MC
  cpu1/domain0/cache_nice_tries:1
  cpu1/domain0/imbalance_pct:117
  cpu1/domain0/busy_factor:16
  cpu1/domain0/max_newidle_lb_cost:18112
  cpu1/domain0/max_interval:16
  cpu1/domain0/min_interval:8
  cpu2/domain0/level:1
  cpu2/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu2/domain0/name:MC
  cpu2/domain0/cache_nice_tries:1
  cpu2/domain0/imbalance_pct:117
  cpu2/domain0/busy_factor:16
  cpu2/domain0/max_newidle_lb_cost:3147
  cpu2/domain0/max_interval:16
  cpu2/domain0/min_interval:8
  cpu3/domain0/level:1
  cpu3/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu3/domain0/name:MC
  cpu3/domain0/cache_nice_tries:1
  cpu3/domain0/imbalance_pct:117
  cpu3/domain0/busy_factor:16
  cpu3/domain0/max_newidle_lb_cost:16399
  cpu3/domain0/max_interval:16
  cpu3/domain0/min_interval:8
  cpu4/domain0/level:1
  cpu4/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu4/domain0/name:MC
  cpu4/domain0/cache_nice_tries:1
  cpu4/domain0/imbalance_pct:117
  cpu4/domain0/busy_factor:16
  cpu4/domain0/max_newidle_lb_cost:27180
  cpu4/domain0/max_interval:16
  cpu4/domain0/min_interval:8
  cpu5/domain0/level:1
  cpu5/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu5/domain0/name:MC
  cpu5/domain0/cache_nice_tries:1
  cpu5/domain0/imbalance_pct:117
  cpu5/domain0/busy_factor:16
  cpu5/domain0/max_newidle_lb_cost:18384
  cpu5/domain0/max_interval:16
  cpu5/domain0/min_interval:8
  cpu6/domain0/level:1
  cpu6/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu6/domain0/name:MC
  cpu6/domain0/cache_nice_tries:1
  cpu6/domain0/imbalance_pct:117
  cpu6/domain0/busy_factor:16
  cpu6/domain0/max_newidle_lb_cost:16261
  cpu6/domain0/max_interval:16
  cpu6/domain0/min_interval:8
  cpu7/domain0/level:1
  cpu7/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
  cpu7/domain0/name:MC
  cpu7/domain0/cache_nice_tries:1
  cpu7/domain0/imbalance_pct:117
  cpu7/domain0/busy_factor:16
  cpu7/domain0/max_newidle_lb_cost:7780
  cpu7/domain0/max_interval:16
  cpu7/domain0/min_interval:8

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-03 16:19       ` Breno Leitao
@ 2026-07-04  6:16         ` K Prateek Nayak
  2026-07-06 14:38           ` Dietmar Eggemann
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-07-04  6:16 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hello Breno,

On 7/3/2026 9:49 PM, Breno Leitao wrote:
> Hello Prateek,
> 
> On Fri, Jul 03, 2026 at 04:05:59PM +0530, K Prateek Nayak wrote:
>> On 7/3/2026 3:52 PM, Breno Leitao wrote:
>>> On current linux-next (CONFIG_SCHED_CACHE=y) we hit a kmemleak on an
>>> arm64 box with asymmetric CPU capacity, triggered by a cpuset-driven
>>> sched-domain rebuild:
>>>
>>> kmemleak: unreferenced object 0xffff000100c95e80 (size 32):
>>>   comm "kworker/22:1", pid 407, jiffies 4294669077
>>>   hex dump (first 32 bytes):
>>>     48 00 00 00 48 00 00 00 00 00 00 00 20 00 00 00  H...H....... ...
>>>   backtrace (crc ec5d7053):
>>>     __kmalloc_cache_node_noprof
>>>     build_sched_domains
>>>     partition_sched_domains
>>>     rebuild_sched_domains_locked
>>>     rebuild_sched_domains
>>>     process_scheduled_works
>>>     kthread
>>>     ret_from_fork
>>>   kmemleak: 1 new suspected memory leaks
>>>
>>> The leaked object is a struct sched_domain_shared (32 bytes with
>>> CONFIG_SCHED_CACHE) allocated in __sds_alloc(), inlined into
>>> build_sched_domains()
>>>
>>> Is this a known issue?
>>
>> It is theoretically possible but there is a defensive WARN_ON()
>> in topology code that you should have hit first. Do you see any
>> other warning?
>>
>> If it is not too much trouble, could you add "sched_verbose"
>> to your kernel cmdline (or do
>> echo Y > /sys/kernel/debug/sched/verbose) and redo this cpuset
>> that leaks the data and share the dmesg. It should give some clue
>> what the topology looks like that causes this.
> 
> Sure, does this one help:

Unfortunately no! This doesn't give the span and capacity
information so I cannot really tell what kind of topology
actually triggers this.

Ideally, you enable verbose logging, do the cpuset operation
that leaks the object, and then save the dmesg afterwards. That
will have the complete topology for all CPUs, including their
capacities, and we can piece together a QEMU environment
to debug where it all went wrong.

> 
> 
> 
>   bash-5.1# cat /sys/kernel/debug/kmemleak
>   unreferenced object 0xffff0000c1180820 (size 32):
>     comm "swapper/0", pid 1, jiffies 4294667323
>     hex dump (first 32 bytes):
>       08 00 00 00 08 00 00 00 00 00 00 00 20 00 00 00  ............ ...
>       00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     backtrace (crc f4478cb7):
>       kmemleak_alloc+0x44/0xd8
>       __kmalloc_cache_node_noprof+0x344/0x5d8
>       build_sched_domains+0x2f8/0x2110
>       sched_init_domains+0xec/0x160
>       sched_init_smp+0x48/0x108
>       kernel_init_freeable+0x140/0x200
>       kernel_init+0x30/0x170
>       ret_from_fork+0x10/0x20
>   
>   bash-5.1# cd /sys/kernel/debug/sched/domains/
>   
>   bash-5.1# grep . -r *
>   cpu0/domain0/level:1
>   cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu0/domain0/name:MC
>   cpu0/domain0/cache_nice_tries:1
>   cpu0/domain0/imbalance_pct:117
>   cpu0/domain0/busy_factor:16
>   cpu0/domain0/max_newidle_lb_cost:2127
>   cpu0/domain0/max_interval:16
>   cpu0/domain0/min_interval:8
>   cpu1/domain0/level:1
>   cpu1/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu1/domain0/name:MC
>   cpu1/domain0/cache_nice_tries:1
>   cpu1/domain0/imbalance_pct:117
>   cpu1/domain0/busy_factor:16
>   cpu1/domain0/max_newidle_lb_cost:18112
>   cpu1/domain0/max_interval:16
>   cpu1/domain0/min_interval:8
>   cpu2/domain0/level:1
>   cpu2/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu2/domain0/name:MC
>   cpu2/domain0/cache_nice_tries:1
>   cpu2/domain0/imbalance_pct:117
>   cpu2/domain0/busy_factor:16
>   cpu2/domain0/max_newidle_lb_cost:3147
>   cpu2/domain0/max_interval:16
>   cpu2/domain0/min_interval:8
>   cpu3/domain0/level:1
>   cpu3/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu3/domain0/name:MC
>   cpu3/domain0/cache_nice_tries:1
>   cpu3/domain0/imbalance_pct:117
>   cpu3/domain0/busy_factor:16
>   cpu3/domain0/max_newidle_lb_cost:16399
>   cpu3/domain0/max_interval:16
>   cpu3/domain0/min_interval:8
>   cpu4/domain0/level:1
>   cpu4/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu4/domain0/name:MC
>   cpu4/domain0/cache_nice_tries:1
>   cpu4/domain0/imbalance_pct:117
>   cpu4/domain0/busy_factor:16
>   cpu4/domain0/max_newidle_lb_cost:27180
>   cpu4/domain0/max_interval:16
>   cpu4/domain0/min_interval:8
>   cpu5/domain0/level:1
>   cpu5/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu5/domain0/name:MC
>   cpu5/domain0/cache_nice_tries:1
>   cpu5/domain0/imbalance_pct:117
>   cpu5/domain0/busy_factor:16
>   cpu5/domain0/max_newidle_lb_cost:18384
>   cpu5/domain0/max_interval:16
>   cpu5/domain0/min_interval:8
>   cpu6/domain0/level:1
>   cpu6/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu6/domain0/name:MC
>   cpu6/domain0/cache_nice_tries:1
>   cpu6/domain0/imbalance_pct:117
>   cpu6/domain0/busy_factor:16
>   cpu6/domain0/max_newidle_lb_cost:16261
>   cpu6/domain0/max_interval:16
>   cpu6/domain0/min_interval:8
>   cpu7/domain0/level:1
>   cpu7/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING
>   cpu7/domain0/name:MC
>   cpu7/domain0/cache_nice_tries:1
>   cpu7/domain0/imbalance_pct:117
>   cpu7/domain0/busy_factor:16
>   cpu7/domain0/max_newidle_lb_cost:7780
>   cpu7/domain0/max_interval:16
>   cpu7/domain0/min_interval:8

Is this the default topology before the cpuset? Because
min_interval of 8 means there are 8 CPUs in the span and
you have 8 CPUs in your machine so there seems to be a
single root partition when you captured this.

Also does your dmesg have any warnings from topology.c?

The only way I can see this triggering is there is a single
cpu SD_SHARE_LLC at top which claims all the sd->shared and
then you get a SD_ASYM_CPUCAPACITY_FULL on top which too
tries to claim a sd->shared but all are already gone.

Afaict, all CPUs in your system are on single LLC and you
can't SD_ASYM_CPUCAPACITY_FULL without having more than one
capacity in the span so I'm at a loss currently as to how
this reference gets dropped without being freed.

-- 
Thanks and Regards,
Prateek



* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-04  6:16         ` K Prateek Nayak
@ 2026-07-06 14:38           ` Dietmar Eggemann
  2026-07-07  4:11             ` K Prateek Nayak
  0 siblings, 1 reply; 47+ messages in thread
From: Dietmar Eggemann @ 2026-07-06 14:38 UTC (permalink / raw)
  To: K Prateek Nayak, Breno Leitao
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

Hi Breno & Prateek,

On 04.07.26 08:16, K Prateek Nayak wrote:
> Hello Breno,
> 
> On 7/3/2026 9:49 PM, Breno Leitao wrote:
>> Hello Prateek,
>>
>> On Fri, Jul 03, 2026 at 04:05:59PM +0530, K Prateek Nayak wrote:
>>> On 7/3/2026 3:52 PM, Breno Leitao wrote:
>>>> On current linux-next (CONFIG_SCHED_CACHE=y) we hit a kmemleak on an
>>>> arm64 box with asymmetric CPU capacity, triggered by a cpuset-driven
>>>> sched-domain rebuild:
>>>>
>>>> kmemleak: unreferenced object 0xffff000100c95e80 (size 32):
>>>>   comm "kworker/22:1", pid 407, jiffies 4294669077
>>>>   hex dump (first 32 bytes):
>>>>     48 00 00 00 48 00 00 00 00 00 00 00 20 00 00 00  H...H....... ...
>>>>   backtrace (crc ec5d7053):
>>>>     __kmalloc_cache_node_noprof
>>>>     build_sched_domains
>>>>     partition_sched_domains
>>>>     rebuild_sched_domains_locked
>>>>     rebuild_sched_domains
>>>>     process_scheduled_works
>>>>     kthread
>>>>     ret_from_fork
>>>>   kmemleak: 1 new suspected memory leaks
>>>>
>>>> The leaked object is a struct sched_domain_shared (32 bytes with
>>>> CONFIG_SCHED_CACHE) allocated in __sds_alloc(), inlined into
>>>> build_sched_domains()
>>>>
>>>> Is this a known issue?
>>>
>>> It is theoretically possible but there is a defensive WARN_ON()
>>> in topology code that you should have hit first. Do you see any
>>> other warning?
>>>
>>> If it is not too much trouble, could you add "sched_verbose"
>>> to your kernel cmdline (or do
>>> echo Y > /sys/kernel/debug/sched/verbose) and redo this cpuset
>>> that leaks the data and share the dmesg. It should give some clue
>>> what the topology looks like that causes this.
>>
>> Sure, does this one help:
> 
> Unfortunately no! This doesn't give the span and capacity
> information so I cannot really tell what kind of topology
> actually triggers this.

I'm trying to recreate this on Arm64 qemu with dtb file to get the
asymmetric CPU capacities (emulating Pixel6):

$ cat /sys/devices/system/cpu/cpu*/cpu_capacity
160
160
160
160
498
498
1024
1024

I'm on linux-next master:

eb8711ece2c - (HEAD -> linux-next, tag: next-20260702, linux-next/master)

so it has:

9e005ed21152 - sched/topology: Allow multiple domains to claim
sched_domain_shared (2026-05-19 K Prateek Nayak)

$ cat /sys/kernel/debug/sched/domains/cpu0/domain0/{name,flags}
MC
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING

8 CPUs flat:

$ cat /proc/schedstat | grep "^c\|^d" | awk '{ print $1 " " $2 " " $3}'
cpu0 0 0
domain0 MC ff
cpu1 0 0
domain0 MC ff
cpu2 0 0
domain0 MC ff
cpu3 0 0
domain0 MC ff
cpu4 0 0
domain0 MC ff
cpu5 0 0
domain0 MC ff
cpu6 0 0
domain0 MC ff
cpu7 0 0
domain0 MC ff

> Ideally, you enable verbose logging, do the cpuset operation
> that leaks the object, and then save the dmesg afterwards, which
> will have a complete topology for all CPUs there including
> their capacities and we can piece together a qemu environment
> to debug where it all went wrong.

+1

$ cat /proc/cmdline
... sched_verbose=1

What exactly is this cpuset operation forcing a sched domain rebuild?

I'm trying a split into 2 exclusive cpusets {0,2,4,6} and {1,3,5,7} to
get 2 asymmetric CPU capacity islands.

...

[   95.747359] root domain span: 1,3,5,7
[   95.747760] root domain span: 0,2,4,6
[   95.747771] _sched_cache_active_set: enabling cache aware scheduling
[   95.747788] rd 1,3,5,7: Checking EAS: frequency-invariant load tracking not yet supported
...

$ cat /sys/kernel/debug/sched/domains/cpu0/domain0/{name,flags}
MC
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE
SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC SD_PREFER_SIBLING

$ cat /proc/schedstat | grep "^c\|^d" | awk '{ print $1 " " $2 " " $3}'
cpu0 0 0
domain0 MC 55
cpu1 0 0
domain0 MC aa
cpu2 0 0
domain0 MC 55
cpu3 0 0
domain0 MC aa
cpu4 0 0
domain0 MC 55
cpu5 0 0
domain0 MC aa
cpu6 0 0
domain0 MC 55
cpu7 0 0
domain0 MC aa
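
For reference, the third column of the domain lines above is the span as a hex cpumask. A small Python sketch for decoding it follows. The helper name cpus_in_mask is made up for illustration, and it assumes the usual kernel bitmap formatting (comma-separated 32-bit groups, most significant group first):

```python
def cpus_in_mask(hex_mask):
    """Return the CPU numbers set in a hex cpumask string such as '55'.

    Assumption: commas only separate 32-bit groups (most significant
    first), so dropping them yields one big hexadecimal number.
    """
    value = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


if __name__ == "__main__":
    # Masks taken from the schedstat output before and after the split.
    for mask in ("ff", "55", "aa"):
        print(mask, cpus_in_mask(mask))
```

With the masks above, 55 maps to CPUs 0,2,4,6 and aa to CPUs 1,3,5,7, which matches the two root domain spans in the dmesg snippet.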

But this hasn't triggered kmemleak so far.

That said, it triggered once (~600s uptime) but w/o any cpuset
operation? Never happened again since then :-(

[...]

> Is this the default topology before the cpuset? Because
> min_interval of 8 means there are 8 CPUs in the span and
> you have 8 CPUs in your machine so there seems to be a
> single root partition when you captured this.

+1

> Also does your dmesg have any warnings from topology.c?
> 
> The only way I can see this triggering is there is a single
> cpu SD_SHARE_LLC at top which claims all the sd->shared and
> then you get a SD_ASYM_CPUCAPACITY_FULL on top which too
> tries to claim a sd->shared but all are already gone.

?

> Afaict, all CPUs in your system are on single LLC and you
> can't SD_ASYM_CPUCAPACITY_FULL without having more than one
> capacity in the span so I'm at a loss currently as to how
> this reference gets dropped without being freed.

$ cat /sys/devices/system/cpu/cpu*/cpu_capacity  should tell.
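
As a rough illustration of what the capacities tell us, here is a Python sketch of how I understand the asym flag classification. The function name classify_asym and its arguments are hypothetical and this is a paraphrase, not the topology.c code. It assumes a span is asymmetric once it contains at least two distinct capacities, and "full" once it contains every capacity found in the root domain's CPU map:

```python
# Capacities of cpu0..cpu7 from the qemu setup earlier in this mail.
QEMU_CAPACITY = [160, 160, 160, 160, 498, 498, 1024, 1024]


def classify_asym(span, cpu_map, capacity):
    """Return the set of asym flags a domain with this span would get.

    span:     CPUs covered by the sched domain
    cpu_map:  CPUs of the root domain being built
    capacity: per-CPU capacity, indexed by CPU number
    """
    in_span = {capacity[cpu] for cpu in span}
    in_map = {capacity[cpu] for cpu in cpu_map}

    if len(in_span) < 2:
        return set()  # a single capacity: no asymmetry visible in this span

    flags = {"SD_ASYM_CPUCAPACITY"}
    if in_map <= in_span:
        # No capacity of the root domain is missing from the span.
        flags.add("SD_ASYM_CPUCAPACITY_FULL")
    return flags


if __name__ == "__main__":
    for island in ([0, 2, 4, 6], [1, 3, 5, 7]):
        print(island, sorted(classify_asym(island, island, QEMU_CAPACITY)))
```

Under these assumptions both islands of the split still contain 160, 498 and 1024, so each MC domain keeps SD_ASYM_CPUCAPACITY and SD_ASYM_CPUCAPACITY_FULL, as seen in the flags above.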


Another thing: does your system enable 'cache aware scheduling', i.e. do
you not see:

pr_info("%s: cache aware scheduling not supported on this platform\n",
__func__);





* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-06 14:38           ` Dietmar Eggemann
@ 2026-07-07  4:11             ` K Prateek Nayak
  2026-07-07 13:31               ` Breno Leitao
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-07-07  4:11 UTC (permalink / raw)
  To: Dietmar Eggemann, Breno Leitao
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

Hello Dietmar, Breno,

On 7/6/2026 8:08 PM, Dietmar Eggemann wrote:
>> Is this the default topology before the cpuset? Because
>> min_interval of 8 means there are 8 CPUs in the span and
>> you have 8 CPUs in your machine so there seems to be a
>> single root partition when you captured this.
> 
> +1
> 
>> Also does your dmesg have any warnings from topology.c?
>>
>> The only way I can see this triggering is there is a single
>> cpu SD_SHARE_LLC at top which claims all the sd->shared and
>> then you get a SD_ASYM_CPUCAPACITY_FULL on top which too
>> tries to claim a sd->shared but all are already gone.
> 
> ?

I may have found what might be happening. Since the last SD_SHARE_LLC
and the first SD_ASYM_CPUCAPACITY_FULL overlap,
init_sched_domain_shared() for SD_SHARE_LLC might just be overwriting
the assignment from claim_asym_sched_domain_shared() and we are left
with a non-zero refcounted shared that evades claim_allocations() but
is not used anywhere either.

Breno, could you try the below diff:

  (Only build tested)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 622e2e01974c..07e5a2c08132 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2942,6 +2942,16 @@ init_sched_domain_shared(struct s_data *d, struct sched_domain *sd, int flags)
 	struct sched_domain_shared *sds = NULL;
 	int cpu;
 
+	/*
+	 * If sd->shared is already assigned, there is an overlap in the
+	 * flags (For example: last SD_SHARE_LLC domain, is also the
+	 * first SD_ASYM_CPUCAPACITY_FULL domain).
+	 *
+	 * Return early to prevent overriding the sd->shared already
+	 * assigned which can lead to dangling reference.
+	 */
+	if (sd->shared)
+		return;
 	/*
 	 * Multiple domains can try to claim a shared object like
 	 * SD_ASYM_CPUCAPACITY and SD_SHARE_LLC which can alias to
---

It is prepared on top of tip:sched/core but should apply cleanly
on any tree after v7.2-rc1 since nothing else has changed in
these parts since the merge window.
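
To make the suspected sequence easier to follow, here is a toy Python model of it. This is only a sketch of the refcount bookkeeping as I understand it, not kernel code: the Shared and Domain classes, assign_shared() and leaked_after_rebuild() are all hypothetical stand-ins, and a single domain with two candidate shared objects is assumed.

```python
class Shared:
    """Stand-in for struct sched_domain_shared: only the refcount matters."""

    def __init__(self, name):
        self.name = name
        self.ref = 0


class Domain:
    """Stand-in for struct sched_domain: only the ->shared pointer matters."""

    def __init__(self):
        self.shared = None


def assign_shared(sd, sds, skip_if_assigned):
    """Take a reference on sds and point sd->shared at it."""
    if skip_if_assigned and sd.shared is not None:
        return  # models the early return added by the diff above
    sds.ref += 1
    sd.shared = sds  # a previous pointer is lost, its reference is not dropped


def leaked_after_rebuild(skip_if_assigned):
    """Build and destroy one domain, return names of objects never freed."""
    pool = [Shared("first"), Shared("second")]
    sd = Domain()

    assign_shared(sd, pool[0], skip_if_assigned)  # asym claim
    assign_shared(sd, pool[1], skip_if_assigned)  # SD_SHARE_LLC init

    # Build-time cleanup: objects with a non-zero refcount are treated as
    # owned by a domain and skipped, everything else is freed.
    freed = {sds.name for sds in pool if sds.ref == 0}

    # Domain teardown: drop the reference held through sd->shared.
    sd.shared.ref -= 1
    if sd.shared.ref == 0:
        freed.add(sd.shared.name)

    return sorted(sds.name for sds in pool if sds.name not in freed)


if __name__ == "__main__":
    print("without early return:", leaked_after_rebuild(False))
    print("with early return:   ", leaked_after_rebuild(True))
```

In this model the first object keeps a non-zero refcount after the overwrite but nothing points at it any more, so neither cleanup path frees it. With the early return the second object is never claimed and gets freed as unused.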

-- 
Thanks and Regards,
Prateek



* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-07  4:11             ` K Prateek Nayak
@ 2026-07-07 13:31               ` Breno Leitao
  2026-07-07 13:59                 ` K Prateek Nayak
  0 siblings, 1 reply; 47+ messages in thread
From: Breno Leitao @ 2026-07-07 13:31 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Dietmar Eggemann, Andrea Righi, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hello Prateek,

On Tue, Jul 07, 2026 at 09:41:29AM +0530, K Prateek Nayak wrote:
> Hello Dietmar, Breno,
> 
> On 7/6/2026 8:08 PM, Dietmar Eggemann wrote:
> >> Is this the default topology before the cpuset? Because
> >> min_interval of 8 means there are 8 CPUs in the span and
> >> you have 8 CPUs in your machine so there seems to be a
> >> single root partition when you captured this.
> > 
> > +1
> > 
> >> Also does your dmesg have any warnings from topology.c?
> >>
> >> The only way I can see this triggering is there is a single
> >> cpu SD_SHARE_LLC at top which claims all the sd->shared and
> >> then you get a SD_ASYM_CPUCAPACITY_FULL on top which too
> >> tries to claim a sd->shared but all are already gone.
> > 
> > ?
> 
> I may have found what might be happening. Since the last SD_SHARE_LLC
> and the first SD_ASYM_CPUCAPACITY_FULL overlap,
> init_sched_domain_shared() for SD_SHARE_LLC might just be overwriting
> the assignment from claim_asym_sched_domain_shared() and we are left
> with a non-zero refcounted shared that evades claim_allocations() but
> is not used anywhere either.
> 
> Breno, could you try the below diff:

Sure, I've tested it and I don't see the kmemleak report anymore, that
solved the issue I've raised.

Feel free to include the following if you are planning to send it to the
list:

Tested-by: Breno Leitao <leitao@debian.org>


* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-07 13:31               ` Breno Leitao
@ 2026-07-07 13:59                 ` K Prateek Nayak
  2026-07-07 15:31                   ` Dietmar Eggemann
  0 siblings, 1 reply; 47+ messages in thread
From: K Prateek Nayak @ 2026-07-07 13:59 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Dietmar Eggemann, Andrea Righi, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hello Breno,

On 7/7/2026 7:01 PM, Breno Leitao wrote:
>> I may have found what might be happening. Since the last SD_SHARE_LLC
>> and the first SD_ASYM_CPUCAPACITY_FULL overlap,
>> init_sched_domain_shared() for SD_SHARE_LLC might just be overwriting
>> the assignment from claim_asym_sched_domain_shared() and we are left
>> with a non-zero refcounted shared that evades claim_allocations() but
>> is not used anywhere either.
>>
>> Breno, could you try the below diff:
> 
> Sure, I've tested it and I don't see the kmemleak report anymore, that
> solved the issue I've raised.
> 
> Feel free to include the following if you are planning to send it to the
> list:
> 
> Tested-by: Breno Leitao <leitao@debian.org>

Thanks a ton! I'll send out an official patch shortly for Peter to pick
up once he is back from holidays after some more testing. Thank you
again for the report and testing. Much appreciated _/\_

-- 
Thanks and Regards,
Prateek



* Re: kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE
  2026-07-07 13:59                 ` K Prateek Nayak
@ 2026-07-07 15:31                   ` Dietmar Eggemann
  0 siblings, 0 replies; 47+ messages in thread
From: Dietmar Eggemann @ 2026-07-07 15:31 UTC (permalink / raw)
  To: K Prateek Nayak, Breno Leitao
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On 07.07.26 15:59, K Prateek Nayak wrote:
> Hello Breno,
> 
> On 7/7/2026 7:01 PM, Breno Leitao wrote:
>>> I may have found what might be happening. Since the last SD_SHARE_LLC
>>> and the first SD_ASYM_CPUCAPACITY_FULL overlap,
>>> init_sched_domain_shared() for SD_SHARE_LLC might just be overwriting
>>> the assignment from claim_asym_sched_domain_shared() and we are left
>>> with a non-zero refcounted shared that evades claim_allocations() but
>>> is not used anywhere either.
>>>
>>> Breno, could you try the below diff:
>>
>> Sure, I've tested it and I don't see the kmemleak report anymore, that
>> solved the issue I've raised.
>>
>> Feel free to include the following if you are planning to send it to the
>> list:
>>
>> Tested-by: Breno Leitao <leitao@debian.org>
> 
> Thanks a ton! I'll send out an official patch shortly for Peter to pick
> up once he is back from holidays after some more testing. Thank you
> again for the report and testing. Much appreciated _/\_

Switched to an Arm64 board (Juno) with {L B B L L L} where this issue
happens all the time since we rebuild the sched domain after CPU
capacity setup and EAS bringup during boot. Simple CPU hotplug also
shows it:

With additional printks:

base:

root@juno:~# dmesg | grep -i "shared\|sd->shared\|_domain"
[    0.224492] build_sched_domains() this_cpu=0
[    0.228817] build_sched_domains() cpu=0 sd=SMT
[    0.233284] claim_asym_sched_domain_shared() cpu=0 sd=SMT
[    0.238733] init_sched_domain_shared() this_cpu=0 cpu=0 flags=32
sd=MC sds=ffff00080003e800
          ^^^^^^^^^^^^^^^^
[    0.247129] build_sched_domains() cpu=0 sd=MC has SD_SHARE_LLC
[    0.252987] init_sched_domain_shared() this_cpu=0 cpu=1 flags=512
sd=MC sds=ffff00080003e7e0
          ^^^^^^^^^^^^^^^^
[    0.261460] build_sched_domains() cpu=1 sd=SMT
...
[    2.066061] free_sched_domain_shared() sds=ffff00080003e7e0 call kfree()
                                              ^^^^^^^^^^^^^^^^


root@juno:~# echo scan > /sys/kernel/debug/kmemleak
[  104.064760] kmemleak: unreferenced object 0xffff00080003e800 (size 32):
                                             ^^^^^^^^^^^^^^^^^^
[  104.064778] kmemleak:   comm "swapper/0", pid 1, jiffies 4294892335
[  104.064786] kmemleak:   hex dump (first 32 bytes):
[  104.064793] kmemleak:     06 00 00 00 06 00 00 00 00 00 00 00 20 00 00 00  ............ ...
[  104.064800] kmemleak:     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  104.064805] kmemleak:   backtrace (crc 560f497a):
[  104.064809] kmemleak:     kmemleak_alloc+0x38/0x44
[  104.064825] kmemleak:     __kmalloc_cache_node_noprof+0x2bc/0x3a8
[  104.064837] kmemleak:     build_sched_domains+0x228/0x1540
[  104.064846] kmemleak:     sched_init_domains+0xd8/0x134
[  104.064859] kmemleak:     sched_init_smp+0x88/0x10c
[  104.064868] kmemleak:     kernel_init_freeable+0x14c/0x2d4
[  104.064877] kmemleak:     kernel_init+0x2c/0x130
[  104.064884] kmemleak:     ret_from_fork+0x10/0x20

w/ patch:

[    0.224451] build_sched_domains() this_cpu=0
[    0.228778] build_sched_domains() cpu=0 sd=SMT
[    0.233245] claim_asym_sched_domain_shared() cpu=0 sd=SMT
[    0.238694] init_sched_domain_shared() this_cpu=0 cpu=0 flags=32
sd=MC sds=ffff00080003e800
          ^^^^^^^^^^^^^^^^
[    0.247090] build_sched_domains() cpu=0 sd=MC has SD_SHARE_LLC
[    0.252948] init_sched_domain_shared() this_cpu=0 sd=MC
sd->shared=ffff00080003e800
[    0.252958] build_sched_domains() cpu=1 sd=SMT
...
[    2.090601] free_sched_domain_shared() sds=ffff00080003e800 call kfree()
                                              ^^^^^^^^^^^^^^^^


Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>




end of thread, other threads:[~2026-07-07 15:31 UTC | newest]

Thread overview: 47+ messages
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
2026-05-11 13:04   ` Vincent Guittot
2026-05-15  6:49   ` Shrikanth Hegde
2026-05-16  5:45     ` Andrea Righi
2026-05-16 17:15       ` Shrikanth Hegde
2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
2026-05-21 19:47   ` [PATCH 1/5] " Marek Szyprowski
2026-05-21 20:13     ` Andrea Righi
2026-05-09 18:07 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
2026-05-11 13:04   ` Vincent Guittot
2026-05-15 10:05   ` Shrikanth Hegde
2026-05-16  5:58     ` [PATCH v2 " Andrea Righi
2026-05-16 17:19       ` Shrikanth Hegde
2026-05-18 20:58       ` Peter Zijlstra
2026-05-18 21:31         ` Andrea Righi
2026-05-19  5:52         ` K Prateek Nayak
2026-05-19  6:43           ` Andrea Righi
2026-05-19  7:47             ` K Prateek Nayak
2026-05-19  7:54               ` Andrea Righi
2026-05-19  8:46           ` Peter Zijlstra
2026-05-19 11:27             ` K Prateek Nayak
2026-05-19 11:47               ` Peter Zijlstra
2026-05-25  8:30                 ` Chen, Yu C
2026-05-20  8:34       ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
2026-07-03 10:22   ` kmemleak: sched_domain_shared leaked on asymmetric-capacity + SCHED_CACHE Breno Leitao
2026-07-03 10:35     ` K Prateek Nayak
2026-07-03 16:19       ` Breno Leitao
2026-07-04  6:16         ` K Prateek Nayak
2026-07-06 14:38           ` Dietmar Eggemann
2026-07-07  4:11             ` K Prateek Nayak
2026-07-07 13:31               ` Breno Leitao
2026-07-07 13:59                 ` K Prateek Nayak
2026-07-07 15:31                   ` Dietmar Eggemann
2026-05-09 18:07 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-05-11 13:07   ` Vincent Guittot
2026-05-11 13:45     ` Andrea Righi
2026-05-11 14:25     ` [PATCH v2 " Andrea Righi
2026-05-20  8:34       ` [tip: sched/core] " tip-bot2 for Andrea Righi
2026-05-09 18:07 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-05-11 13:07   ` Vincent Guittot
2026-05-15 10:09   ` Shrikanth Hegde
2026-05-16  9:04     ` Andrea Righi
2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for Andrea Righi
2026-05-09 18:07 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
2026-05-11 13:08   ` Vincent Guittot
2026-05-20  8:34   ` [tip: sched/core] " tip-bot2 for K Prateek Nayak
