[PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity

The Linux Kernel Mailing List

* [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-04-28 14:41 Andrea Righi
  2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness.

= Problem =

Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the physical core doesn't deliver its full nominal
capacity. As a result, several asym-cpu-capacity paths may pick idle CPUs with
high nominal capacity that are not actually good destinations.

= Solution =

This patch set aligns those paths with a simple rule already used elsewhere:
when SMT is active, prefer fully idle cores and avoid treating partially idle
SMT siblings as full-capacity targets where that would mislead load balance.
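
To make the rule concrete, here is a minimal userspace sketch (not kernel
code). It assumes a hypothetical 64-bit bitmask model in which bit i stands for
logical CPU i; core_fully_idle() and good_full_capacity_target() are names
invented for illustration and do not exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit i of a mask stands for logical CPU i. */
typedef uint64_t cpumask64_t;

/* A core is fully idle only if every SMT sibling in smt_mask is idle. */
static bool core_fully_idle(cpumask64_t smt_mask, cpumask64_t idle_mask)
{
	return (smt_mask & idle_mask) == smt_mask;
}

/*
 * An idle CPU is treated as a full-capacity target only when SMT is
 * inactive or when all siblings of its core are idle as well.
 */
static bool good_full_capacity_target(int cpu, cpumask64_t smt_mask,
				      cpumask64_t idle_mask, bool smt_active)
{
	assert(cpu >= 0 && cpu < 64);

	/* A busy CPU is never an idle-selection target. */
	if (!((idle_mask >> cpu) & 1))
		return false;

	return !smt_active || core_fully_idle(smt_mask, idle_mask);
}
```

With siblings {0, 1} and only CPU 0 idle, CPU 0 is rejected while SMT is
active and accepted once SMT is off, which is the distinction the series
introduces.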

Patch set summary:
 - Drop the redundant rcu_read_lock()/rcu_read_unlock() pair in the NOHZ kick
   path (preliminary cleanup).
 - Attach sched_domain_shared to sd_asym_cpucapacity on SD_ASYM_CPUCAPACITY
   systems, so that the has_idle_cores hint is used consistently in the wakeup
   idle scan, and rename sd_llc_shared -> sd_balance_shared.
 - Prefer fully-idle SMT cores in asym-capacity idle selection: in the wakeup
   fast path, extend select_idle_capacity() / asym_fits_cpu() so idle
   selection can prefer CPUs on fully idle cores.
 - Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
 - Add SIS_UTIL support to select_idle_capacity(): apply the same
   SIS_UTIL-controlled idle-scan mechanism already used by select_idle_cpu().

This patch set has been tested on the new NVIDIA Vera Rubin platform, where SMT
is enabled and the firmware exposes small frequency variations (+/-~5%) as
differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.

Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.

Alternative approaches have been evaluated, such as equalizing CPU capacities,
either by exposing uniform values via firmware or normalizing them in the kernel
by grouping CPUs within a small capacity window (+/-5%).

However, the SMT-aware SD_ASYM_CPUCAPACITY approach has shown better results so
far. Improving this policy also seems worthwhile in general, as future platforms
may enable SMT with asymmetric CPU topologies.

Performance results on Vera Rubin with SD_ASYM_CPUCAPACITY (mainline) vs
SD_ASYM_CPUCAPACITY + SMT:

 - NVBLAS benchblas (one task / SMT core):

 +---------------------------------+--------+
 | Configuration                   | gflops |
 +---------------------------------+--------+
 | ASYM (mainline) + SIS_UTIL      |  5478  |
 | ASYM (mainline) + NO_SIS_UTIL   |  5491  |
 |                                 |        |
 | NO ASYM + SIS_UTIL              |  8912  |
 | NO ASYM + NO_SIS_UTIL           |  8978  |
 |                                 |        |
 | ASYM + SMT + SIS_UTIL           |  9259  |
 | ASYM + SMT + NO_SIS_UTIL        |  9291  |
 +---------------------------------+--------+

 - DCPerf MediaWiki (all CPUs):

 +---------------------------------+--------+--------+--------+--------+
 | Configuration                   |   rps  |  p50   |  p95   |  p99   |
 +---------------------------------+--------+--------+--------+--------+
 | ASYM (mainline) + SIS_UTIL      |  7994  |  0.052 |  0.223 |  0.246 |
 | ASYM (mainline) + NO_SIS_UTIL   |  7993  |  0.052 |  0.221 |  0.245 |
 |                                 |        |        |        |        |
 | NO ASYM + SIS_UTIL              |  8113  |  0.067 |  0.184 |  0.225 |
 | NO ASYM + NO_SIS_UTIL           |  8093  |  0.068 |  0.184 |  0.223 |
 |                                 |        |        |        |        |
 | ASYM + SMT + SIS_UTIL           |  8129  |  0.076 |  0.149 |  0.188 |
 | ASYM + SMT + NO_SIS_UTIL        |  8138  |  0.076 |  0.148 |  0.186 |
 +---------------------------------+--------+--------+--------+--------+

In the MediaWiki case SMT awareness has less impact, because all CPUs are in
use for most of the run, but it still seems to help reduce tail latency.

Tests have also been conducted on NVIDIA Grace (which does not support SMT) to
ensure that SIS_UTIL support in select_idle_capacity() does not introduce
regressions; the results show slight improvements under the same workloads.

See also:
 - https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
 - https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com

Changes in v5:
 - Drop redundant RCU protection in nohz_balancer_kick() (Prateek Nayak)
 - Do not remove CPU capacity asymmetry / SMT warning (Prateek Nayak)
 - Link to v4: https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

Changes in v4:
 - Rename sd_llc_shared -> sd_balance_shared
 - Add preliminary cleanup patch to use guard(rcu)() for sched_domain RCU
   (Prateek Nayak)
 - Apply SIS_UTIL scan cap only with !prefers_idle_core, matching
   select_idle_cpu() / has_idle_core logic (Vincent Guittot)
 - Cache env->dst_cpu idle state to reduce is_core_idle() calls (Prateek Nayak)
 - Remove warning about CPU capacity asymmetry not supporting SMT
 - Link to v3: https://lore.kernel.org/all/20260423074135.380390-1-arighi@nvidia.com

Changes in v3:
 - Add SIS_UTIL support to select_idle_capacity() (K Prateek Nayak)
 - Attach sched_domain_shared to sd_asym_cpucapacity (K Prateek Nayak)
 - Add enum for the different fit state (K Prateek Nayak)
 - Update has_idle_cores hint (Vincent Guittot)
 - Link to v2: https://lore.kernel.org/all/20260403053654.1559142-1-arighi@nvidia.com

Changes in v2:
 - Rework SMT awareness logic in select_idle_capacity() (K Prateek Nayak)
 - Drop EAS and find_new_ilb() changes for now
 - Link to v1: https://lore.kernel.org/all/20260326151211.1862600-1-arighi@nvidia.com

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git sched-asym-smt-v5

Andrea Righi (3):
      sched/fair: Drop redundant RCU read lock in NOHZ kick path
      sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
      sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity

K Prateek Nayak (2):
      sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
      sched/fair: Add SIS_UTIL support to select_idle_capacity()

 kernel/sched/fair.c     | 157 ++++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h    |   2 +-
 kernel/sched/topology.c |  90 +++++++++++++++++++++++----
 3 files changed, 195 insertions(+), 54 deletions(-)

* [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-04-28 14:41 ` Andrea Righi
  2026-04-28 16:29   ` K Prateek Nayak
  2026-05-05  9:15   ` [PATCH " Dietmar Eggemann
  2026-04-28 14:41 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock()/rcu_read_unlock() pair used around
sched_domain accesses in this path is redundant. Rely on the existing
IRQ-disabled context (and the rcu_dereference_all() checking) instead.

No functional change intended.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 40 ++++++++++++----------------------------
 1 file changed, 12 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353ad..e0f75dedc8456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12749,8 +12749,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12758,8 +12756,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12775,8 +12773,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12787,10 +12785,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12799,10 +12795,10 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds) {
 		/*
 		 * If there is an imbalance between LLC domains (IOW we could
@@ -12814,13 +12810,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12832,17 +12824,13 @@ static void nohz_balancer_kick(struct rq *rq)
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12861,17 +12849,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*
-- 
2.54.0


* [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
  2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-04-28 14:41 ` Andrea Righi
  2026-05-05 12:48   ` Dietmar Eggemann
  2026-05-06  9:45   ` Vincent Guittot
  2026-04-28 14:41 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

From: K Prateek Nayak <kprateek.nayak@amd.com>

On asymmetric CPU capacity systems, the wakeup path uses
select_idle_capacity(), which scans the span of sd_asym_cpucapacity
rather than sd_llc.

The has_idle_cores hint, however, lives on sd_llc->shared, so the
wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
the actual scan/decision spans the asym domain. nr_busy_cpus also lives
in the same shared sched_domain data, but it is never used in the asym
CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

  1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

  2) CPUs in an exclusive cpuset that carves out a symmetric capacity
     island: has_asym is system-wide but those CPUs have no
     SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
     the symmetric LLC path in select_idle_sibling();

  3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
     SD_NUMA-built domain. init_sched_domain_shared() keys the shared
     blob off cpumask_first(span), which on overlapping NUMA domains
     would alias unrelated spans onto the same blob. Keep the shared
     object on the LLC there; select_idle_capacity() gracefully skips
     the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
as it is no longer strictly tied to the LLC.
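
The attach and fallback rules above can be modeled with a small userspace
sketch. This is not the patch's code: the DOM_* flag values, struct domain,
and pick_shared_owner() are hypothetical stand-ins for the kernel's SD_*
flags, struct sched_domain, and the logic split between
claim_asym_sched_domain_shared() and build_sched_domains():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the kernel's real SD_* values differ. */
#define DOM_SHARE_LLC	(1u << 0)
#define DOM_ASYM_FULL	(1u << 1)
#define DOM_NUMA	(1u << 2)

/* Simplified stand-in for struct sched_domain. */
struct domain {
	unsigned int flags;
	struct domain *parent;
};

enum shared_owner { OWNER_NONE, OWNER_LLC, OWNER_ASYM };

/* Decide which domain in @base's hierarchy owns the shared object. */
static enum shared_owner pick_shared_owner(struct domain *base, bool has_asym)
{
	struct domain *sd;

	assert(base != NULL);

	if (has_asym) {
		/* Innermost ancestor with full capacity asymmetry. */
		for (sd = base; sd; sd = sd->parent)
			if (sd->flags & DOM_ASYM_FULL)
				break;

		/* Claim it unless missing (case 2) or NUMA-built (case 3). */
		if (sd && !(sd->flags & DOM_NUMA))
			return OWNER_ASYM;
	}

	/* Fallback: the topmost LLC-sharing domain owns the shared object. */
	sd = base;
	while (sd->parent && (sd->parent->flags & DOM_SHARE_LLC))
		sd = sd->parent;

	return (sd->flags & DOM_SHARE_LLC) ? OWNER_LLC : OWNER_NONE;
}
```

In this model a symmetric island inside an asymmetric system (has_asym set,
but no DOM_ASYM_FULL ancestor) still resolves to OWNER_LLC, matching case 2.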

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c     | 17 +++++---
 kernel/sched/sched.h    |  2 +-
 kernel/sched/topology.c | 90 +++++++++++++++++++++++++++++++++++------
 3 files changed, 89 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e0f75dedc8456..bbdf537f61154 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7790,7 +7790,7 @@ static inline void set_idle_cores(int cpu, int val)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		WRITE_ONCE(sds->has_idle_cores, val);
 }
@@ -7799,7 +7799,7 @@ static inline bool test_idle_cores(int cpu)
 {
 	struct sched_domain_shared *sds;
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds)
 		return READ_ONCE(sds->has_idle_cores);
 
@@ -7808,7 +7808,7 @@ static inline bool test_idle_cores(int cpu)
 
 /*
  * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
+ * information in sd_balance_shared->has_idle_cores.
  *
  * Since SMT siblings share all cache levels, inspecting this limited remote
  * state should be fairly cheap.
@@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 
-	if (sched_feat(SIS_UTIL)) {
+	if (sched_feat(SIS_UTIL) && sd->shared) {
 		/*
 		 * Increment because !--nr is the condition to stop scan.
 		 *
@@ -12826,7 +12826,11 @@ static void set_cpu_sd_state_busy(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || !sd->nohz_idle)
+	/*
+	 * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
+	 * domain has no shared object there is nothing to clear or account.
+	 */
+	if (!sd || !sd->shared || !sd->nohz_idle)
 		return;
 	sd->nohz_idle = 0;
 
@@ -12851,7 +12855,8 @@ static void set_cpu_sd_state_idle(int cpu)
 	struct sched_domain *sd;
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
-	if (!sd || sd->nohz_idle)
+	/* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
+	if (!sd || !sd->shared || sd->nohz_idle)
 		return;
 	sd->nohz_idle = 1;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9f63b15d309d1..330f5893c4561 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(int, sd_share_id);
-DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d552..69d465cc93ab4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(int, sd_share_id);
-DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
 	int id = cpu;
 	int size = 1;
 
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+	/*
+	 * The shared object is attached to sd_asym_cpucapacity only when the
+	 * asym domain is non-overlapping (i.e., not built from SD_NUMA).
+	 * On overlapping (NUMA) asym domains we fall back to letting the
+	 * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
+	 * here.
+	 */
+	if (sd && sd->shared)
+		sds = sd->shared;
+
+	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
 	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 
-		/* If sd_llc exists, sd_llc_shared should exist too. */
-		WARN_ON_ONCE(!sd->shared);
-		sds = sd->shared;
+		/*
+		 * If sd_asym_cpucapacity didn't claim the shared object,
+		 * sd_llc must have one linked.
+		 */
+		if (!sds) {
+			WARN_ON_ONCE(!sd->shared);
+			sds = sd->shared;
+		}
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
-	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+	rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
@@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
-	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
-	rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
 }
 
 /*
@@ -2650,6 +2665,49 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
 	}
 }
 
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+	int sd_id = cpumask_first(sched_domain_span(sd));
+
+	sd->shared = *per_cpu_ptr(d->sds, sd_id);
+	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+	atomic_inc(&sd->shared->ref);
+}
+
+/*
+ * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
+ * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
+ * not an overlapping NUMA-built domain (then LLC should claim shared).
+ *
+ * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
+ * then LLC must claim shared instead.
+ *
+ * Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct capacities
+ * exist in the domain span, so the asym domain we attach to cannot degenerate
+ * into a single-capacity group. The relevant edge cases are instead covered by
+ * the caveats above.
+ *
+ * Return true if this CPU's asym path claimed sd->shared, false otherwise.
+ */
+static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
+	struct sched_domain *sd_asym;
+
+	if (!sd)
+		return false;
+
+	sd_asym = sd;
+	while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
+		sd_asym = sd_asym->parent;
+
+	if (!sd_asym || (sd_asym->flags & SD_NUMA))
+		return false;
+
+	init_sched_domain_shared(d, sd_asym);
+	return true;
+}
+
 /*
  * Build sched domains for a given set of CPUs and attach the sched domains
  * to the individual CPUs
@@ -2708,20 +2766,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 
 	for_each_cpu(i, cpu_map) {
+		bool asym_claimed = false;
+
 		sd = *per_cpu_ptr(d.sd, i);
 		if (!sd)
 			continue;
 
+		if (has_asym)
+			asym_claimed = claim_asym_sched_domain_shared(&d, i);
+
 		/* First, find the topmost SD_SHARE_LLC domain */
 		while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
 			sd = sd->parent;
 
 		if (sd->flags & SD_SHARE_LLC) {
-			int sd_id = cpumask_first(sched_domain_span(sd));
-
-			sd->shared = *per_cpu_ptr(d.sds, sd_id);
-			atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
-			atomic_inc(&sd->shared->ref);
+			/*
+			 * Initialize the sd->shared for SD_SHARE_LLC unless
+			 * the asym path above already claimed it.
+			 */
+			if (!asym_claimed)
+				init_sched_domain_shared(&d, sd);
 
 			/*
 			 * In presence of higher domains, adjust the
-- 
2.54.0


* [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
  2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-04-28 14:41 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-04-28 14:41 ` Andrea Righi
  2026-05-05 17:20   ` Dietmar Eggemann
  2026-05-06 10:29   ` Vincent Guittot
  2026-04-28 14:41 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active, always prefer fully-idle SMT cores over partially-idle
ones.

Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate
capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads when running as many tasks as there are SMT cores.
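
The resulting preference order can be sketched in userspace C. This is an
illustrative model only: struct candidate, rank_candidate(), and pick_best()
are hypothetical names, all candidates are assumed to be idle CPUs, and the
value -5 is this sketch's stand-in for the early return taken when a fully
fitting CPU on an idle core is found:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one idle CPU seen during the scan. */
struct candidate {
	int util_fit;		/* >0 fits, 0 misfit, -1 only uclamp_min misfits */
	bool core_idle;		/* all SMT siblings of this CPU are idle */
	unsigned long capacity;
};

/* Lower rank is more preferred; an idle core always beats a busy one. */
static int rank_candidate(int util_fit, bool core_idle)
{
	int rank;

	assert(util_fit >= -1);

	/* Busy-core range: -2 (fits), -1 (uclamp misfit), 0 (misfit). */
	rank = util_fit > 0 ? -2 : util_fit;

	/* Idle-core bias shifts the whole range below the busy-core one. */
	if (core_idle)
		rank -= 3;

	return rank;
}

/* Return the index of the preferred candidate, or -1 if none is picked. */
static int pick_best(const struct candidate *c, int n)
{
	unsigned long best_cap = 0;
	int best = -1, best_rank = 0;

	for (int i = 0; i < n; i++) {
		int rank = rank_candidate(c[i].util_fit, c[i].core_idle);

		/* Fully fitting CPU on an idle core: stop scanning. */
		if (rank == -5)
			return i;

		/* Better rank first, then higher capacity at equal rank. */
		if (rank < best_rank ||
		    (rank == best_rank && c[i].capacity > best_cap)) {
			best_rank = rank;
			best_cap = c[i].capacity;
			best = i;
		}
	}

	return best;
}
```

Given one fitting CPU whose sibling is busy and one non-fitting CPU on a fully
idle core, pick_best() selects the latter, which is the "always prefer
fully-idle cores" behavior described above.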

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 65 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bbdf537f61154..6a7e4943804b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	return idle_cpu;
 }
 
+/*
+ * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
+ * are more preferred (see select_idle_capacity()).
+ */
+enum asym_fits_state {
+	/* In descending order of preference */
+	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
+	ASYM_IDLE_CORE_COMPLETE_MISFIT,
+	ASYM_IDLE_THREAD_FITS,
+	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
+	ASYM_IDLE_COMPLETE_MISFIT,
+
+	/* util_fits_cpu() bias for an idle core. */
+	ASYM_IDLE_CORE_BIAS = -3,
+};
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
+	bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
 	unsigned long task_util, util_min, util_max, best_cap = 0;
-	int fits, best_fits = 0;
+	int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
 	for_each_cpu_wrap(cpu, cpus, target) {
+		bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
 		if (!choose_idle_cpu(cpu, p))
@@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
 
 		/* This CPU fits with all requirements */
-		if (fits > 0)
+		if (fits > 0 && preferred_core)
 			return cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 */
 		else if (fits < 0)
 			cpu_cap = get_actual_cpu_capacity(cpu);
+		/*
+		 * fits > 0 implies we are not on a preferred core
+		 * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
+		 * so the effective range becomes
+		 * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
+		 *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
+		 *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
+		 *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
+		 */
+		else if (fits > 0)
+			fits = ASYM_IDLE_THREAD_FITS;
+
+		/*
+		 * If we are on a preferred core, translate the range of fits
+		 * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
+		 * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
+		 * This ensures that an idle core is always given priority over
+		 * (partially) busy core.
+		 *
+		 * A fully fitting idle core would have returned early and hence
+		 * fits > 0 for preferred_core need not be dealt with.
+		 */
+		if (preferred_core)
+			fits += ASYM_IDLE_CORE_BIAS;
 
 		/*
-		 * First, select CPU which fits better (-1 being better than 0).
+		 * First, select CPU which fits better (lower is more preferred).
 		 * Then, select the one with best capacity at same level.
 		 */
 		if ((fits < best_fits) ||
@@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		}
 	}
 
+	/*
+	 * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
+	 * range means the chosen CPU is in a fully idle SMT core. Values above
+	 * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
+	 *
+	 * The asym-capacity wakeup path returns from select_idle_sibling()
+	 * after this function and never runs select_idle_cpu(), so the usual
+	 * select_idle_cpu() tail that clears idle cores must live here when the
+	 * idle-core preference did not win.
+	 */
+	if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
+		set_idle_cores(target, false);
+
 	return best_cpu;
 }
 
@@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
 				 unsigned long util_max,
 				 int cpu)
 {
-	if (sched_asym_cpucap_active())
+	if (sched_asym_cpucap_active()) {
 		/*
 		 * Return true only if the cpu fully fits the task requirements
 		 * which include the utilization and the performance hints.
+		 *
+		 * When SMT is active, also require that the core has no busy
+		 * siblings.
 		 */
-		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+		return (!sched_smt_active() || is_core_idle(cpu)) &&
+		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+	}
 
 	return true;
 }
-- 
2.54.0


* [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
                   ` (2 preceding siblings ...)
  2026-04-28 14:41 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-04-28 14:41 ` Andrea Righi
  2026-04-28 14:41 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
  2026-05-05 20:40 ` [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Dietmar Eggemann
  5 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade we
cannot actually provide.
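
As a rough userspace model of the resulting check (not the kernel code):
may_pull_misfit() and capacity_clearly_greater() are hypothetical names, and
the roughly 5% margin is modeled on the kernel's capacity_greater() helper and
should be read as an assumption of this sketch:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed margin: a must exceed b by roughly 5% to count as greater. */
static bool capacity_clearly_greater(unsigned long a, unsigned long b)
{
	return a * 1024 > b * 1078;
}

/*
 * Allow pulling a misfit task only if the destination sits on a fully
 * idle core (or SMT is inactive), offers clearly more capacity than the
 * source group, and the local group has spare capacity.
 */
static bool may_pull_misfit(bool smt_active, bool core_idle,
			    unsigned long dst_cap, unsigned long src_max_cap,
			    bool local_has_spare)
{
	bool dst_core_idle = !smt_active || core_idle;

	/* Capacities are assumed to be on the 0..1024 scale. */
	assert(dst_cap <= 1024 && src_max_cap <= 1024);

	return dst_core_idle &&
	       capacity_clearly_greater(dst_cap, src_max_cap) &&
	       local_has_spare;
}
```

The only new input compared to the pre-patch condition is core_idle: with SMT
active and a busy sibling, the pull is rejected regardless of how large
dst_cap looks on paper.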

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a7e4943804b5..a1f4d70f6b3d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9610,6 +9610,7 @@ struct lb_env {
 
 	int			dst_cpu;
 	struct rq		*dst_rq;
+	bool			dst_core_idle;
 
 	struct cpumask		*dst_grpmask;
 	int			new_dst_cpu;
@@ -10835,10 +10836,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * We can use max_capacity here as reduction in capacity on some
 	 * CPUs in the group should either be possible to resolve
 	 * internally or be covered by avg_load imbalance (eventually).
+	 *
+	 * When SMT is active, only pull a misfit to dst_cpu if it is on a
+	 * fully idle core; otherwise the effective capacity of the core is
+	 * reduced and we may not actually provide more capacity than the
+	 * source.
 	 */
 	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
 	    (sgs->group_type == group_misfit_task) &&
-	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+	    (!env->dst_core_idle ||
+	     !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
@@ -11402,6 +11409,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	unsigned long sum_util = 0;
 	bool sg_overloaded = 0, sg_overutilized = 0;
 
+	env->dst_core_idle = !sched_smt_active() || is_core_idle(env->dst_cpu);
+
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
-- 
2.54.0
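
The misfit-pull condition added by this patch can be sketched in userspace C as
follows. This is only an illustrative model, not the kernel code: struct
pull_ctx and misfit_pull_allowed() are hypothetical names, the fields stand in
for the lb_env / sg_lb_stats values named in the comments, and the ~5% margin
is assumed to match the capacity_greater() macro in kernel/sched/fair.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror the kernel's capacity_greater() margin (~5%). */
#define CAPACITY_GREATER(cap1, cap2) ((cap1) * 1024UL > (cap2) * 1078UL)

/* Hypothetical, simplified stand-in for the values the check reads. */
struct pull_ctx {
	bool smt_active;                /* models sched_smt_active() */
	bool core_idle;                 /* models is_core_idle(dst_cpu) */
	unsigned long dst_capacity;     /* models capacity_of(dst_cpu) */
	unsigned long src_max_capacity; /* models sg->sgc->max_capacity */
	bool local_has_spare;           /* local group is group_has_spare */
};

/* Returns true if a misfit task may be pulled to the destination CPU. */
static bool misfit_pull_allowed(const struct pull_ctx *c)
{
	bool dst_core_idle;

	assert(c != NULL);

	/* Without SMT every core counts as "fully idle" for this check. */
	dst_core_idle = !c->smt_active || c->core_idle;

	if (!dst_core_idle)
		return false;	/* busy sibling: capacity is overstated */
	if (!CAPACITY_GREATER(c->dst_capacity, c->src_max_capacity))
		return false;	/* no meaningful capacity upgrade */
	return c->local_has_spare;
}
```

In this model a 1024-capacity destination with a busy SMT sibling is rejected
even when the source group tops out at 498, while the same destination on a
fully idle core is accepted.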


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
                   ` (3 preceding siblings ...)
  2026-04-28 14:41 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-04-28 14:41 ` Andrea Righi
  2026-05-06 12:59   ` Vincent Guittot
  2026-05-05 20:40 ` [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Dietmar Eggemann
  5 siblings, 1 reply; 34+ messages in thread
From: Andrea Righi @ 2026-04-28 14:41 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

From: K Prateek Nayak <kprateek.nayak@amd.com>

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
mechanism already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
is enabled and the LLC domain has sched_domain_shared data, derive the
per-attempt scan limit from sd->shared->nr_idle_scan.

That bounds the walk on large LLCs and allows an early return once the
scan limit is reached, if we already picked a sufficiently strong
idle-core candidate (best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT).

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1f4d70f6b3d9..1cde3a9b1e0f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8018,6 +8018,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
+	int nr = INT_MAX;
 
 	cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	util_min = uclamp_eff_value(p, UCLAMP_MIN);
 	util_max = uclamp_eff_value(p, UCLAMP_MAX);
 
+	if (sched_feat(SIS_UTIL) && sd->shared) {
+		/*
+		 * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
+		 * the scan when not preferring an idle core.
+		 */
+		nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
+		/* overloaded domain is unlikely to have idle cpu/core */
+		if (nr == 1)
+			return -1;
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target) {
 		bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
 		unsigned long cpu_cap = capacity_of(cpu);
 
+		/*
+		 * Good-enough early exit (mirrors select_idle_cpu() logic).
+		 */
+		if (!prefers_idle_core &&
+		    --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
+			return best_cpu;
+
 		if (!choose_idle_cpu(cpu, p))
 			continue;
 
-- 
2.54.0
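
The scan-limit behaviour described in this patch can be modelled in userspace C
roughly as below. It is a sketch under simplifying assumptions: struct
cpu_state, its good_enough flag (standing in for a candidate that reaches
ASYM_IDLE_CORE_UCLAMP_MISFIT) and scan_idle() are hypothetical, and the real
fit ranking in select_idle_capacity() is richer than the two levels used here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct cpu_state {
	bool idle;        /* CPU is idle and allowed for the task */
	bool fits;        /* task fully fits: return immediately */
	bool good_enough; /* strong candidate, acceptable on early exit */
};

static int scan_idle(const struct cpu_state *cpus, int nr_cpus, int target,
		     bool sis_util, int nr_idle_scan, bool prefers_idle_core)
{
	int nr = INT_MAX;
	int best_cpu = -1;
	bool best_good = false;

	assert(nr_cpus > 0 && target >= 0 && target < nr_cpus);

	if (sis_util) {
		/* Same "+ 1" convention as the patch. */
		nr = nr_idle_scan + 1;
		/* A zero hint means the domain is likely overloaded. */
		if (nr == 1)
			return -1;
	}

	for (int i = 0; i < nr_cpus; i++) {
		int cpu = (target + i) % nr_cpus; /* wrap around from target */

		/*
		 * Budget exhausted and a strong candidate is already known:
		 * stop early. The budget is only consumed when no idle core
		 * is being preferred, as in the patch.
		 */
		if (!prefers_idle_core && --nr <= 0 && best_good)
			return best_cpu;

		if (!cpus[cpu].idle)
			continue;
		if (cpus[cpu].fits)
			return cpu;

		if (cpus[cpu].good_enough && !best_good) {
			best_cpu = cpu;
			best_good = true;
		} else if (best_cpu < 0) {
			best_cpu = cpu;
		}
	}

	return best_cpu;
}
```

With SIS_UTIL disabled, or while an idle core is preferred, the model walks the
whole span; otherwise it returns the strong candidate once the budget runs out.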


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-04-28 16:29   ` K Prateek Nayak
  2026-04-29 16:07     ` [PATCH v2 " Andrea Righi
  2026-05-05  9:15   ` [PATCH " Dietmar Eggemann
  1 sibling, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2026-04-28 16:29 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hello Andrea,

On 4/28/2026 8:11 PM, Andrea Righi wrote:
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.

nit. Perhaps a small note like below in case there is a follow-up:

set_cpu_sd_state_idle() is called from idle entry path after the IRQs
have been disabled making the rcu_dereference_all() check sufficient.

> 
> No functional change intended.
> 
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Thank you for cleaning these bits up. Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-04-28 16:29   ` K Prateek Nayak
@ 2026-04-29 16:07     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-04-29 16:07 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.

Note that the same applies to set_cpu_sd_state_idle(), which is called
from the idle entry path after the IRQs have been disabled, making the
rcu_dereference_all() check sufficient.

No functional change intended.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 40 ++++++++++++----------------------------
 1 file changed, 12 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69361c63353ad..e0f75dedc8456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12749,8 +12749,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12758,8 +12756,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12775,8 +12773,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12787,10 +12785,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12799,10 +12795,10 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
-	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
+	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
 	if (sds) {
 		/*
 		 * If there is an imbalance between LLC domains (IOW we could
@@ -12814,13 +12810,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12832,17 +12824,13 @@ static void nohz_balancer_kick(struct rq *rq)
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12861,17 +12849,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-04-28 16:29   ` K Prateek Nayak
@ 2026-05-05  9:15   ` Dietmar Eggemann
  2026-05-05  9:22     ` Andrea Righi
  1 sibling, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-05  9:15 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On 28.04.26 16:41, Andrea Righi wrote:

[...]

> @@ -12799,10 +12795,10 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 *
>  		 * Skip the LLC logic because it's not relevant in that case.
>  		 */
> -		goto unlock;
> +		goto out;
>  	}
>  
> -	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));

nit: sd_balance_shared is only defined in 2/5.

[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-05  9:15   ` [PATCH " Dietmar Eggemann
@ 2026-05-05  9:22     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-05  9:22 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hi Dietmar,

On Tue, May 05, 2026 at 11:15:12AM +0200, Dietmar Eggemann wrote:
> On 28.04.26 16:41, Andrea Righi wrote:
> 
> [...]
> 
> > @@ -12799,10 +12795,10 @@ static void nohz_balancer_kick(struct rq *rq)
> >  		 *
> >  		 * Skip the LLC logic because it's not relevant in that case.
> >  		 */
> > -		goto unlock;
> > +		goto out;
> >  	}
> >  
> > -	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> > +	sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
> 
> nit: sd_balance_shared is only defined in 2/5.

Ah, good catch! Apparently I forgot to test-build each individual patch; I'll
fix this.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-04-28 14:41 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
@ 2026-05-05 12:48   ` Dietmar Eggemann
  2026-05-06  9:45   ` Vincent Guittot
  1 sibling, 0 replies; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-05 12:48 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On 28.04.26 16:41, Andrea Righi wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
> 
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
> 
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
> 
> Fall back to attaching the shared object to sd_llc in three cases:
> 
>   1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
> 
>   2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>      island: has_asym is system-wide but those CPUs have no
>      SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>      the symmetric LLC path in select_idle_sibling();
> 
>   3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>      SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>      blob off cpumask_first(span), which on overlapping NUMA domains
>      would alias unrelated spans onto the same blob. Keep the shared
>      object on the LLC there; select_idle_capacity() gracefully skips
>      the has_idle_cores preference when sd->shared is NULL.

Tested it with a couple of real & exotic topologies; seems to work nicely.

$ cat /sys/devices/system/cpu/cpu*/cpu_capacity
160
160
160
160
498
498
1024
1024

(1) grouping CPUs with same CPU capacities

$ cat /sys/kernel/debug/sched/domains/cpu[0-7]/domain*/name
MC
PKG

$ cat /sys/kernel/debug/sched/domains/cpu[0-7]/domain*/flags
... SD_SHARE_LLC
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL ...

  PKG  {      0-7      }
  MC   {0-3} {4,5} {6,7}

(2) flat

$ cat /sys/kernel/debug/sched/domains/cpu[0-7]/domain*/name
MC

$ cat /sys/kernel/debug/sched/domains/cpu[0-7]/domain*/flags
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL ...

  MC  {      0-7      }

(3) flat, exotic, since w/ SMT

$ cat /sys/kernel/debug/sched/domains/cpu[0-7]/domain*/name
SMT
MC

... SD_SHARE_CPUCAPACITY SD_SHARE_LLC ...
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SHARE_LLC ...

  MC   {         0-7         }
  SMT  {0-1} {2-3} {4-5} {6-7}


(4) exotic, since asymmetric and w/ SMT

$ cat /sys/kernel/debug/sched/domains/cpu[0-3]/domain*/name
SMT
MC
PKG

$ cat /sys/kernel/debug/sched/domains/cpu[0-3]/domain*/flags

... SD_SHARE_CPUCAPACITY SD_SHARE_LLC ...
... SD_SHARE_LLC
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL ...


$ cat /sys/kernel/debug/sched/domains/cpu[4-7]/domain*/name
SMT
PKG

$ cat /sys/kernel/debug/sched/domains/cpu[4-7]/domain*/flags
... SD_SHARE_CPUCAPACITY SD_SHARE_LLC ...
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL ...


  PKG  {         0-7         }
  MC   {   0-3   }
  SMT  {0-1} {2-3} {4-5} {6-7}

(5) same as (4) but partial CPU capacity asymmetry in MC { 0-3 }

$ cat /sys/devices/system/cpu/cpu*/cpu_capacity
160
160
498
498
160
160
1024
1024

$ cat /sys/kernel/debug/sched/domains/cpu[0-3]/domain*/flags

... SD_SHARE_CPUCAPACITY SD_SHARE_LLC ...
... SD_ASYM_CPUCAPACITY SD_SHARE_LLC ...
    ^^^^^^^^^^^^^^^^^^^
... SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL ...

(6) (5) w/ exclusive cpusets with one symmetric island

cd /sys/fs/cgroup
echo +cpuset > cgroup.subtree_control
mkdir cs1
echo "threaded" > cs1/cgroup.type
echo 0-1,4-5 > cs1/cpuset.cpus
echo 0 > cs1/cpuset.mems
echo root > cs1/cpuset.cpus.partition
mkdir cs2
echo "threaded" > cs2/cgroup.type
echo 0 > cs2/cpuset.mems
echo 2-3,6-7 > cs2/cpuset.cpus
echo root > cs2/cpuset.cpus.partition

[    0.006866] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=0
[    0.006868] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=1
[    0.006869] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=2
[    0.006869] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=3
[    0.006869] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=4
[    0.006869] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=5
[    0.006870] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=6
[    0.006870] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=7
...
[  222.767275] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=2
[  222.767324] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=3
[  222.767710] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=6
[  222.767789] claim_asym_sched_domain_shared() (2) sd_asym=PKG cpu=7
[  222.781015] build_sched_domains() (3) sd=MC cpu=0
[  222.781017] build_sched_domains() (3) sd=MC cpu=1
[  222.781017] build_sched_domains() (3) sd=MC cpu=4
[  222.781018] build_sched_domains() (3) sd=MC cpu=5

[...]

> @@ -2650,6 +2665,49 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>  	}
>  }
>  
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> +	int sd_id = cpumask_first(sched_domain_span(sd));
> +
> +	sd->shared = *per_cpu_ptr(d->sds, sd_id);
> +	atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);

Will be used only for sd_llc->shared, not for sd_asym, right?

> +	atomic_inc(&sd->shared->ref);
> +}
> +
> +/*
> + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
> + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
> + * not an overlapping NUMA-built domain (then LLC should claim shared).
> + *
> + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
> + * then LLC must claim shared instead.
> + *
> + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct capacities

s/multiple/all ? We want to see all possible CPU capacity values in wakeup.

> + * exist in the domain span, so the asym domain we attach to cannot degenerate
> + * into a single-capacity group. The relevant edge cases are instead covered by
> + * the caveats above.
[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-04-28 14:41 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-05-05 17:20   ` Dietmar Eggemann
  2026-05-06 18:31     ` Andrea Righi
  2026-05-06 10:29   ` Vincent Guittot
  1 sibling, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-05 17:20 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On 28.04.26 16:41, Andrea Righi wrote:
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses

I assume those CPPC systems w/ different per-core frequencies (like your
Vera) are the only real ones that would make use of this. Mobile
big.LITTLE/DynamIQ systems don't have SMT.

Phil mentioned other machines (PowerPC ?) which had issues with using
select_idle_capacity():

https://lore.kernel.org/r/20260325124840.GA98184@pauld.westford.csb

[...]

> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.

Just to make sure: this should be your internal NVBLAS benchmark, right? Is
this 'ASYM (mainline) vs. ASYM + SMT' or 'NO_ASYM vs. ASYM + SMT'? I'm
trying to match the cover letter's table numbers.

[...]

> @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> +	bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);

nit: why prefers_idle_core and not has_idle_core like in sis()?

[...]

> @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
>  				 unsigned long util_max,
>  				 int cpu)
>  {
> -	if (sched_asym_cpucap_active())
> +	if (sched_asym_cpucap_active()) {
>  		/*
>  		 * Return true only if the cpu fully fits the task requirements
>  		 * which include the utilization and the performance hints.
> +		 *
> +		 * When SMT is active, also require that the core has no busy
> +		 * siblings.
>  		 */
> -		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +		return (!sched_smt_active() || is_core_idle(cpu)) &&
> +		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +	}

Not sure whether this has been discussed already. This makes all early
bailout conditions in sis() idle core aware for 'ASYM + SMT' but it's
not for 'NO_ASYM'?

Otherwise, LGTM.
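
For reference, the idle-core condition in the quoted asym_fits_cpu() hunk can
be modelled in userspace C as below. This is a sketch under assumptions: the
fixed two-thread layout with adjacent siblings, the cpu_is_idle[] array and the
function names are hypothetical, and util_fits stands in for the result of
util_fits_cpu(util, util_min, util_max, cpu) > 0.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   8
#define SMT_WIDTH 2 /* hypothetical: two adjacent threads per core */

/* Hypothetical idle state per logical CPU. */
static bool cpu_is_idle[NR_CPUS];

/* True if every SMT sibling of @cpu, other than @cpu itself, is idle. */
static bool core_is_idle(int cpu)
{
	int first;

	assert(cpu >= 0 && cpu < NR_CPUS);

	first = cpu - (cpu % SMT_WIDTH);
	for (int s = first; s < first + SMT_WIDTH; s++) {
		if (s == cpu)
			continue;
		if (!cpu_is_idle[s])
			return false;
	}
	return true;
}

/* Model of the quoted check: fits only on a core without busy siblings. */
static bool asym_fits(bool smt_active, bool util_fits, int cpu)
{
	return (!smt_active || core_is_idle(cpu)) && util_fits;
}
```

In this model a CPU whose sibling is busy never "fits" while SMT is active,
regardless of how well the utilization matches its nominal capacity.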


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity
  2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
                   ` (4 preceding siblings ...)
  2026-04-28 14:41 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-05-05 20:40 ` Dietmar Eggemann
  5 siblings, 0 replies; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-05 20:40 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On 28.04.26 16:41, Andrea Righi wrote:


[...]

>  - DCPerf MediaWiki (all CPUs):
> 
>  +---------------------------------+--------+--------+--------+--------+
>  | Configuration                   |   rps  |  p50   |  p95   |  p99   |
>  +---------------------------------+--------+--------+--------+--------+
>  | ASYM (mainline) + SIS_UTIL      |  7994  |  0.052 |  0.223 |  0.246 |
>  | ASYM (mainline) + NO_SIS_UTIL   |  7993  |  0.052 |  0.221 |  0.245 |
>  |                                 |        |        |        |        |
>  | NO ASYM + SIS_UTIL              |  8113  |  0.067 |  0.184 |  0.225 |
>  | NO ASYM + NO_SIS_UTIL           |  8093  |  0.068 |  0.184 |  0.223 |
>  |                                 |        |        |        |        |
>  | ASYM + SMT + SIS_UTIL           |  8129  |  0.076 |  0.149 |  0.188 |
>  | ASYM + SMT + NO_SIS_UTIL        |  8138  |  0.076 |  0.148 |  0.186 |
>  +---------------------------------+--------+--------+--------+--------+
> 
> In the MediaWiki case SMT awareness is less impactful, because for the majority
> of the run all CPUs are used, but it still seems to provide some benefits at
> reducing tail latency.
> 
> Tests have also been conducted on NVIDIA Grace (which does not support SMT) to
> ensure that SIS_UTIL support in select_idle_capacity() does not introduce
> regressions and results show slight improvements under the same workloads.

Somewhat unrelated to this SMT extension, but I always wanted to know why
even with !smt (e.g. Grace) we can see better values w/ ASYM.

DCPerf Mediawiki: Grace 72 CPUs, ~800 tasks (last test run):
+---------------------------------+--------+--------+--------+--------+
| Configuration                   |   rps  |  p50   |  p95   |  p99   |
+---------------------------------+--------+--------+--------+--------+
| v6.8 NO ASYM                    |  4470  |  0.026 |  0.040 |  0.046 |
| v6.8 ASYM                       |  4636  |  0.022 |  0.037 |  0.043 |
+---------------------------------+--------+--------+--------+--------+
values from run_details.json: Wrk RPS, Nginx P50 {, P90, P95, P99} time

I always got 4%-5% higher rps and slightly better latencies w/ ASYM.

Possible explanation:

NO_ASYM

 * More local wakeups
 * sis()->select_idle_cpu() runs pretty fast into SIS_UTIL !nr_idle_scan
   -> falls back to pick this_cpu or prev_cpu
 * Causes more runqueue contention → more load balancing
 * More short idle periods + migrations

ASYM

 * More remote wakeups
 * select_idle_capacity() always scans sd_asym
 * Less balancing needed; CPUs go idle less often but for longer
 * Better placement -> less contention -> higher rps

AFAICS, in this high-load scenario, ASYM avoids the !nr_idle_scan
bailout, spreading tasks more effectively and so reducing contention and
balancing overhead.

Do you have a chance to check this on mainline on your Grace machine?

[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-04-28 14:41 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
  2026-05-05 12:48   ` Dietmar Eggemann
@ 2026-05-06  9:45   ` Vincent Guittot
  2026-05-06 10:19     ` K Prateek Nayak
  1 sibling, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2026-05-06  9:45 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> On asymmetric CPU capacity systems, the wakeup path uses
> select_idle_capacity(), which scans the span of sd_asym_cpucapacity
> rather than sd_llc.
>
> The has_idle_cores hint however lives on sd_llc->shared, so the
> wakeup-time read of has_idle_cores operates on an LLC-scoped blob while
> the actual scan/decision spans the asym domain; nr_busy_cpus also lives
> in the same shared sched_domain data, but it's never used in the asym
> CPU capacity scenario.
>
> Therefore, move the sched_domain_shared object to sd_asym_cpucapacity
> whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that
> ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case
> the scope of has_idle_cores matches the scope of the wakeup scan.
>
> Fall back to attaching the shared object to sd_llc in three cases:
>
>   1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);
>
>   2) CPUs in an exclusive cpuset that carves out a symmetric capacity
>      island: has_asym is system-wide but those CPUs have no
>      SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow
>      the symmetric LLC path in select_idle_sibling();
>
>   3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an
>      SD_NUMA-built domain. init_sched_domain_shared() keys the shared
>      blob off cpumask_first(span), which on overlapping NUMA domains
>      would alias unrelated spans onto the same blob. Keep the shared
>      object on the LLC there; select_idle_capacity() gracefully skips
>      the has_idle_cores preference when sd->shared is NULL.
>
> While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared,
> as it is no longer strictly tied to the LLC.
>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c     | 17 +++++---
>  kernel/sched/sched.h    |  2 +-
>  kernel/sched/topology.c | 90 +++++++++++++++++++++++++++++++++++------
>  3 files changed, 89 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e0f75dedc8456..bbdf537f61154 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7790,7 +7790,7 @@ static inline void set_idle_cores(int cpu, int val)
>  {
>         struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +       sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>         if (sds)
>                 WRITE_ONCE(sds->has_idle_cores, val);
>  }
> @@ -7799,7 +7799,7 @@ static inline bool test_idle_cores(int cpu)
>  {
>         struct sched_domain_shared *sds;
>
> -       sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> +       sds = rcu_dereference_all(per_cpu(sd_balance_shared, cpu));
>         if (sds)
>                 return READ_ONCE(sds->has_idle_cores);
>
> @@ -7808,7 +7808,7 @@ static inline bool test_idle_cores(int cpu)
>
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
> - * information in sd_llc_shared->has_idle_cores.
> + * information in sd_balance_shared->has_idle_cores.
>   *
>   * Since SMT siblings share all cache levels, inspecting this limited remote
>   * state should be fairly cheap.
> @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> -       if (sched_feat(SIS_UTIL)) {
> +       if (sched_feat(SIS_UTIL) && sd->shared) {

If shared is attached to sd_asym_cpucapacity instead of sd_llc, we
should never reach this point. Or am I missing a case?

>                 /*
>                  * Increment because !--nr is the condition to stop scan.
>                  *
> @@ -12826,7 +12826,11 @@ static void set_cpu_sd_state_busy(int cpu)
>         struct sched_domain *sd;
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -       if (!sd || !sd->nohz_idle)
> +       /*
> +        * sd->nohz_idle only pairs with nr_busy_cpus on sd->shared; if this
> +        * domain has no shared object there is nothing to clear or account.
> +        */
> +       if (!sd || !sd->shared || !sd->nohz_idle)
>                 return;
>         sd->nohz_idle = 0;
>
> @@ -12851,7 +12855,8 @@ static void set_cpu_sd_state_idle(int cpu)
>         struct sched_domain *sd;
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
> -       if (!sd || sd->nohz_idle)
> +       /* See set_cpu_sd_state_busy(): nohz_idle is only used with sd->shared. */
> +       if (!sd || !sd->shared || sd->nohz_idle)
>                 return;
>         sd->nohz_idle = 1;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9f63b15d309d1..330f5893c4561 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2170,7 +2170,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DECLARE_PER_CPU(int, sd_llc_size);
>  DECLARE_PER_CPU(int, sd_llc_id);
>  DECLARE_PER_CPU(int, sd_share_id);
> -DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d552..69d465cc93ab4 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -665,7 +665,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(int, sd_share_id);
> -DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_balance_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -680,20 +680,38 @@ static void update_top_cache_domain(int cpu)
>         int id = cpu;
>         int size = 1;
>
> +       sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> +       /*
> +        * The shared object is attached to sd_asym_cpucapacity only when the
> +        * asym domain is non-overlapping (i.e., not built from SD_NUMA).
> +        * On overlapping (NUMA) asym domains we fall back to letting the
> +        * SD_SHARE_LLC path own the shared object, so sd->shared may be NULL
> +        * here.
> +        */
> +       if (sd && sd->shared)
> +               sds = sd->shared;
> +
> +       rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
>         sd = highest_flag_domain(cpu, SD_SHARE_LLC);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>
> -               /* If sd_llc exists, sd_llc_shared should exist too. */
> -               WARN_ON_ONCE(!sd->shared);
> -               sds = sd->shared;
> +               /*
> +                * If sd_asym_cpucapacity didn't claim the shared object,
> +                * sd_llc must have one linked.
> +                */
> +               if (!sds) {
> +                       WARN_ON_ONCE(!sd->shared);
> +                       sds = sd->shared;
> +               }
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>         per_cpu(sd_llc_id, cpu) = id;
> -       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> +       rcu_assign_pointer(per_cpu(sd_balance_shared, cpu), sds);
>
>         sd = lowest_flag_domain(cpu, SD_CLUSTER);
>         if (sd)
> @@ -711,9 +729,6 @@ static void update_top_cache_domain(int cpu)
>
>         sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>         rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> -       sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> -       rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
>  }
>
>  /*
> @@ -2650,6 +2665,49 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
>         }
>  }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> +       int sd_id = cpumask_first(sched_domain_span(sd));
> +
> +       sd->shared = *per_cpu_ptr(d->sds, sd_id);
> +       atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> +       atomic_inc(&sd->shared->ref);
> +}
> +
> +/*
> + * For asymmetric CPU capacity, attach sched_domain_shared on the innermost
> + * SD_ASYM_CPUCAPACITY_FULL ancestor of @cpu's base domain when that ancestor is
> + * not an overlapping NUMA-built domain (then LLC should claim shared).
> + *
> + * A CPU may lack any FULL ancestor (e.g., exclusive cpuset symmetric island),
> + * then LLC must claim shared instead.
> + *
> + * Note: SD_ASYM_CPUCAPACITY_FULL is only set when multiple distinct capacities
> + * exist in the domain span, so the asym domain we attach to cannot degenerate
> + * into a single-capacity group. The relevant edge cases are instead covered by
> + * the caveats above.
> + *
> + * Return true if this CPU's asym path claimed sd->shared, false otherwise.
> + */
> +static bool claim_asym_sched_domain_shared(struct s_data *d, int cpu)
> +{
> +       struct sched_domain *sd = *per_cpu_ptr(d->sd, cpu);
> +       struct sched_domain *sd_asym;
> +
> +       if (!sd)
> +               return false;
> +
> +       sd_asym = sd;
> +       while (sd_asym && !(sd_asym->flags & SD_ASYM_CPUCAPACITY_FULL))
> +               sd_asym = sd_asym->parent;
> +
> +       if (!sd_asym || (sd_asym->flags & SD_NUMA))
> +               return false;
> +
> +       init_sched_domain_shared(d, sd_asym);
> +       return true;
> +}
> +
>  /*
>   * Build sched domains for a given set of CPUs and attach the sched domains
>   * to the individual CPUs
> @@ -2708,20 +2766,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>         }
>
>         for_each_cpu(i, cpu_map) {
> +               bool asym_claimed = false;
> +
>                 sd = *per_cpu_ptr(d.sd, i);
>                 if (!sd)
>                         continue;
>
> +               if (has_asym)
> +                       asym_claimed = claim_asym_sched_domain_shared(&d, i);
> +
>                 /* First, find the topmost SD_SHARE_LLC domain */
>                 while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
>                         sd = sd->parent;
>
>                 if (sd->flags & SD_SHARE_LLC) {
> -                       int sd_id = cpumask_first(sched_domain_span(sd));
> -
> -                       sd->shared = *per_cpu_ptr(d.sds, sd_id);
> -                       atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> -                       atomic_inc(&sd->shared->ref);
> +                       /*
> +                        * Initialize the sd->shared for SD_SHARE_LLC unless
> +                        * the asym path above already claimed it.
> +                        */
> +                       if (!asym_claimed)
> +                               init_sched_domain_shared(&d, sd);
>
>                         /*
>                          * In presence of higher domains, adjust the
> --
> 2.54.0
>

* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-06  9:45   ` Vincent Guittot
@ 2026-05-06 10:19     ` K Prateek Nayak
  2026-05-06 10:30       ` Vincent Guittot
  0 siblings, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2026-05-06 10:19 UTC (permalink / raw)
  To: Vincent Guittot, Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
	Joel Fernandes, Shrikanth Hegde, linux-kernel

Hello Vincent,

On 5/6/2026 3:15 PM, Vincent Guittot wrote:
>> @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
>>
>> -       if (sched_feat(SIS_UTIL)) {
>> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> 
> If shared is attached to sd_asym_cpucapacity instead of sd_llc we
> should never reach this point. Or I'm missing a case ?

So the hotplug might race with a wakeup like:

  claim_asym_sched_domain_shared()
    init_sched_domain_shared(d, sd_asym);
    return true;
  update_top_cache_domain()
    rcu_assign_pointer(sd_llc, sd);
    ...                                              select_idle_sibling()
                                                       sd = rcu_dereference_all(sd_asym_cpucapacity)
                                                       /* sd_asym_cpucapacity still hasn't been updated */
                                                       if (sd /* NULL */) { ... }
                                                       sd = rcu_dereference_all(sd_llc); /* Valid */
                                                       select_idle_cpu(sd)
    rcu_assign_pointer(sd_asym_cpucapacity, sd);         sd->shared /* NULL */


This prevents that rare race where a remote CPU will see sd_llc
before sd_asym is published and take the !ASYM wakeup route only
to find sd->shared is NULL since sd_asym has claimed it.
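
The window can be modelled in userspace. The sketch below is not kernel
code: struct sd_model, struct sd_shared_model and scan_budget() are
hypothetical stand-ins for struct sched_domain, struct sched_domain_shared
and the start of select_idle_cpu(), and the sis_util argument stands in
for sched_feat(SIS_UTIL). It only illustrates what the added sd->shared
check does for a waker that got hold of an sd_llc whose shared object was
claimed by the asym domain.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sched_domain_shared. */
struct sd_shared_model {
	int nr_idle_scan;
};

/* Hypothetical stand-in for struct sched_domain. */
struct sd_model {
	struct sd_shared_model *shared;	/* NULL when the asym domain claimed it */
};

/*
 * Model of the scan-budget setup at the top of select_idle_cpu():
 * without a shared object there is no nr_idle_scan hint, so the scan
 * stays unbounded (INT_MAX) instead of dereferencing a NULL pointer.
 */
static int scan_budget(const struct sd_model *sd, bool sis_util)
{
	int nr = INT_MAX;

	if (sis_util && sd->shared)
		nr = sd->shared->nr_idle_scan + 1;

	return nr;
}
```

In this model the racing waker simply falls back to an unbounded scan for
that one wakeup, which is the behaviour when SIS_UTIL is disabled.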

> 
>>                 /*
>>                  * Increment because !--nr is the condition to stop scan.
>>                  *

-- 
Thanks and Regards,
Prateek


* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-04-28 14:41 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
  2026-05-05 17:20   ` Dietmar Eggemann
@ 2026-05-06 10:29   ` Vincent Guittot
  2026-05-06 12:34     ` Vincent Guittot
  2026-05-06 18:15     ` Andrea Righi
  1 sibling, 2 replies; 34+ messages in thread
From: Vincent Guittot @ 2026-05-06 10:29 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bbdf537f61154..6a7e4943804b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         return idle_cpu;
>  }
>
> +/*
> + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> + * are more preferred (see select_idle_capacity()).
> + */
> +enum asym_fits_state {
> +       /* In descending order of preference */
> +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> +       ASYM_IDLE_CORE_COMPLETE_MISFIT,
> +       ASYM_IDLE_THREAD_FITS,
> +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> +       ASYM_IDLE_COMPLETE_MISFIT,
> +
> +       /* util_fits_cpu() bias for an idle core. */
> +       ASYM_IDLE_CORE_BIAS = -3,
> +};
> +
>  /*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> +       bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
>         unsigned long task_util, util_min, util_max, best_cap = 0;
> -       int fits, best_fits = 0;
> +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
>         int cpu, best_cpu = -1;
>         struct cpumask *cpus;
>
> @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
>         for_each_cpu_wrap(cpu, cpus, target) {
> +               bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
>                 unsigned long cpu_cap = capacity_of(cpu);
>
>                 if (!choose_idle_cpu(cpu, p))
> @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
>                 /* This CPU fits with all requirements */
> -               if (fits > 0)
> +               if (fits > 0 && preferred_core)
>                         return cpu;
>                 /*
>                  * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                  */
>                 else if (fits < 0)
>                         cpu_cap = get_actual_cpu_capacity(cpu);
> +               /*
> +                * fits > 0 implies we are not on a preferred core
> +                * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> +                * so the effective range becomes
> +                * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> +                *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
> +                *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> +                *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> +                */
> +               else if (fits > 0)
> +                       fits = ASYM_IDLE_THREAD_FITS;
> +
> +               /*
> +                * If we are on a preferred core, translate the range of fits
> +                * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> +                * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> +                * This ensures that an idle core is always given priority over
> +                * (partially) busy core.
> +                *
> +                * A fully fitting idle core would have returned early and hence
> +                * fits > 0 for preferred_core need not be dealt with.
> +                */
> +               if (preferred_core)
> +                       fits += ASYM_IDLE_CORE_BIAS;

It might be good to add a comment stating that if the system doesn't
have SMT, prefers_idle_core and preferred_core are always true.

This is okay because CPU == Core in this case but the value differs
from the default 0 or -1 of util_fits_cpu

>
>                 /*
> -                * First, select CPU which fits better (-1 being better than 0).
> +                * First, select CPU which fits better (lower is more preferred).
>                  * Then, select the one with best capacity at same level.
>                  */
>                 if ((fits < best_fits) ||
> @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>                 }
>         }
>
> +       /*
> +        * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

ASYM_IDLE_CORE_BIAS is an offset that moves an idle core that doesn't
fully fit into the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
ASYM_IDLE_CORE_COMPLETE_MISFIT].

Keep in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT.
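
As a rough illustration of why the substitution is behaviour-neutral, here
is a small userspace sketch. The enum is copied from the quoted hunk;
idle_core_pref_lost() is a hypothetical helper that models the tail check
with the suggested bound, not a function from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the quoted patch; lower is more preferred. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,		/* -3 */
	ASYM_IDLE_THREAD_FITS,			/* -2 */
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,		/* -1 */
	ASYM_IDLE_COMPLETE_MISFIT,		/*  0 */

	/* Offset applied to util_fits_cpu() results for an idle core. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* The offset and the upper bound of the idle-core range share a value. */
static_assert(ASYM_IDLE_CORE_BIAS == ASYM_IDLE_CORE_COMPLETE_MISFIT,
	      "bias and idle-core upper bound are both -3");

/*
 * Hypothetical model of the tail check: true when an idle core was
 * wanted but the best candidate ranked outside the idle-core range.
 */
static bool idle_core_pref_lost(bool prefers_idle_core, int best_fits)
{
	return prefers_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT;
}
```

The comparison reads as a range check once it names the range's upper
bound rather than the offset, even though the generated code is the same.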

> +        * range means the chosen CPU is in a fully idle SMT core. Values above
> +        * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

> +        *
> +        * The asym-capacity wakeup path returns from select_idle_sibling()
> +        * after this function and never runs select_idle_cpu(), so the usual
> +        * select_idle_cpu() tail that clears idle cores must live here when the
> +        * idle-core preference did not win.
> +        */
> +       if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

> +               set_idle_cores(target, false);
> +
>         return best_cpu;
>  }
>
> @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
>                                  unsigned long util_max,
>                                  int cpu)
>  {
> -       if (sched_asym_cpucap_active())
> +       if (sched_asym_cpucap_active()) {
>                 /*
>                  * Return true only if the cpu fully fits the task requirements
>                  * which include the utilization and the performance hints.
> +                *
> +                * When SMT is active, also require that the core has no busy
> +                * siblings.
>                  */
> -               return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +               return (!sched_smt_active() || is_core_idle(cpu)) &&
> +                      (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> +       }
>
>         return true;
>  }
> --
> 2.54.0
>

* Re: [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity
  2026-05-06 10:19     ` K Prateek Nayak
@ 2026-05-06 10:30       ` Vincent Guittot
  0 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2026-05-06 10:30 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Wed, 6 May 2026 at 12:20, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 5/6/2026 3:15 PM, Vincent Guittot wrote:
> >> @@ -7925,7 +7925,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> >>         int i, cpu, idle_cpu = -1, nr = INT_MAX;
> >>
> >> -       if (sched_feat(SIS_UTIL)) {
> >> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> >
> > If shared is attached to sd_asym_cpucapacity instead of sd_llc we
> > should never reach this point. Or I'm missing a case ?
>
> So the hotplug might race with a wakeup like:
>
>   claim_asym_sched_domain_shared()
>     init_sched_domain_shared(d, sd_asym);
>     return true;
>   update_top_cache_domain()
>     rcu_assign_pointer(sd_llc, sd);
>     ...                                              select_idle_sibling()
>                                                        sd = rcu_dereference_all(sd_asym_cpucapacity)
>                                                        /* sd_asym_cpucapacity still hasn't been updated */
>                                                        if (sd /* NULL */) { ... }
>                                                        sd = rcu_dereference_all(sd_llc); /* Valid */
>                                                        select_idle_cpu(sd)
>     rcu_assign_pointer(sd_asym_cpucapacity, sd);         sd->shared /* NULL */
>
>
> This prevents that rare race where a remote CPU will see sd_llc
> before sd_asym is published and take the !ASYM wakeup route only
> to find sd->shared is NULL since sd_asym has claimed it.

Fair enough.

>
> >
> >>                 /*
> >>                  * Increment because !--nr is the condition to stop scan.
> >>                  *
>
> --
> Thanks and Regards,
> Prateek
>

* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-06 10:29   ` Vincent Guittot
@ 2026-05-06 12:34     ` Vincent Guittot
  2026-05-06 18:15     ` Andrea Righi
  1 sibling, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2026-05-06 12:34 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Wed, 6 May 2026 at 12:29, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active, always prefer fully-idle SMT cores over partially-idle
> > ones.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running an amount of tasks equal to the amount of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bbdf537f61154..6a7e4943804b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         return idle_cpu;
> >  }
> >
> > +/*
> > + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> > + * are more preferred (see select_idle_capacity()).
> > + */
> > +enum asym_fits_state {
> > +       /* In descending order of preference */
> > +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> > +       ASYM_IDLE_CORE_COMPLETE_MISFIT,
> > +       ASYM_IDLE_THREAD_FITS,
> > +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > +       ASYM_IDLE_COMPLETE_MISFIT,
> > +
> > +       /* util_fits_cpu() bias for an idle core. */
> > +       ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> > +       bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> >         unsigned long task_util, util_min, util_max, best_cap = 0;
> > -       int fits, best_fits = 0;
> > +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> >         int cpu, best_cpu = -1;
> >         struct cpumask *cpus;
> >
> > @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> >         for_each_cpu_wrap(cpu, cpus, target) {
> > +               bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >                 unsigned long cpu_cap = capacity_of(cpu);
> >
> >                 if (!choose_idle_cpu(cpu, p))
> > @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> >                 /* This CPU fits with all requirements */
> > -               if (fits > 0)
> > +               if (fits > 0 && preferred_core)
> >                         return cpu;
> >                 /*
> >                  * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                  */
> >                 else if (fits < 0)
> >                         cpu_cap = get_actual_cpu_capacity(cpu);
> > +               /*
> > +                * fits > 0 implies we are not on a preferred core
> > +                * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> > +                * so the effective range becomes
> > +                * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> > +                *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
> > +                *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> > +                *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> > +                */
> > +               else if (fits > 0)
> > +                       fits = ASYM_IDLE_THREAD_FITS;
> > +
> > +               /*
> > +                * If we are on a preferred core, translate the range of fits
> > +                * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> > +                * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> > +                * This ensures that an idle core is always given priority over
> > +                * (partially) busy core.
> > +                *
> > +                * A fully fitting idle core would have returned early and hence
> > +                * fits > 0 for preferred_core need not be dealt with.
> > +                */
> > +               if (preferred_core)
> > +                       fits += ASYM_IDLE_CORE_BIAS;
>
> It might be good to add a comment stating that if the system doesn't
> have SMT, prefers_idle_core and preferred_core are always true.

I meant prefers_idle_core is always false and preferred_core is always true.
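
To make the non-SMT case concrete, here is a minimal userspace sketch of
the ranking step in the quoted hunk. The enum values are copied from the
patch; rank_fits() and RANK_RETURN_EARLY are hypothetical names used only
for this illustration, and raw_fits stands in for the util_fits_cpu()
result (positive: fits, 0: does not fit, -1: only uclamp_min does not
fit).

```c
#include <stdbool.h>

/* Values copied from the quoted patch; lower is more preferred. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,		/* -3 */
	ASYM_IDLE_THREAD_FITS,			/* -2 */
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,		/* -1 */
	ASYM_IDLE_COMPLETE_MISFIT,		/*  0 */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Hypothetical marker for "the scan would return this CPU right away". */
#define RANK_RETURN_EARLY 1

/*
 * Hypothetical model of the per-CPU ranking in select_idle_capacity().
 * prefers_idle_core models sched_smt_active() && test_idle_cores(target),
 * core_idle models is_core_idle(cpu).
 */
static int rank_fits(int raw_fits, bool prefers_idle_core, bool core_idle)
{
	bool preferred_core = !prefers_idle_core || core_idle;
	int fits = raw_fits;

	if (fits > 0 && preferred_core)
		return RANK_RETURN_EARLY;

	/* Fits, but only on a thread whose sibling is busy. */
	if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;

	/* Shift misfits on a preferred core into the idle-core range. */
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;

	return fits;
}
```

Without SMT, prefers_idle_core is false, so preferred_core is true for
every CPU and the bias is always applied: in this model a uclamp_min
misfit ranks as -4 and a complete misfit as -3, rather than the -1 and 0
that util_fits_cpu() returns.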

>
> This is okay because CPU == Core in this case but the value differs
> from the default 0 or -1 of util_fits_cpu
>
> >
> >                 /*
> > -                * First, select CPU which fits better (-1 being better than 0).
> > +                * First, select CPU which fits better (lower is more preferred).
> >                  * Then, select the one with best capacity at same level.
> >                  */
> >                 if ((fits < best_fits) ||
> > @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 }
> >         }
> >
> > +       /*
> > +        * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
>
> ASYM_IDLE_CORE_BIAS is an offset that moves an idle core that doesn't
> fully fit into the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
> ASYM_IDLE_CORE_COMPLETE_MISFIT].
>
> Keep in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT.
>
> > +        * range means the chosen CPU is in a fully idle SMT core. Values above
> > +        * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
>
> > +        *
> > +        * The asym-capacity wakeup path returns from select_idle_sibling()
> > +        * after this function and never runs select_idle_cpu(), so the usual
> > +        * select_idle_cpu() tail that clears idle cores must live here when the
> > +        * idle-core preference did not win.
> > +        */
> > +       if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
>
> > +               set_idle_cores(target, false);
> > +
> >         return best_cpu;
> >  }
> >
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> >                                  unsigned long util_max,
> >                                  int cpu)
> >  {
> > -       if (sched_asym_cpucap_active())
> > +       if (sched_asym_cpucap_active()) {
> >                 /*
> >                  * Return true only if the cpu fully fits the task requirements
> >                  * which include the utilization and the performance hints.
> > +                *
> > +                * When SMT is active, also require that the core has no busy
> > +                * siblings.
> >                  */
> > -               return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +               return (!sched_smt_active() || is_core_idle(cpu)) &&
> > +                      (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +       }
> >
> >         return true;
> >  }
> > --
> > 2.54.0
> >

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-04-28 14:41 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
@ 2026-05-06 12:59   ` Vincent Guittot
  2026-05-06 17:01     ` Dietmar Eggemann
  0 siblings, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2026-05-06 12:59 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
>
> From: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan
> mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL)
> is enabled and the LLC domain has sched_domain_shared data, derive the
> per-attempt scan limit from sd->shared->nr_idle_scan.
>
> That bounds the walk on large LLCs and allows an early return once the
> scan limit is reached, if we already picked a sufficiently strong
> idle-core candidate (best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT).
>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a1f4d70f6b3d9..1cde3a9b1e0f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8018,6 +8018,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
>         int cpu, best_cpu = -1;
>         struct cpumask *cpus;
> +       int nr = INT_MAX;
>
>         cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> +               /*
> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> +                * the scan when not preferring an idle core.
> +                */
> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> +               /* overloaded domain is unlikely to have idle cpu/core */
> +               if (nr == 1)
> +                       return -1;
> +       }
> +
>         for_each_cpu_wrap(cpu, cpus, target) {
>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
>                 unsigned long cpu_cap = capacity_of(cpu);
>
> +               /*
> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
> +                */
> +               if (!prefers_idle_core &&
> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)

With SMT, !prefers_idle_core implies that there is no idle core. Is
best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in that case?

With !SMT, !prefers_idle_core is always true and we will bail out
early as expected.


> +                       return best_cpu;
> +
>                 if (!choose_idle_cpu(cpu, p))
>                         continue;
>
> --
> 2.54.0
>

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-06 12:59   ` Vincent Guittot
@ 2026-05-06 17:01     ` Dietmar Eggemann
  2026-05-06 18:11       ` Andrea Righi
  0 siblings, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-06 17:01 UTC (permalink / raw)
  To: Vincent Guittot, Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
	Joel Fernandes, Shrikanth Hegde, linux-kernel

On 06.05.26 14:59, Vincent Guittot wrote:
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
>>
>> From: K Prateek Nayak <kprateek.nayak@amd.com>

[...]

>> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
>>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>>
>> +       if (sched_feat(SIS_UTIL) && sd->shared) {
>> +               /*
>> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
>> +                * the scan when not preferring an idle core.
>> +                */
>> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
>> +               /* overloaded domain is unlikely to have idle cpu/core */
>> +               if (nr == 1)
>> +                       return -1;
>> +       }
>> +
>>         for_each_cpu_wrap(cpu, cpus, target) {
>>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
>>                 unsigned long cpu_cap = capacity_of(cpu);
>>
>> +               /*
>> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
>> +                */
>> +               if (!prefers_idle_core &&
>> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
> 
> With SMT, !prefers_idle_core implies that there is no idle core; Is
> best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in such case
> ?
> 
> With !SMT, !prefers_idle_core is always true and we will bail out
> early as expected

I struggle to comprehend:

I assume the mirrored select_idle_cpu() logic is:

    for_each_cpu_wrap(cpu, cpus, target + 1)

      if (has_idle_core)

      else
        if (--nr <= 0)
          return -1

Should this condition not be just:

  if (!prefers_idle_core && --nr <= 0)
    return best_cpu

since we do a:

  if (!choose_idle_cpu(cpu, p))
    continue;

right after that?

best_cpu is -1 by default, so sis() will return target. If we have
already found a best_cpu, sis() will return that instead.

What am I missing here?

> 
> 
>> +                       return best_cpu;
>> +
>>                 if (!choose_idle_cpu(cpu, p))
>>                         continue;

[...]

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-06 17:01     ` Dietmar Eggemann
@ 2026-05-06 18:11       ` Andrea Righi
  2026-05-07  6:47         ` Vincent Guittot
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Righi @ 2026-05-06 18:11 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hi Dietmar and Vincent,

On Wed, May 06, 2026 at 07:01:35PM +0200, Dietmar Eggemann wrote:
> On 06.05.26 14:59, Vincent Guittot wrote:
> > On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >>
> >> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> [...]
> 
> >> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
> >>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >>
> >> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> >> +               /*
> >> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> >> +                * the scan when not preferring an idle core.
> >> +                */
> >> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> >> +               /* overloaded domain is unlikely to have idle cpu/core */
> >> +               if (nr == 1)
> >> +                       return -1;
> >> +       }
> >> +
> >>         for_each_cpu_wrap(cpu, cpus, target) {
> >>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >>                 unsigned long cpu_cap = capacity_of(cpu);
> >>
> >> +               /*
> >> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
> >> +                */
> >> +               if (!prefers_idle_core &&
> >> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
> > 
> > With SMT, !prefers_idle_core implies that there is no idle core; Is
> > best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in such case
> > ?
> > 
> > With !SMT, !prefers_idle_core is always true and we will bail out
> > early as expected
> 
> I struggle to comprehend:
> 
> I assume the mirrored select_idle_cpu() logic is:
> 
>     for_each_cpu_wrap(cpu, cpus, target + 1)
> 
>       if (has_idle_core)
> 
>       else
>         if (--nr <= 0)
>           return -1

So, the logic in select_idle_cpu() is that as soon as nr <= 0, we stop the walk
and return -1, without any "only stop if the answer is good enough" guard.

With this change in select_idle_capacity() when nr is exhausted, we stop only if
best_cpu is "good enough" (ASYM_IDLE_CORE_UCLAMP_MISFIT), otherwise we keep
scanning. Therefore, we're not perfectly mirroring select_idle_cpu().

> 
> Should this condition not be just:
> 
>   if (!prefers_idle_core && --nr <= 0)
>     return best_cpu

I think this would match select_idle_cpu() more closely. However,
select_idle_cpu() doesn't have the "best partial idle placement" logic at all,
it either returns an idle CPU or -1.

I guess it's a policy decision here: do we want to mirror the scan bound exactly
(nr <= 0 -> hard stop) or allow extra scanning based on the ranking quality
(nr <= 0 -> stop early if satisfied)?
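
To make the two options concrete, here is a rough userspace sketch (a toy
model, not kernel code) of the !prefers_idle_core case. The names struct cand,
enum policy, NO_SCAN_LIMIT and scan() are made up for illustration, the
capacity handling is simplified, and choose_idle_cpu() / util_fits_cpu() are
replaced by precomputed fields:

```c
#include <limits.h>
#include <stdbool.h>

/* Rank values as defined by the asym_fits_state enum in patch 3/5. */
enum {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_BIAS = -3,
	ASYM_IDLE_COMPLETE_MISFIT = 0,
};

/* Hypothetical: mirrors "int nr = INT_MAX" when SIS_UTIL does not apply. */
#define NO_SCAN_LIMIT INT_MAX

/* Hypothetical stand-in for one CPU visited by the scan. */
struct cand {
	bool idle;		/* what choose_idle_cpu() would report */
	int raw_fits;		/* util_fits_cpu()-like: 1 fits, 0 misfit, -1 uclamp_min misfit */
	unsigned long cap;	/* capacity used as tie-break */
};

enum policy { HARD_STOP, STOP_IF_SATISFIED };

/*
 * Toy scan for the !prefers_idle_core case, where preferred_core is always
 * true and every non-fitting idle CPU gets the idle-core bias.
 * Returns the chosen index or -1; *visited counts the CPUs whose idleness
 * was actually checked.
 */
static int scan(const struct cand *c, int n, int nr, enum policy pol,
		int *visited)
{
	int best = -1, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
	unsigned long best_cap = 0;

	*visited = 0;
	for (int i = 0; i < n; i++) {
		int fits;

		/* HARD_STOP ignores ranking; the other policy needs the best rank. */
		if (--nr <= 0 &&
		    (pol == HARD_STOP || best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT))
			return best;

		(*visited)++;
		if (!c[i].idle)
			continue;

		/* A fully fitting CPU ends the scan immediately. */
		if (c[i].raw_fits > 0)
			return i;

		fits = c[i].raw_fits + ASYM_IDLE_CORE_BIAS;
		if (fits < best_fits || (fits == best_fits && c[i].cap > best_cap)) {
			best = i;
			best_fits = fits;
			best_cap = c[i].cap;
		}
	}
	return best;
}
```

For the candidates {busy, idle complete misfit, busy, idle uclamp_min misfit,
idle fits} and nr = 3, HARD_STOP returns index 1 after checking 2 CPUs, while
STOP_IF_SATISFIED keeps scanning and returns index 3 after checking 4.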

Thanks,
-Andrea

> 
> since if we do a:
> 
>   if (!choose_idle_cpu(cpu, p)))
>     continue;
> 
> right after that?
> 
> best_cpu is -1 by default so sis() will return target, in case we
> already found a best_cpu then sis() will return this instead.
> 
> What do I miss here?

* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-06 10:29   ` Vincent Guittot
  2026-05-06 12:34     ` Vincent Guittot
@ 2026-05-06 18:15     ` Andrea Righi
  1 sibling, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-06 18:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hi Vincent,

On Wed, May 06, 2026 at 12:29:10PM +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active, always prefer fully-idle SMT cores over partially-idle
> > ones.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running an amount of tasks equal to the amount of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 65 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bbdf537f61154..6a7e4943804b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >         return idle_cpu;
> >  }
> >
> > +/*
> > + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> > + * are more preferred (see select_idle_capacity()).
> > + */
> > +enum asym_fits_state {
> > +       /* In descending order of preference */
> > +       ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> > +       ASYM_IDLE_CORE_COMPLETE_MISFIT,
> > +       ASYM_IDLE_THREAD_FITS,
> > +       ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > +       ASYM_IDLE_COMPLETE_MISFIT,
> > +
> > +       /* util_fits_cpu() bias for an idle core. */
> > +       ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> > +       bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> >         unsigned long task_util, util_min, util_max, best_cap = 0;
> > -       int fits, best_fits = 0;
> > +       int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> >         int cpu, best_cpu = -1;
> >         struct cpumask *cpus;
> >
> > @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> >         for_each_cpu_wrap(cpu, cpus, target) {
> > +               bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >                 unsigned long cpu_cap = capacity_of(cpu);
> >
> >                 if (!choose_idle_cpu(cpu, p))
> > @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> >                 /* This CPU fits with all requirements */
> > -               if (fits > 0)
> > +               if (fits > 0 && preferred_core)
> >                         return cpu;
> >                 /*
> >                  * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                  */
> >                 else if (fits < 0)
> >                         cpu_cap = get_actual_cpu_capacity(cpu);
> > +               /*
> > +                * fits > 0 implies we are not on a preferred core
> > +                * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> > +                * so the effective range becomes
> > +                * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> > +                *    ASYM_IDLE_COMPLETE_MISFIT - does not fit
> > +                *    ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> > +                *    ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> > +                */
> > +               else if (fits > 0)
> > +                       fits = ASYM_IDLE_THREAD_FITS;
> > +
> > +               /*
> > +                * If we are on a preferred core, translate the range of fits
> > +                * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> > +                * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> > +                * This ensures that an idle core is always given priority over
> > +                * (partially) busy core.
> > +                *
> > +                * A fully fitting idle core would have returned early and hence
> > +                * fits > 0 for preferred_core need not be dealt with.
> > +                */
> > +               if (preferred_core)
> > +                       fits += ASYM_IDLE_CORE_BIAS;
> 
> It might be good to add a comment stating that if the system doesn't
> have SMT, prefers_idle_core is always false and preferred_core is
> always true.
> 
> This is okay because CPU == Core in this case but the value differs
> from the default 0 or -1 of util_fits_cpu

Ack.

> 
> >
> >                 /*
> > -                * First, select CPU which fits better (-1 being better than 0).
> > +                * First, select CPU which fits better (lower is more preferred).
> >                  * Then, select the one with best capacity at same level.
> >                  */
> >                 if ((fits < best_fits) ||
> > @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >                 }
> >         }
> >
> > +       /*
> > +        * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
> 
> ASYM_IDLE_CORE_BIAS is an offset to move an idle core that doesn't
> fully fit in the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
> ASYM_IDLE_CORE_COMPLETE_MISFIT]
> 
> Keeping in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT

Ah yes, using ASYM_IDLE_CORE_BIAS is just confusing, we should definitely use
[ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]. Will fix this.

> 
> > +        * range means the chosen CPU is in a fully idle SMT core. Values above
> > +        * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

Ack.

> 
> > +        *
> > +        * The asym-capacity wakeup path returns from select_idle_sibling()
> > +        * after this function and never runs select_idle_cpu(), so the usual
> > +        * select_idle_cpu() tail that clears idle cores must live here when the
> > +        * idle-core preference did not win.
> > +        */
> > +       if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
> 
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

Ack.

> 
> > +               set_idle_cores(target, false);
> > +
> >         return best_cpu;
> >  }
> >
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> >                                  unsigned long util_max,
> >                                  int cpu)
> >  {
> > -       if (sched_asym_cpucap_active())
> > +       if (sched_asym_cpucap_active()) {
> >                 /*
> >                  * Return true only if the cpu fully fits the task requirements
> >                  * which include the utilization and the performance hints.
> > +                *
> > +                * When SMT is active, also require that the core has no busy
> > +                * siblings.
> >                  */
> > -               return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +               return (!sched_smt_active() || is_core_idle(cpu)) &&
> > +                      (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +       }
> >
> >         return true;
> >  }
> > --
> > 2.54.0
> >

Thanks,
-Andrea

* Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
  2026-05-05 17:20   ` Dietmar Eggemann
@ 2026-05-06 18:31     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-06 18:31 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hi Dietmar,

On Tue, May 05, 2026 at 07:20:35PM +0200, Dietmar Eggemann wrote:
> On 28.04.26 16:41, Andrea Righi wrote:
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> 
> I assume those CPPC systems w/ different per-core frequencies (like your
> Vera) are the only real one which would make use of this. Mobile
> big.LITTLE/DynamIQ don't have SMT.
> 
> Phil mentioned other machines (PowerPC ?) which had issues with using
> select_idle_capacity():
> 
> https://lore.kernel.org/r/20260325124840.GA98184@pauld.westford.csb
> 
> [...]
> 
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running an amount of tasks equal to the amount of
> > SMT cores.
> 
> Just to make sure, this should be your internal NVBLAS benchmark. Is
> this 'ASYM (mainline) vs. ASYM + SMT' or 'NO_ASYM vs. ASYM + SMT' ? I
> try to match the cover letter's table numbers.

Yes, the 15-18% is with NVBLAS and it's NO_ASYM (mainline) vs ASYM + SMT. The
speedup of ASYM (mainline) vs ASYM + SMT is around +60% (keep in mind that with
this workload the SMT part plays a big role, because it creates exactly
nr_cpus/2 tasks => 1 task per SMT core, hence the big speedup number).

> 
> [...]
> 
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >  static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> > +	bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> 
> nit: why prefers_idle_core and not has_idle_core like in sis()?

Yeah, sounds good, I'll change to has_idle_core.

> 
> [...]
> 
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> >  				 unsigned long util_max,
> >  				 int cpu)
> >  {
> > -	if (sched_asym_cpucap_active())
> > +	if (sched_asym_cpucap_active()) {
> >  		/*
> >  		 * Return true only if the cpu fully fits the task requirements
> >  		 * which include the utilization and the performance hints.
> > +		 *
> > +		 * When SMT is active, also require that the core has no busy
> > +		 * siblings.
> >  		 */
> > -		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +		return (!sched_smt_active() || is_core_idle(cpu)) &&
> > +		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > +	}
> 
> Not sure whether this has been discussed already. This makes all early
> bailout conditions in sis() idle core aware for 'ASYM + SMT' but it's
> not for 'NO_ASYM'?

Yeah, that's another difference from NO_ASYM and I think it's worth a comment.
Maybe in the future it'd be interesting to see how NO_ASYM behaves with the same
idle core aware early bailout conditions (not for this series I'd say).
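
For reference, the difference can be written as a small truth function. This
is only a userspace sketch: asym_fits_cpu_model() is a made-up name, and its
boolean/int parameters stand in for sched_asym_cpucap_active(),
sched_smt_active(), is_core_idle() and the util_fits_cpu() result:

```c
#include <stdbool.h>

/*
 * Toy model of the patched asym_fits_cpu() predicate.
 * util_fits follows util_fits_cpu(): > 0 means the task fully fits.
 */
static bool asym_fits_cpu_model(bool asym_active, bool smt_active,
				bool core_idle, int util_fits)
{
	if (asym_active) {
		/* ASYM + SMT: a busy sibling vetoes the early bailout. */
		return (!smt_active || core_idle) && util_fits > 0;
	}

	/* NO_ASYM: always true, regardless of sibling state. */
	return true;
}
```

So with ASYM + SMT an idle CPU with a busy sibling no longer satisfies the
early bailout checks, while with NO_ASYM the same CPU still does.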

Thanks,
-Andrea

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-06 18:11       ` Andrea Righi
@ 2026-05-07  6:47         ` Vincent Guittot
  2026-05-08 14:49           ` Dietmar Eggemann
  0 siblings, 1 reply; 34+ messages in thread
From: Vincent Guittot @ 2026-05-07  6:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

On Wed, 6 May 2026 at 20:11, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Dietmar and Vincent,
>
> On Wed, May 06, 2026 at 07:01:35PM +0200, Dietmar Eggemann wrote:
> > On 06.05.26 14:59, Vincent Guittot wrote:
> > > On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> > >>
> > >> From: K Prateek Nayak <kprateek.nayak@amd.com>
> >
> > [...]
> >
> > >> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
> > >>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> > >>
> > >> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> > >> +               /*
> > >> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> > >> +                * the scan when not preferring an idle core.
> > >> +                */
> > >> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> > >> +               /* overloaded domain is unlikely to have idle cpu/core */
> > >> +               if (nr == 1)
> > >> +                       return -1;
> > >> +       }
> > >> +
> > >>         for_each_cpu_wrap(cpu, cpus, target) {
> > >>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > >>                 unsigned long cpu_cap = capacity_of(cpu);
> > >>
> > >> +               /*
> > >> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
> > >> +                */
> > >> +               if (!prefers_idle_core &&
> > >> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
> > >
> > > With SMT, !prefers_idle_core implies that there is no idle core; Is
> > > best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in such case
> > > ?
> > >
> > > With !SMT, !prefers_idle_core is always true and we will bail out
> > > early as expected
> >
> > I struggle to comprehend:
> >
> > I assume the mirrored select_idle_cpu() logic is:
> >
> >     for_each_cpu_wrap(cpu, cpus, target + 1)
> >
> >       if (has_idle_core)
> >
> >       else
> >         if (--nr <= 0)
> >           return -1
>
> So, the logic in select_idle_cpu() is that as soon as nr <= 0, we stops the walk
> and returns -1, without any "only stop if the answer is good enough" guard.
>
> With this change in select_idle_capacity() when nr is exhausted, we stop only if
> best_cpu is "good enough" (ASYM_IDLE_CORE_UCLAMP_MISFIT), otherwise we keep
> scanning. Therefore, we're not perfectly mirroring select_idle_cpu().

Okay, one reason for my confusion is the following:

With !SMT, preferred_core is always true and CPU == core in asym_fits_state

With SMT and test_idle_cores being true, preferred_core reflects
core/CPU idleness

But with SMT and test_idle_cores being false, prefers_idle_core is
false, so preferred_core is always true and we are back to the !SMT
case where CPU == core in the asym_fits_state

So the condition is relevant:
  if (!prefers_idle_core && --nr <= 0 &&
      best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)

We need a better description of which asym_fits_state range is used in
which conditions
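
One possible way to spell that out is the userspace sketch below (not kernel
code). The enum is copied from patch 3/5; preferred_core(), asym_rank() and
FITS_RETURN_NOW are made-up names that only model how the loop in
select_idle_capacity() transforms the util_fits_cpu() result:

```c
#include <stdbool.h>

enum asym_fits_state {
	/* In descending order of preference */
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_COMPLETE_MISFIT,

	/* util_fits_cpu() bias for an idle core. */
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Hypothetical marker: the real loop returns the CPU right away here. */
#define FITS_RETURN_NOW 1

/* Models: preferred_core = !prefers_idle_core || is_core_idle(cpu). */
static bool preferred_core(bool smt_active, bool idle_cores_hint,
			   bool core_idle)
{
	bool prefers_idle_core = smt_active && idle_cores_hint;

	return !prefers_idle_core || core_idle;
}

/*
 * raw follows util_fits_cpu(): 1 fits, 0 complete misfit,
 * -1 only uclamp_min does not fit.
 */
static int asym_rank(int raw, bool preferred)
{
	if (raw > 0 && preferred)
		return FITS_RETURN_NOW;
	if (raw > 0)
		raw = ASYM_IDLE_THREAD_FITS;	/* fits, but sibling is busy */
	if (preferred)
		raw += ASYM_IDLE_CORE_BIAS;	/* shift into the CORE_* range */
	return raw;
}
```

In this model the THREAD_* values and the unbiased ASYM_IDLE_COMPLETE_MISFIT
can only show up with SMT and the idle-cores hint set. Without SMT, or with
SMT and no idle-core hint, every non-fitting idle CPU lands in
[ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].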

>
> >
> > Should this condition not be just:
> >
> >   if (!prefers_idle_core && --nr <= 0)
> >     return best_cpu
>
> I think this would match more closely select_idle_cpu(). However,
> select_idle_cpu() doesn't have the "best partial idle placement" logic at all,
> it either returns an idle CPU or -1.
>
> I guess it's a policy decision here: do we want to mirror exactly the scan bound
> (nr <= 0 -> hard stop) or allow extra scan based on the ranking quality
> (nr <= 0 -> stop early if satisfied)?

The current proposal is OK for me:
With SMT and an idle core, we loop until we find the best idle core.
Without SMT or without an idle core, we loop until we find a CPU on which
the task utilization matches at least the max capacity.

>
> Thanks,
> -Andrea
>
> >
> > since if we do a:
> >
> >   if (!choose_idle_cpu(cpu, p)))
> >     continue;
> >
> > right after that?
> >
> > best_cpu is -1 by default so sis() will return target, in case we
> > already found a best_cpu then sis() will return this instead.
> >
> > What do I miss here?

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-07  6:47         ` Vincent Guittot
@ 2026-05-08 14:49           ` Dietmar Eggemann
  2026-05-08 22:05             ` Andrea Righi
  0 siblings, 1 reply; 34+ messages in thread
From: Dietmar Eggemann @ 2026-05-08 14:49 UTC (permalink / raw)
  To: Vincent Guittot, Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
	Joel Fernandes, Shrikanth Hegde, linux-kernel



On 07.05.26 08:47, Vincent Guittot wrote:
> On Wed, 6 May 2026 at 20:11, Andrea Righi <arighi@nvidia.com> wrote:
>>
>> Hi Dietmar and Vincent,
>>
>> On Wed, May 06, 2026 at 07:01:35PM +0200, Dietmar Eggemann wrote:
>>> On 06.05.26 14:59, Vincent Guittot wrote:
>>>> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
>>>>>
>>>>> From: K Prateek Nayak <kprateek.nayak@amd.com>
>>>
>>> [...]
>>>
>>>>> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>>>>>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
>>>>>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
>>>>>
>>>>> +       if (sched_feat(SIS_UTIL) && sd->shared) {
>>>>> +               /*
>>>>> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
>>>>> +                * the scan when not preferring an idle core.
>>>>> +                */
>>>>> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
>>>>> +               /* overloaded domain is unlikely to have idle cpu/core */
>>>>> +               if (nr == 1)
>>>>> +                       return -1;
>>>>> +       }
>>>>> +
>>>>>         for_each_cpu_wrap(cpu, cpus, target) {
>>>>>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
>>>>>                 unsigned long cpu_cap = capacity_of(cpu);
>>>>>
>>>>> +               /*
>>>>> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
>>>>> +                */
>>>>> +               if (!prefers_idle_core &&
>>>>> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
>>>>
>>>> With SMT, !prefers_idle_core implies that there is no idle core; Is
>>>> best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in such case
>>>> ?
>>>>
>>>> With !SMT, !prefers_idle_core is always true and we will bail out
>>>> early as expected
>>>
>>> I struggle to comprehend:
>>>
>>> I assume the mirrored select_idle_cpu() logic is:
>>>
>>>     for_each_cpu_wrap(cpu, cpus, target + 1)
>>>
>>>       if (has_idle_core)
>>>
>>>       else
>>>         if (--nr <= 0)
>>>           return -1
>>
>> So, the logic in select_idle_cpu() is that as soon as nr <= 0, we stops the walk
>> and returns -1, without any "only stop if the answer is good enough" guard.
>>
>> With this change in select_idle_capacity() when nr is exhausted, we stop only if
>> best_cpu is "good enough" (ASYM_IDLE_CORE_UCLAMP_MISFIT), otherwise we keep
>> scanning. Therefore, we're not perfectly mirroring select_idle_cpu().

But when '--nr <= 0', does it actually make sense to continue scanning
for an _idle_ CPU?

  for_each_cpu_wrap(cpu, cpus, target)

    if (!prefers_idle_core &&
      --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
        return best_cpu;

    if (!choose_idle_cpu(cpu, p))   <--- !!!
      continue;

I thought we want to bail since it doesn't. The likelihood that
choose_idle_cpu() will return 0 is high, so from the point where
'--nr <= 0' holds we would most likely not reach the code that updates
best_cpu anymore?

Isn't this similar to select_idle_cpu()?

  for_each_cpu_wrap(cpu, cpus, target + 1)

    else
      if (--nr <= 0)
        return -1;
      idle_cpu = __select_idle_cpu(cpu, p);
                   choose_idle_cpu(cpu, p)
      if ((unsigned int)idle_cpu < nr_cpumask_bits)
        break;

[...]

* Re: [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity()
  2026-05-08 14:49           ` Dietmar Eggemann
@ 2026-05-08 22:05             ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-08 22:05 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
	Balbir Singh, Joel Fernandes, Shrikanth Hegde, linux-kernel

Hi Dietmar,

On Fri, May 08, 2026 at 04:49:06PM +0200, Dietmar Eggemann wrote:
> On 07.05.26 08:47, Vincent Guittot wrote:
> > On Wed, 6 May 2026 at 20:11, Andrea Righi <arighi@nvidia.com> wrote:
> >>
> >> Hi Dietmar and Vincent,
> >>
> >> On Wed, May 06, 2026 at 07:01:35PM +0200, Dietmar Eggemann wrote:
> >>> On 06.05.26 14:59, Vincent Guittot wrote:
> >>>> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@nvidia.com> wrote:
> >>>>>
> >>>>> From: K Prateek Nayak <kprateek.nayak@amd.com>
> >>>
> >>> [...]
> >>>
> >>>>> @@ -8026,10 +8027,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >>>>>         util_min = uclamp_eff_value(p, UCLAMP_MIN);
> >>>>>         util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >>>>>
> >>>>> +       if (sched_feat(SIS_UTIL) && sd->shared) {
> >>>>> +               /*
> >>>>> +                * Same nr_idle_scan hint as select_idle_cpu(), nr only limits
> >>>>> +                * the scan when not preferring an idle core.
> >>>>> +                */
> >>>>> +               nr = READ_ONCE(sd->shared->nr_idle_scan) + 1;
> >>>>> +               /* overloaded domain is unlikely to have idle cpu/core */
> >>>>> +               if (nr == 1)
> >>>>> +                       return -1;
> >>>>> +       }
> >>>>> +
> >>>>>         for_each_cpu_wrap(cpu, cpus, target) {
> >>>>>                 bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> >>>>>                 unsigned long cpu_cap = capacity_of(cpu);
> >>>>>
> >>>>> +               /*
> >>>>> +                * Good-enough early exit (mirrors select_idle_cpu() logic).
> >>>>> +                */
> >>>>> +               if (!prefers_idle_core &&
> >>>>> +                   --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
> >>>>
> >>>> With SMT, !prefers_idle_core implies that there is no idle core; Is
> >>>> best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT really expected in such case
> >>>> ?
> >>>>
> >>>> With !SMT, !prefers_idle_core is always true and we will bail out
> >>>> early as expected
> >>>
> >>> I struggle to comprehend:
> >>>
> >>> I assume the mirrored select_idle_cpu() logic is:
> >>>
> >>>     for_each_cpu_wrap(cpu, cpus, target + 1)
> >>>
> >>>       if (has_idle_core)
> >>>
> >>>       else
> >>>         if (--nr <= 0)
> >>>           return -1
> >>
> >> So, the logic in select_idle_cpu() is that as soon as nr <= 0, we stop the walk
> >> and return -1, without any "only stop if the answer is good enough" guard.
> >>
> >> With this change in select_idle_capacity() when nr is exhausted, we stop only if
> >> best_cpu is "good enough" (ASYM_IDLE_CORE_UCLAMP_MISFIT), otherwise we keep
> >> scanning. Therefore, we're not perfectly mirroring select_idle_cpu().
> 
> But when '--nr <= 0', does it actually make sense to continue scanning
> for an _idle_ CPU?
> 
>   for_each_cpu_wrap(cpu, cpus, target)
> 
>     if (!prefers_idle_core &&
>       --nr <= 0 && best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT)
>         return best_cpu;
> 
>     if (!choose_idle_cpu(cpu, p))   <--- !!!
>       continue;

Hm... yeah and only an idle CPU can update best_fits via the ranking down below:

	/*
	 * First, select CPU which fits better (lower is more preferred).
	 * Then, select the one with best capacity at same level.
	 */
	if ((fits < best_fits) ||
	    ((fits == best_fits) && (cpu_cap > best_cap))) {
		best_cap = cpu_cap;
		best_cpu = cpu;
		best_fits = fits;
	}

So, we'll likely continue iterating on choose_idle_cpu() and the chance of
best_fits flipping to ASYM_IDLE_CORE_UCLAMP_MISFIT after nr is exhausted is low.
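
For illustration only, the ranking above can be modeled in plain userspace C.
This is a sketch, not kernel code: struct best_pick and update_best() are
made-up names, and the fit levels are just integers where lower is more
preferred.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the candidate tracked during the scan. */
struct best_pick {
	int cpu;		/* -1 while no idle candidate has been seen */
	int fits;		/* fit level, lower is more preferred */
	unsigned long cap;	/* capacity of the current candidate */
};

/*
 * Apply the ranking rule to one *idle* candidate: a better fit level wins,
 * and at the same fit level the larger capacity wins.
 * Returns true if the candidate replaced the current best.
 */
static bool update_best(struct best_pick *best, int cpu, int fits,
			unsigned long cpu_cap)
{
	assert(best);

	if ((fits < best->fits) ||
	    ((fits == best->fits) && (cpu_cap > best->cap))) {
		best->cap = cpu_cap;
		best->cpu = cpu;
		best->fits = fits;
		return true;
	}
	return false;
}
```

A busy CPU never reaches update_best() in this model, which is why best_fits
can only improve when the scan still finds idle CPUs.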

> 
> I thought we want to bail since it doesn't make sense. The likelihood that
> choose_idle_cpu() will return 0 is high, so once '--nr <= 0' is hit we would
> most likely not reach the code that alters best_cpu anymore?
> 
> Isn't this similar to select_idle_cpu()?
> 
>   for_each_cpu_wrap(cpu, cpus, target + 1)
> 
>     else
>       if (--nr <= 0)
>         return -1;
>       idle_cpu = __select_idle_cpu(cpu, p);
>                    choose_idle_cpu(cpu, p)
>       if ((unsigned int)idle_cpu < nr_cpumask_bits)
>         break;

Yes, with that said, I think the right thing to do is to just mirror
select_idle_cpu() unconditionally and do:

    if (!prefers_idle_core && --nr <= 0)
        return best_cpu;

If we all agree on this, I'll fold this change into the next version (and re-test).
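
To make the difference between the two bail-out conditions concrete, here is a
small userspace model. It is only a sketch under stated assumptions: struct
model_cpu, model_scan() and the FIT_* levels are hypothetical, FIT_GOOD_ENOUGH
merely stands in for ASYM_IDLE_CORE_UCLAMP_MISFIT, prefers_idle_core is assumed
to be false, and the "task fully fits" early return is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fit levels, lower is more preferred. */
enum { FIT_GOOD_ENOUGH = 1, FIT_NONE = 3 };

struct model_cpu {
	bool idle;		/* models choose_idle_cpu() succeeding */
	int fits;		/* fit level of this CPU for the task */
	unsigned long cap;	/* capacity of this CPU */
};

/*
 * Walk the CPUs with a scan budget of @nr.
 * guarded == true:  bail only if the budget is exhausted AND the current
 *                   best is already "good enough".
 * guarded == false: bail as soon as the budget is exhausted.
 * @visited counts the CPUs examined after the bail-out check.
 */
static int model_scan(const struct model_cpu *cpus, size_t nr_cpus, int nr,
		      bool guarded, size_t *visited)
{
	unsigned long best_cap = 0;
	int best_fits = FIT_NONE;
	int best_cpu = -1;
	size_t cpu;

	assert(cpus && visited);
	*visited = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		bool exhausted = --nr <= 0;

		if (exhausted && (!guarded || best_fits == FIT_GOOD_ENOUGH))
			return best_cpu;

		(*visited)++;
		if (!cpus[cpu].idle)
			continue;	/* busy CPUs cannot change best_cpu */

		if ((cpus[cpu].fits < best_fits) ||
		    ((cpus[cpu].fits == best_fits) && (cpus[cpu].cap > best_cap))) {
			best_cap = cpus[cpu].cap;
			best_cpu = (int)cpu;
			best_fits = cpus[cpu].fits;
		}
	}

	return best_cpu;
}
```

With 8 CPUs where only CPU 6 is idle (at a fit level worse than
FIT_GOOD_ENOUGH) and nr = 3, the unconditional variant examines 2 CPUs and
returns -1, while the guarded variant walks all 8 CPUs before returning 6. So
in this model the guard effectively disables the scan limit on a mostly busy
domain.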

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:01 Andrea Righi
@ 2026-05-09 18:01 ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-09 18:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.

The same applies to set_cpu_sd_state_idle(), called from the idle entry
path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
teardown, which runs under cpus_write_lock(), so it cannot race with
sched-domain rebuilds). In both cases the rcu_dereference_all()
validation is sufficient.

No functional change intended.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..6b059ee80b631 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
 	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
@@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-05-09 18:07 ` Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-09 18:07 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

nohz_balancer_kick() is reached from sched_balance_trigger(), which is
called from sched_tick(). sched_tick() runs with IRQs disabled, so the
additional rcu_read_lock/unlock() used around sched_domain accesses in
this path is redundant. Rely on the existing IRQ-disabled context (and
the rcu_dereference_all() checking) instead.

The same applies to set_cpu_sd_state_idle(), called from the idle entry
path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
teardown, which runs under cpus_write_lock(), so it cannot race with
sched-domain rebuilds). In both cases the rcu_dereference_all()
validation is sufficient.

No functional change intended.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f9823..6b059ee80b631 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		goto out;
 	}
 
-	rcu_read_lock();
-
 	sd = rcu_dereference_all(rq->sd);
 	if (sd) {
 		/*
@@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * capacity, kick the ILB to see if there's a better CPU to run on:
 		 */
 		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+			goto out;
 		}
 	}
 
@@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 */
 		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
 			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
+				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
+				goto out;
 			}
 		}
 	}
@@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
 		 * to run the misfit task on.
 		 */
-		if (check_misfit_status(rq)) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (check_misfit_status(rq))
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 
 		/*
 		 * For asymmetric systems, we do not want to nicely balance
@@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
 		 *
 		 * Skip the LLC logic because it's not relevant in that case.
 		 */
-		goto unlock;
+		goto out;
 	}
 
 	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
@@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
 		 * like this LLC domain has tasks we could move.
 		 */
 		nr_busy = atomic_read(&sds->nr_busy_cpus);
-		if (nr_busy > 1) {
-			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-			goto unlock;
-		}
+		if (nr_busy > 1)
+			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 	}
-unlock:
-	rcu_read_unlock();
 out:
 	if (READ_ONCE(nohz.needs_update))
 		flags |= NOHZ_NEXT_KICK;
@@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
 static void set_cpu_sd_state_busy(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || !sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 0;
 
 	atomic_inc(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 void nohz_balance_exit_idle(struct rq *rq)
@@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
 static void set_cpu_sd_state_idle(int cpu)
 {
 	struct sched_domain *sd;
-
-	rcu_read_lock();
 	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
 
 	if (!sd || sd->nohz_idle)
-		goto unlock;
+		return;
 	sd->nohz_idle = 1;
 
 	atomic_dec(&sd->shared->nr_busy_cpus);
-unlock:
-	rcu_read_unlock();
 }
 
 /*
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
@ 2026-05-11 13:04   ` Vincent Guittot
  2026-05-15  6:49   ` Shrikanth Hegde
  2026-05-21 19:47   ` Marek Szyprowski
  2 siblings, 0 replies; 34+ messages in thread
From: Vincent Guittot @ 2026-05-11 13:04 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, Shrikanth Hegde,
	linux-kernel

On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@nvidia.com> wrote:
>
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
>
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
>
> No functional change intended.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 38 +++++++++++---------------------------
>  1 file changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..6b059ee80b631 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
>                 goto out;
>         }
>
> -       rcu_read_lock();
> -
>         sd = rcu_dereference_all(rq->sd);
>         if (sd) {
>                 /*
> @@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * capacity, kick the ILB to see if there's a better CPU to run on:
>                  */
>                 if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +                       goto out;
>                 }
>         }
>
> @@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  */
>                 for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>                         if (sched_asym(sd, i, cpu)) {
> -                               flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                               goto unlock;
> +                               flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +                               goto out;
>                         }
>                 }
>         }
> @@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>                  * to run the misfit task on.
>                  */
> -               if (check_misfit_status(rq)) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> -               }
> +               if (check_misfit_status(rq))
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>
>                 /*
>                  * For asymmetric systems, we do not want to nicely balance
> @@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
>                  *
>                  * Skip the LLC logic because it's not relevant in that case.
>                  */
> -               goto unlock;
> +               goto out;
>         }
>
>         sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> @@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
>                  * like this LLC domain has tasks we could move.
>                  */
>                 nr_busy = atomic_read(&sds->nr_busy_cpus);
> -               if (nr_busy > 1) {
> -                       flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -                       goto unlock;
> -               }
> +               if (nr_busy > 1)
> +                       flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>         }
> -unlock:
> -       rcu_read_unlock();
>  out:
>         if (READ_ONCE(nohz.needs_update))
>                 flags |= NOHZ_NEXT_KICK;
> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>  static void set_cpu_sd_state_busy(int cpu)
>  {
>         struct sched_domain *sd;
> -
> -       rcu_read_lock();
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
>         if (!sd || !sd->nohz_idle)
> -               goto unlock;
> +               return;
>         sd->nohz_idle = 0;
>
>         atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -       rcu_read_unlock();
>  }
>
>  void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>  static void set_cpu_sd_state_idle(int cpu)
>  {
>         struct sched_domain *sd;
> -
> -       rcu_read_lock();
>         sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>
>         if (!sd || sd->nohz_idle)
> -               goto unlock;
> +               return;
>         sd->nohz_idle = 1;
>
>         atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -       rcu_read_unlock();
>  }
>
>  /*
> --
> 2.54.0
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
@ 2026-05-15  6:49   ` Shrikanth Hegde
  2026-05-16  5:45     ` Andrea Righi
  2026-05-21 19:47   ` Marek Szyprowski
  2 siblings, 1 reply; 34+ messages in thread
From: Shrikanth Hegde @ 2026-05-15  6:49 UTC (permalink / raw)
  To: Andrea Righi, K Prateek Nayak, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, Phil Auld, Koba Ko,
	Felix Abecassis, Balbir Singh, Joel Fernandes, linux-kernel



On 5/9/26 11:37 PM, Andrea Righi wrote:
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
> 
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
> 
> No functional change intended.
> 

For this patch (a few more comments below):

Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>

> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>


> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>   static void set_cpu_sd_state_busy(int cpu)
>   {
>   	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>   
>   	if (!sd || !sd->nohz_idle)
> -		goto unlock;
> +		return;
>   	sd->nohz_idle = 0;
>   
>   	atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>   }
>   
>   void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>   static void set_cpu_sd_state_idle(int cpu)
>   {
>   	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>   
>   	if (!sd || sd->nohz_idle)
> -		goto unlock;
> +		return;
>   	sd->nohz_idle = 1;
>   
>   	atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>   }
>   
>   /*

I was looking at other users of sd_llc, i.e. test_idle_cores() and set_idle_cores().
They use rcu_dereference_all(), so callers need not take rcu_read_lock/unlock() if
IRQs or preemption are disabled.

One more place would be update_idle_core(). I think it is called with interrupts disabled
in the __schedule() path.

And in sched_ext, scx_idle_update_selcpu_topology() seems to be tied to CPU hotplug, so
by the same logic (cpus_write_lock() held) one could remove the redundant rcu_read_lock() there as well.

No?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-15  6:49   ` Shrikanth Hegde
@ 2026-05-16  5:45     ` Andrea Righi
  2026-05-16 17:15       ` Shrikanth Hegde
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Righi @ 2026-05-16  5:45 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel

Hi Shrikanth,

On Fri, May 15, 2026 at 12:19:16PM +0530, Shrikanth Hegde wrote:
> 
> 
> On 5/9/26 11:37 PM, Andrea Righi wrote:
> > nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> > called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> > additional rcu_read_lock/unlock() used around sched_domain accesses in
> > this path is redundant. Rely on the existing IRQ-disabled context (and
> > the rcu_dereference_all() checking) instead.
> > 
> > The same applies to set_cpu_sd_state_idle(), called from the idle entry
> > path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> > nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> > disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> > teardown, which runs under cpus_write_lock(), so it cannot race with
> > sched-domain rebuilds). In both cases the rcu_dereference_all()
> > validation is sufficient.
> > 
> > No functional change intended.
> > 
> 
> For this patch, few more comments below.
> 
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> 
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> 
> 
> > @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
> >   static void set_cpu_sd_state_busy(int cpu)
> >   {
> >   	struct sched_domain *sd;
> > -
> > -	rcu_read_lock();
> >   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
> >   	if (!sd || !sd->nohz_idle)
> > -		goto unlock;
> > +		return;
> >   	sd->nohz_idle = 0;
> >   	atomic_inc(&sd->shared->nr_busy_cpus);
> > -unlock:
> > -	rcu_read_unlock();
> >   }
> >   void nohz_balance_exit_idle(struct rq *rq)
> > @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
> >   static void set_cpu_sd_state_idle(int cpu)
> >   {
> >   	struct sched_domain *sd;
> > -
> > -	rcu_read_lock();
> >   	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
> >   	if (!sd || sd->nohz_idle)
> > -		goto unlock;
> > +		return;
> >   	sd->nohz_idle = 1;
> >   	atomic_dec(&sd->shared->nr_busy_cpus);
> > -unlock:
> > -	rcu_read_unlock();
> >   }
> >   /*
> 
> I was looking at other users of sd_llc, i.e test_idle_core and set_idle_core.
> They have rcu_dereference_all. So callers need not call rcu_read_lock/unlock if
> the irq disabled/preempt_disabled.
> 
> One more place would be update_idle_core. I think it is called with interrupt disabled
> in __schedule path.

Good point, __update_idle_core() is reached from set_next_task_idle() via
pick_next_task() in __schedule(), and __schedule() disables IRQs before that
path.

Since set_idle_cores()/test_idle_cores() use rcu_dereference_all(), the
rcu_read_lock/unlock() pair in __update_idle_core() is indeed redundant. I can
send a follow-up patch for this.

> 
> And in sched_ext, scx_idle_update_selcpu_topology, It seems to be tied to cpu hotplug and
> by same logic of cpus_write_lock held, one could remove redundant rcu_read_lock there as well.
> 
> No?

For scx_idle_update_selcpu_topology() it's a bit more nuanced, if I'm not
missing anything:
 - the helpers it uses (llc_weight/llc_span/numa_weight/numa_span) use plain
   rcu_dereference(), so simply dropping rcu_read_lock() in the caller would
   trip the lockdep check. They'd need to be converted to rcu_dereference_all()
   first;
 - the two call sites have different protection:
    - handle_hotplug() runs from a CPU hotplug callback, so cpus_write_lock()
      is held, which serializes against sched-domain rebuilds;
    - scx_enable() only holds cpus_read_lock(), which doesn't on
      its own prevent cpuset sched-domain rebuilds (those run under
      cpus_read_lock() too).
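
The first point can be pictured with a toy userspace model of the two lockdep
conditions as described in this thread. It is an assumption-laden sketch, not
the kernel implementation: struct model_ctx, model_deref_ok() and
model_deref_all_ok() are made-up names, and it says nothing about the second
point (whether a rebuild can actually race).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the calling context. */
struct model_ctx {
	int rcu_read_depth;	/* nesting level of rcu_read_lock() */
	bool irqs_disabled;
	bool preempt_disabled;
};

/* Models the check behind plain rcu_dereference(): needs rcu_read_lock(). */
static bool model_deref_ok(const struct model_ctx *c)
{
	assert(c);
	return c->rcu_read_depth > 0;
}

/*
 * Models the check behind rcu_dereference_all(): rcu_read_lock(), disabled
 * IRQs or disabled preemption are all accepted as a read-side context.
 */
static bool model_deref_all_ok(const struct model_ctx *c)
{
	assert(c);
	return c->rcu_read_depth > 0 || c->irqs_disabled || c->preempt_disabled;
}
```

In this model a caller running with IRQs disabled but without rcu_read_lock()
passes model_deref_all_ok() and fails model_deref_ok(), which is the situation
the helpers would be in if only the caller's rcu_read_lock() were dropped.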

I think this one needs a separate, more careful patch. Maybe we should keep this
series scoped to the NOHZ kick path and address those as follow-ups?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-16  5:45     ` Andrea Righi
@ 2026-05-16 17:15       ` Shrikanth Hegde
  0 siblings, 0 replies; 34+ messages in thread
From: Shrikanth Hegde @ 2026-05-16 17:15 UTC (permalink / raw)
  To: Andrea Righi
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	linux-kernel



On 5/16/26 11:15 AM, Andrea Righi wrote:
> Hi Shrikanth,
> 
> On Fri, May 15, 2026 at 12:19:16PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 5/9/26 11:37 PM, Andrea Righi wrote:
>>> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
>>> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
>>> additional rcu_read_lock/unlock() used around sched_domain accesses in
>>> this path is redundant. Rely on the existing IRQ-disabled context (and
>>> the rcu_dereference_all() checking) instead.
>>>
>>> The same applies to set_cpu_sd_state_idle(), called from the idle entry
>>> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
>>> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
>>> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
>>> teardown, which runs under cpus_write_lock(), so it cannot race with
>>> sched-domain rebuilds). In both cases the rcu_dereference_all()
>>> validation is sufficient.
>>>
>>> No functional change intended.
>>>
>>
>> For this patch, few more comments below.
>>
>> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Signed-off-by: Andrea Righi <arighi@nvidia.com>
>>
>>
>>> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>>>    static void set_cpu_sd_state_busy(int cpu)
>>>    {
>>>    	struct sched_domain *sd;
>>> -
>>> -	rcu_read_lock();
>>>    	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>>>    	if (!sd || !sd->nohz_idle)
>>> -		goto unlock;
>>> +		return;
>>>    	sd->nohz_idle = 0;
>>>    	atomic_inc(&sd->shared->nr_busy_cpus);
>>> -unlock:
>>> -	rcu_read_unlock();
>>>    }
>>>    void nohz_balance_exit_idle(struct rq *rq)
>>> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>>>    static void set_cpu_sd_state_idle(int cpu)
>>>    {
>>>    	struct sched_domain *sd;
>>> -
>>> -	rcu_read_lock();
>>>    	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>>>    	if (!sd || sd->nohz_idle)
>>> -		goto unlock;
>>> +		return;
>>>    	sd->nohz_idle = 1;
>>>    	atomic_dec(&sd->shared->nr_busy_cpus);
>>> -unlock:
>>> -	rcu_read_unlock();
>>>    }
>>>    /*
>>
>> I was looking at other users of sd_llc, i.e test_idle_core and set_idle_core.
>> They have rcu_dereference_all. So callers need not call rcu_read_lock/unlock if
>> the irq disabled/preempt_disabled.
>>
>> One more place would be update_idle_core. I think it is called with interrupt disabled
>> in __schedule path.
> 
> Good point, __update_idle_core() reaches set_next_task_idle() via
> pick_next_task() in __schedule(), and __schedule() disables IRQs before that
> path.
> 
> Since set_idle_cores()/test_idle_cores() use rcu_dereference_all(), the
> rcu_read_lock/unlock() pair in __update_idle_core() is indeed redundant. I can
> send a follow-up patch for this.
> 

Thanks.

>>
>> And in sched_ext, scx_idle_update_selcpu_topology, It seems to be tied to cpu hotplug and
>> by same logic of cpus_write_lock held, one could remove redundant rcu_read_lock there as well.
>>
>> No?
> 
> For scx_idle_update_selcpu_topology() it's a bit more nuanced, if I'm not
> missing anything:
>   - the helpers it uses (llc_weight/llc_span/numa_weight/numa_span) use plain
>     rcu_dereference(), so simply dropping rcu_read_lock() in the caller would
>     trip the lockdep check. They'd need to be converted to rcu_dereference_all()
>     first;
>   - the two call sites have different protection:
>      - handle_hotplug() runs from a CPU hotplug callback, so cpus_write_lock()
>        is held, serializes against sched-domain rebuilds,
>      - scx_enable() only holds cpus_read_lock(), which doesn't on
>        its own prevent cpuset sched-domain rebuilds (those run under
>        cpus_read_lock() too).
> 
> I think this one needs a separate, more careful patch. Maybe we should keep this
> series scoped to the NOHZ kick path and address those as follow-ups?
> 
> Thanks,
> -Andrea

Yes. That makes sense.
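
The distinction drawn above between plain rcu_dereference() and
rcu_dereference_all() can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: struct ctx_model and the
plain_deref_ok(), all_deref_ok() and model_rcu_read_lock/unlock() helpers are
hypothetical names invented for illustration, not kernel APIs, and the model
ignores the softirq-disabled case for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the caller's execution context. */
struct ctx_model {
	int  rcu_read_depth;	/* nesting level of explicit RCU read sections */
	bool irqs_disabled;
	bool preempt_disabled;
};

/* Stand-ins for rcu_read_lock()/rcu_read_unlock() acting on the model. */
static void model_rcu_read_lock(struct ctx_model *c)
{
	c->rcu_read_depth++;
}

static void model_rcu_read_unlock(struct ctx_model *c)
{
	assert(c->rcu_read_depth > 0);	/* unbalanced unlock is a bug */
	c->rcu_read_depth--;
}

/*
 * Models the check behind plain rcu_dereference(): only an explicit
 * RCU read-side section satisfies it.
 */
static bool plain_deref_ok(const struct ctx_model *c)
{
	return c->rcu_read_depth > 0;
}

/*
 * Models the check behind rcu_dereference_all() as discussed in this
 * thread: an explicit read section, disabled IRQs or disabled
 * preemption are all accepted (simplified assumption).
 */
static bool all_deref_ok(const struct ctx_model *c)
{
	return c->rcu_read_depth > 0 || c->irqs_disabled ||
	       c->preempt_disabled;
}
```

In this model a caller with IRQs disabled and no explicit read section passes
all_deref_ok() but not plain_deref_ok(), which is why helpers using plain
rcu_dereference() would first need converting before the caller's
rcu_read_lock() could be dropped.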


* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
  2026-05-11 13:04   ` Vincent Guittot
  2026-05-15  6:49   ` Shrikanth Hegde
@ 2026-05-21 19:47   ` Marek Szyprowski
  2026-05-21 20:13     ` Andrea Righi
  2 siblings, 1 reply; 34+ messages in thread
From: Marek Szyprowski @ 2026-05-21 19:47 UTC (permalink / raw)
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

On 09.05.2026 20:07, Andrea Righi wrote:
> nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> additional rcu_read_lock/unlock() used around sched_domain accesses in
> this path is redundant. Rely on the existing IRQ-disabled context (and
> the rcu_dereference_all() checking) instead.
>
> The same applies to set_cpu_sd_state_idle(), called from the idle entry
> path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> teardown, which runs under cpus_write_lock(), so it cannot race with
> sched-domain rebuilds). In both cases the rcu_dereference_all()
> validation is sufficient.
>
> No functional change intended.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
This patch landed in today's linux-next as commit c9d93a73ce87 ("sched/fair: Drop
redundant RCU read lock in NOHZ kick path"). In my tests I found that it introduced
the following warning during the CPU hot-plug tests:


root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done

=============================
WARNING: suspicious RCU usage
7.1.0-rc2+ #12775 Not tainted
-----------------------------
kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cpuhp/1/20:
 #0: ffffffff81a16220 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
 #1: ffffffff81a16270 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae

stack backtrace:
CPU: 1 UID: 0 PID: 20 Comm: cpuhp/1 Not tainted 7.1.0-rc2+ #12775 PREEMPTLAZY
Hardware name: StarFive VisionFive 2 v1.2A (DT)
Call Trace:
[<ffffffff8001827c>] dump_backtrace+0x1c/0x24
[<ffffffff800014c0>] show_stack+0x28/0x34
[<ffffffff80010d42>] dump_stack_lvl+0x5e/0x86
[<ffffffff80010d7e>] dump_stack+0x14/0x1c
[<ffffffff800987ec>] lockdep_rcu_suspicious+0x14c/0x1b8
[<ffffffff80079992>] nohz_balance_exit_idle+0xf4/0xf6
[<ffffffff800664e6>] sched_cpu_deactivate+0x6c/0x1c8
[<ffffffff8002a5d0>] cpuhp_invoke_callback+0xf8/0x1ce
[<ffffffff8002a944>] cpuhp_thread_fun+0x150/0x1ae
[<ffffffff8005dc64>] smpboot_thread_fn+0x138/0x2a4
[<ffffffff800554ae>] kthread+0xea/0x10c
[<ffffffff800134c4>] ret_from_fork_kernel+0x22/0x386
[<ffffffff80c278ee>] ret_from_fork_kernel_asm+0x16/0x18
CPU1: off
CPU2: off
CPU3: off

This issue is observed on most of my 32-bit ARM, 64-bit ARM and 64-bit RISC-V based boards.


> ---
>  kernel/sched/fair.c | 38 +++++++++++---------------------------
>  1 file changed, 11 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f9823..6b059ee80b631 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12785,8 +12785,6 @@ static void nohz_balancer_kick(struct rq *rq)
>  		goto out;
>  	}
>  
> -	rcu_read_lock();
> -
>  	sd = rcu_dereference_all(rq->sd);
>  	if (sd) {
>  		/*
> @@ -12794,8 +12792,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * capacity, kick the ILB to see if there's a better CPU to run on:
>  		 */
>  		if (rq->cfs.h_nr_runnable >= 1 && check_cpu_capacity(rq, sd)) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +			goto out;
>  		}
>  	}
>  
> @@ -12811,8 +12809,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 */
>  		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
>  			if (sched_asym(sd, i, cpu)) {
> -				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -				goto unlock;
> +				flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> +				goto out;
>  			}
>  		}
>  	}
> @@ -12823,10 +12821,8 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * When ASYM_CPUCAPACITY; see if there's a higher capacity CPU
>  		 * to run the misfit task on.
>  		 */
> -		if (check_misfit_status(rq)) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> -		}
> +		if (check_misfit_status(rq))
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>  
>  		/*
>  		 * For asymmetric systems, we do not want to nicely balance
> @@ -12835,7 +12831,7 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 *
>  		 * Skip the LLC logic because it's not relevant in that case.
>  		 */
> -		goto unlock;
> +		goto out;
>  	}
>  
>  	sds = rcu_dereference_all(per_cpu(sd_llc_shared, cpu));
> @@ -12850,13 +12846,9 @@ static void nohz_balancer_kick(struct rq *rq)
>  		 * like this LLC domain has tasks we could move.
>  		 */
>  		nr_busy = atomic_read(&sds->nr_busy_cpus);
> -		if (nr_busy > 1) {
> -			flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
> -			goto unlock;
> -		}
> +		if (nr_busy > 1)
> +			flags |= NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
>  	}
> -unlock:
> -	rcu_read_unlock();
>  out:
>  	if (READ_ONCE(nohz.needs_update))
>  		flags |= NOHZ_NEXT_KICK;
> @@ -12868,17 +12860,13 @@ static void nohz_balancer_kick(struct rq *rq)
>  static void set_cpu_sd_state_busy(int cpu)
>  {
>  	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>  
>  	if (!sd || !sd->nohz_idle)
> -		goto unlock;
> +		return;
>  	sd->nohz_idle = 0;
>  
>  	atomic_inc(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>  }
>  
>  void nohz_balance_exit_idle(struct rq *rq)
> @@ -12897,17 +12885,13 @@ void nohz_balance_exit_idle(struct rq *rq)
>  static void set_cpu_sd_state_idle(int cpu)
>  {
>  	struct sched_domain *sd;
> -
> -	rcu_read_lock();
>  	sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
>  
>  	if (!sd || sd->nohz_idle)
> -		goto unlock;
> +		return;
>  	sd->nohz_idle = 1;
>  
>  	atomic_dec(&sd->shared->nr_busy_cpus);
> -unlock:
> -	rcu_read_unlock();
>  }
>  
>  /*

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path
  2026-05-21 19:47   ` Marek Szyprowski
@ 2026-05-21 20:13     ` Andrea Righi
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Righi @ 2026-05-21 20:13 UTC (permalink / raw)
  To: Marek Szyprowski
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, Phil Auld,
	Koba Ko, Felix Abecassis, Balbir Singh, Joel Fernandes,
	Shrikanth Hegde, linux-kernel

Hi Marek,

On Thu, May 21, 2026 at 09:47:03PM +0200, Marek Szyprowski wrote:
> On 09.05.2026 20:07, Andrea Righi wrote:
> > nohz_balancer_kick() is reached from sched_balance_trigger(), which is
> > called from sched_tick(). sched_tick() runs with IRQs disabled, so the
> > additional rcu_read_lock/unlock() used around sched_domain accesses in
> > this path is redundant. Rely on the existing IRQ-disabled context (and
> > the rcu_dereference_all() checking) instead.
> >
> > The same applies to set_cpu_sd_state_idle(), called from the idle entry
> > path with IRQs disabled, and to set_cpu_sd_state_busy(), reachable via
> > nohz_balance_exit_idle() from two contexts: nohz_balancer_kick() (IRQs
> > disabled, as above) and sched_cpu_deactivate() (the CPUHP_AP_ACTIVE
> > teardown, which runs under cpus_write_lock(), so it cannot race with
> > sched-domain rebuilds). In both cases the rcu_dereference_all()
> > validation is sufficient.
> >
> > No functional change intended.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> This patch landed in today's linux-next as commit c9d93a73ce87 ("sched/fair: Drop
> redundant RCU read lock in NOHZ kick path"). In my tests I found that it introduced
> the following warning during the CPU hot-plug tests:
> 
> 
> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
> 
> =============================
> WARNING: suspicious RCU usage
> 7.1.0-rc2+ #12775 Not tainted
> -----------------------------
> kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 2 locks held by cpuhp/1/20:
>  #0: ffffffff81a16220 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
>  #1: ffffffff81a16270 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae
> 
> stack backtrace:
> CPU: 1 UID: 0 PID: 20 Comm: cpuhp/1 Not tainted 7.1.0-rc2+ #12775 PREEMPTLAZY
> Hardware name: StarFive VisionFive 2 v1.2A (DT)
> Call Trace:
> [<ffffffff8001827c>] dump_backtrace+0x1c/0x24
> [<ffffffff800014c0>] show_stack+0x28/0x34
> [<ffffffff80010d42>] dump_stack_lvl+0x5e/0x86
> [<ffffffff80010d7e>] dump_stack+0x14/0x1c
> [<ffffffff800987ec>] lockdep_rcu_suspicious+0x14c/0x1b8
> [<ffffffff80079992>] nohz_balance_exit_idle+0xf4/0xf6
> [<ffffffff800664e6>] sched_cpu_deactivate+0x6c/0x1c8
> [<ffffffff8002a5d0>] cpuhp_invoke_callback+0xf8/0x1ce
> [<ffffffff8002a944>] cpuhp_thread_fun+0x150/0x1ae
> [<ffffffff8005dc64>] smpboot_thread_fn+0x138/0x2a4
> [<ffffffff800554ae>] kthread+0xea/0x10c
> [<ffffffff800134c4>] ret_from_fork_kernel+0x22/0x386
> [<ffffffff80c278ee>] ret_from_fork_kernel_asm+0x16/0x18
> CPU1: off
> CPU2: off
> CPU3: off
> 
> This issue is observed on most of my 32-bit ARM, 64-bit ARM and 64-bit RISC-V based boards.
> 

Ah, yes, makes sense. We missed the CPU hotplug case. When CPUs are taken
offline, set_cpu_sd_state_busy() is invoked via:

    cpuhp/N kthread
      cpuhp_thread_fun()
        cpuhp_invoke_callback()
          sched_cpu_deactivate()
            nohz_balance_exit_idle()
              set_cpu_sd_state_busy()
                rcu_dereference_all(per_cpu(sd_llc, cpu))

The cpuhp kthread holds cpu_hotplug_lock, but runs with preemption and IRQs
enabled. I think we should just restore the RCU read lock in
set_cpu_sd_state_{busy,idle}() to fix this. I'll send a patch soon.

Thanks,
-Andrea
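
For illustration, the failure and the proposed fix can be modelled in
userspace. This is a rough sketch, not the actual follow-up patch: every
model_* name, set_busy_unlocked() and set_busy_locked() are hypothetical
stand-ins, the structs keep only the fields used here, and model_deref_all()
merely counts where lockdep would print "suspicious RCU usage".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state; the cpuhp kthread runs with all of these clear. */
static int  model_rcu_depth;		/* explicit RCU read-side nesting */
static bool model_irqs_disabled;
static bool model_preempt_disabled;
static int  model_rcu_warnings;		/* would-be lockdep splats */

static void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);
	model_rcu_depth--;
}

/* Reduced stand-ins for the scheduler structures touched by the patch. */
struct sched_domain_shared {
	int nr_busy_cpus;
};

struct sched_domain {
	int nohz_idle;
	struct sched_domain_shared *shared;
};

/* Mimics rcu_dereference_all(): complain if no read-side protection holds. */
static struct sched_domain *model_deref_all(struct sched_domain *sd)
{
	if (!model_rcu_depth && !model_irqs_disabled && !model_preempt_disabled)
		model_rcu_warnings++;
	return sd;
}

/* Shape of set_cpu_sd_state_busy() after the patch: no explicit lock. */
static void set_busy_unlocked(struct sched_domain *llc)
{
	struct sched_domain *sd = model_deref_all(llc);

	if (!sd || !sd->nohz_idle)
		return;
	sd->nohz_idle = 0;
	sd->shared->nr_busy_cpus++;
}

/* Shape with the RCU read lock restored, as proposed above. */
static void set_busy_locked(struct sched_domain *llc)
{
	struct sched_domain *sd;

	model_rcu_read_lock();
	sd = model_deref_all(llc);
	if (!sd || !sd->nohz_idle)
		goto unlock;
	sd->nohz_idle = 0;
	sd->shared->nr_busy_cpus++;
unlock:
	model_rcu_read_unlock();
}
```

Calling set_busy_unlocked() with the model flags left clear (the hotplug
context) bumps model_rcu_warnings, while set_busy_locked() does not, and both
perform the same state update.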


Thread overview: 34+ messages
2026-04-28 14:41 [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-28 14:41 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
2026-04-28 16:29   ` K Prateek Nayak
2026-04-29 16:07     ` [PATCH v2 " Andrea Righi
2026-05-05  9:15   ` [PATCH " Dietmar Eggemann
2026-05-05  9:22     ` Andrea Righi
2026-04-28 14:41 ` [PATCH 2/5] sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity Andrea Righi
2026-05-05 12:48   ` Dietmar Eggemann
2026-05-06  9:45   ` Vincent Guittot
2026-05-06 10:19     ` K Prateek Nayak
2026-05-06 10:30       ` Vincent Guittot
2026-04-28 14:41 ` [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-05-05 17:20   ` Dietmar Eggemann
2026-05-06 18:31     ` Andrea Righi
2026-05-06 10:29   ` Vincent Guittot
2026-05-06 12:34     ` Vincent Guittot
2026-05-06 18:15     ` Andrea Righi
2026-04-28 14:41 ` [PATCH 4/5] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-04-28 14:41 ` [PATCH 5/5] sched/fair: Add SIS_UTIL support to select_idle_capacity() Andrea Righi
2026-05-06 12:59   ` Vincent Guittot
2026-05-06 17:01     ` Dietmar Eggemann
2026-05-06 18:11       ` Andrea Righi
2026-05-07  6:47         ` Vincent Guittot
2026-05-08 14:49           ` Dietmar Eggemann
2026-05-08 22:05             ` Andrea Righi
2026-05-05 20:40 ` [PATCH v5 0/5] sched/fair: SMT-aware asymmetric CPU capacity Dietmar Eggemann
  -- strict thread matches above, loose matches on Subject: below --
2026-05-09 18:01 Andrea Righi
2026-05-09 18:01 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
2026-05-09 18:07 [PATCH v6 0/5 RESEND] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-05-09 18:07 ` [PATCH 1/5] sched/fair: Drop redundant RCU read lock in NOHZ kick path Andrea Righi
2026-05-11 13:04   ` Vincent Guittot
2026-05-15  6:49   ` Shrikanth Hegde
2026-05-16  5:45     ` Andrea Righi
2026-05-16 17:15       ` Shrikanth Hegde
2026-05-21 19:47   ` Marek Szyprowski
2026-05-21 20:13     ` Andrea Righi
