The Linux Kernel Mailing List
* [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
From: Andrea Righi @ 2026-06-30 15:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Ricardo Neri,
	Christian Loehle, Shrikanth Hegde, Felix Abecassis,
	Joel Fernandes, Phil Auld, linux-kernel

select_idle_capacity() scans all logical CPUs even when it is looking
for a fully idle SMT core. Two concurrent wakeups can therefore observe
the same core as idle, encounter different siblings first, and place one
task on each sibling while another core remains unused.

Make every logical CPU of a selected idle core resolve to the same
stable CPU representative within the scan's existing affinity and
scheduling-domain mask. If the first task is enqueued before the next
scan examines the core, that scan rejects the now-busy core. If both
scans observe the core as idle, they select the same runqueue even if
the first enqueue becomes visible before the second scan finishes,
exposing the imbalance to the load balancer.

The symmetric-capacity idle CPU selection path is subject to the same
race, but it normally returns as soon as select_idle_core() finds a
fully idle core, which narrows the conflict window. The per-CPU capacity
scan can retain an idle-core candidate while evaluating other CPUs,
giving concurrent wakeups more opportunity to select different siblings
of the same SMT core. Therefore, limit the normalization to the
asym-capacity path, where this behavior has a measurable impact.

On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
core) showed a consistent 23% increase in mean throughput across
multiple runs.

For comparison, DCPerf MediaWiki running at system saturation (with all
SMT siblings busy) showed neither a benefit nor a regression: throughput
and Nginx request latency remained within measurement error.

Likewise, schbench under partially idle conditions showed no material
change in wakeup latency, request latency, or throughput (within 0.1%).
Tail wakeup latency was more consistent across runs with this change
applied.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
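For illustration only, the race and the normalization described in the
changelog can be sketched in user space as below. This is a minimal
sketch under stated assumptions, not kernel code: the 8-CPU topology,
the bitmask helpers smt_mask() and first_cpu(), and the names
scan_idle_core(), core_representative() and demo() are hypothetical and
exist only for this example. Only the shape of core_representative()
follows select_idle_core_cpu() from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* assumption: 4 cores x 2 SMT threads */

/* Assumed topology: CPUs 2n and 2n+1 are the siblings of core n. */
static uint32_t smt_mask(int cpu)
{
	return 0x3u << (cpu & ~1);
}

/* Lowest CPU set in @mask, or NR_CPUS if @mask is empty. */
static int first_cpu(uint32_t mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* Same shape as select_idle_core_cpu(): lowest sibling allowed by @cpus. */
static int core_representative(int cpu, uint32_t cpus)
{
	int sibling = first_cpu(smt_mask(cpu) & cpus);

	return sibling < NR_CPUS ? sibling : cpu;
}

/* Wrapping scan from @target: first allowed CPU whose whole core is idle. */
static int scan_idle_core(uint32_t idle, uint32_t cpus, int target)
{
	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (target + i) % NR_CPUS;
		uint32_t core = smt_mask(cpu);

		if ((cpus & (1u << cpu)) && (idle & core) == core)
			return cpu;
	}
	return -1;
}

/* Cores 1 (CPUs 2,3) and 3 (CPUs 6,7) are idle; every CPU is allowed. */
static void demo(void)
{
	uint32_t idle = 0xCC, cpus = 0xFF;
	int a = scan_idle_core(idle, cpus, 0);	/* wakeup A starts at CPU 0 */
	int b = scan_idle_core(idle, cpus, 3);	/* wakeup B starts at CPU 3 */

	/* Without normalization: two siblings of core 1, core 3 unused. */
	assert(a != b && smt_mask(a) == smt_mask(b));

	/* With normalization: both wakeups resolve to the same runqueue. */
	assert(core_representative(a, cpus) == core_representative(b, cpus));
}
```

In this model both scans see the two idle cores at the same instant,
which corresponds to the "both scans observe the core as idle" case in
the changelog. Restricting cpus (for example to 0xFB, which excludes
CPU 2) moves the representative of core 1 to CPU 3, so the result stays
within the affinity mask of the scan.
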
 kernel/sched/fair.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee13..f846fbe7379f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8647,6 +8647,16 @@ enum asym_fits_state {
 	ASYM_IDLE_CORE_BIAS = -3,
 };
 
+/*
+ * Return a stable CPU representative of @cpu's SMT core within @cpus.
+ */
+static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
+{
+	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
+
+	return sibling < nr_cpu_ids ? sibling : cpu;
+}
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	 * collapses to the plain capacity scan.
 	 */
 	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
+	bool best_idle_core = false;
 	unsigned long task_util, util_min, util_max, best_cap = 0;
 	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
 	int cpu, best_cpu = -1;
@@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	}
 
 	for_each_cpu_wrap(cpu, cpus, target) {
-		bool preferred_core = !has_idle_core || is_core_idle(cpu);
+		bool idle_core = !sched_smt_active() || is_core_idle(cpu);
+		bool preferred_core = !has_idle_core || idle_core;
 		unsigned long cpu_cap = capacity_of(cpu);
 
 		/*
@@ -8709,7 +8721,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 * immediately.
 		 */
 		if (fits > 0 && preferred_core)
-			return cpu;
+			return idle_core ? select_idle_core_cpu(cpu, cpus) : cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
 		 * Look for the CPU with best capacity.
@@ -8750,6 +8762,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 			best_cap = cpu_cap;
 			best_cpu = cpu;
 			best_fits = fits;
+			best_idle_core = idle_core;
 		}
 	}
 
@@ -8765,6 +8778,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	 */
 	if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
 		set_idle_cores(target, false);
+	else if (best_idle_core)
+		best_cpu = select_idle_core_cpu(best_cpu, cpus);
 
 	return best_cpu;
 }
-- 
2.54.0

