[PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity

The Linux Kernel Mailing List

* [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
@ 2026-06-30 15:27 Andrea Righi
  2026-07-03  5:51 ` K Prateek Nayak
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Righi @ 2026-06-30 15:27 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Ricardo Neri,
	Christian Loehle, Shrikanth Hegde, Felix Abecassis,
	Joel Fernandes, Phil Auld, linux-kernel

select_idle_capacity() scans all logical CPUs even when it is looking
for a fully idle SMT core. Two concurrent wakeups can therefore observe
the same core as idle, encounter different siblings first, and place one
task on each sibling while another core remains unused.

Make every logical CPU of a selected idle core resolve to the same
stable CPU representative within the scan's existing affinity and
scheduling-domain mask. If the first task is enqueued before the next
scan examines the core, that scan rejects the now-busy core. If both
scans observe the core as idle, they select the same runqueue even if
the first enqueue becomes visible before the second scan finishes,
exposing the imbalance to the load balancer.

The symmetric CPU idle selection path is subject to the same race, but
normally returns as soon as select_idle_core() finds a fully idle core,
reducing the conflict window. The per-CPU capacity scan can retain an
idle-core candidate while evaluating other CPUs, giving concurrent
wakeups more opportunity to select different siblings of the same SMT
core. Therefore, limit the normalization to the asym-capacity path,
where this behavior has a measurable impact.

On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
core) showed a consistent 23% increase in mean throughput across
multiple runs.

For comparison, DCPerf MediaWiki running at system saturation (with all
SMT siblings busy) showed neither a benefit nor a regression: throughput
and Nginx request latency remained within measurement error.

Likewise, schbench under partially idle conditions showed no material
change in wakeup latency, request latency, or throughput (within 0.1%).
Tail wakeup latency was more consistent across runs with this change
applied.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/fair.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee13..f846fbe7379f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8647,6 +8647,16 @@ enum asym_fits_state {
 	ASYM_IDLE_CORE_BIAS = -3,
 };

+/*
+ * Return a stable CPU representative of @cpu's SMT core within @cpus.
+ */
+static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
+{
+	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
+
+	return sibling < nr_cpu_ids ? sibling : cpu;
+}
+
 /*
  * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
  * the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	 * collapses to the plain capacity scan.
 	 */
 	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
+	bool best_idle_core = false;
 	unsigned long task_util, util_min, util_max, best_cap = 0;
 	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
 	int cpu, best_cpu = -1;
@@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	}

 	for_each_cpu_wrap(cpu, cpus, target) {
-		bool preferred_core = !has_idle_core || is_core_idle(cpu);
+		bool idle_core = !sched_smt_active() || is_core_idle(cpu);
+		bool preferred_core = !has_idle_core || idle_core;
 		unsigned long cpu_cap = capacity_of(cpu);

 		/*
@@ -8709,7 +8721,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 		 * immediately.
 		 */
 		if (fits > 0 && preferred_core)
-			return cpu;
+			return idle_core ? select_idle_core_cpu(cpu, cpus) : cpu;
 		/*
 		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
 		 * Look for the CPU with best capacity.
@@ -8750,6 +8762,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 			best_cap = cpu_cap;
 			best_cpu = cpu;
 			best_fits = fits;
+			best_idle_core = idle_core;
 		}
 	}

@@ -8765,6 +8778,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 	 */
 	if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
 		set_idle_cores(target, false);
+	else if (best_idle_core)
+		best_cpu = select_idle_core_cpu(best_cpu, cpus);

 	return best_cpu;
 }
-- 
2.54.0


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-06-30 15:27 [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity Andrea Righi
@ 2026-07-03  5:51 ` K Prateek Nayak
  2026-07-03  9:40   ` Andrea Righi
  0 siblings, 1 reply; 11+ messages in thread
From: K Prateek Nayak @ 2026-07-03  5:51 UTC
  To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel, Julia Lawall

Hello Andrea,

On 6/30/2026 8:57 PM, Andrea Righi wrote:
> select_idle_capacity() scans all logical CPUs also when it is looking
> for a fully idle SMT core. Two concurrent wakeups can therefore observe
> the same core as idle, encounter different siblings first, and place one
> task on each sibling while another core remains unused.
> 
> Make every logical CPU of a selected idle core resolve to the same
> stable CPU representative within the scan's existing affinity and
> scheduling-domain mask. If the first task is enqueued before the next
> scan examines the core, that scan rejects the now-busy core. If both
> scans observe the core as idle, they select the same runqueue even if
> the first enqueue becomes visible before the second scan finishes,
> exposing the imbalance to the load balancer.
> 
> The symmetric CPU idle selection path is subject to the same race, but
> normally returns as soon as select_idle_core() finds a fully idle core,
> reducing the conflict window. The per-CPU capacity scan can retain an
> idle-core candidate while evaluating other CPUs, giving concurrent
> wakeups more opportunity to select different siblings of the same SMT
> core. Therefore, limit the normalization to the asym-capacity path,
> where this behavior has a measurable impact.
> 
> On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> core) showed a consistent 23% increase in mean throughput across
> multiple runs.

Interesting! This reads like active balance across cores is not aggressive
enough for this workload and, as a result, stacking somehow helps.

I would have expected balance within the core would trigger first and that
would just lead to the same scenario as both siblings being busy, but I
guess there is a higher order effect of stacking.

perf sched stats reports for this workload before and after
applying your patch may help to see what changes for the load
balancer to start doing better.

Could you check if something like this helps:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc6cd55f9d22..f50f12316dd3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13221,7 +13221,8 @@ imbalanced_active_balance(struct lb_env *env)
 	 * threads on a system with spare capacity
 	 */
 	if ((env->migration_type == migrate_task) &&
-	    (sd->nr_balance_failed > sd->cache_nice_tries+2))
+	    ((sd->groups->flags & SD_SHARE_CPUCAPACITY) ||
+	      sd->nr_balance_failed > sd->cache_nice_tries+2))
 		return 1;
 
 	return 0;
---

I'm assuming we have group_has_spare for the destination CPU and the
busy core appears as group_fully_busy or group_has_spare.
calculate_imbalance() will take the sibling_imbalance() path since we
are balancing amongst cores (SD_PREFER_SIBLING domain) and we get
"migrate_task" with imbalance of 1.

Then we single out a rq with a single task on it, but that requires
active balance, and need_active_balance() is too slow because
imbalanced_active_balance() bails out on cache_nice_tries, which
requires at least 3 failures. On a 176-CPU system, it can take upwards
of 176 ticks per retry, and with a 250Hz tick that time runs into
seconds, which might be too late.

I remember Julia had a similar problem where balancing was taking too
long and setting very aggressive "min_interval" and "max_interval" for
load balancing helped her. Maybe you can try that too:

    # Needed to make /sys/kernel/debug/sched/domains/* visible
    echo Y > /sys/kernel/debug/sched/verbose
    for i in /sys/kernel/debug/sched/domains/cpu*/domain[1-5]/*_interval; do echo 10 > "$i"; done
    echo N > /sys/kernel/debug/sched/verbose

This will ensure there is one balance every 10 ticks on domains above
SMT. You can try making it more aggressive to see if that helps too.

> 
> For comparison, DCPerf MediaWiki running at system saturation (with all
> SMT siblings busy) showed neither a benefit nor a regression: throughput
> and Nginx request latency remained within measurement error.
> 
> Likewise, schbench under partially idle conditions showed no material
> change in wakeup latency, request latency, or throughput (within 0.1%).
> Tail wakeup latency was more consistent across runs with this change
> applied.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/fair.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d78467ec6ee13..f846fbe7379f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8647,6 +8647,16 @@ enum asym_fits_state {
>  	ASYM_IDLE_CORE_BIAS = -3,
>  };
>  
> +/*
> + * Return a stable CPU representative of @cpu's SMT core within @cpus.
> + */
> +static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
> +{
> +	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
> +
> +	return sibling < nr_cpu_ids ? sibling : cpu;
> +}
> +
>  /*
>   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>   * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  	 * collapses to the plain capacity scan.
>  	 */
>  	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> +	bool best_idle_core = false;
>  	unsigned long task_util, util_min, util_max, best_cap = 0;
>  	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
>  	int cpu, best_cpu = -1;
> @@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  	}
>  
>  	for_each_cpu_wrap(cpu, cpus, target) {
> -		bool preferred_core = !has_idle_core || is_core_idle(cpu);
> +		bool idle_core = !sched_smt_active() || is_core_idle(cpu);
> +		bool preferred_core = !has_idle_core || idle_core;

Do you want to take the overhead of is_core_idle() for !has_idle_core too?
Wouldn't a simple:

    /* True iff has_idle_core was true and is_core_idle() returned true. */
    bool idle_core = !has_idle_core ^ preferred_core;

after computing preferred_core do just fine?

>  		unsigned long cpu_cap = capacity_of(cpu);
>  
>  		/*
> @@ -8709,7 +8721,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  		 * immediately.
>  		 */
>  		if (fits > 0 && preferred_core)
> -			return cpu;
> +			return idle_core ? select_idle_core_cpu(cpu, cpus) : cpu;
>  		/*
>  		 * Only the min performance hint (i.e. uclamp_min) doesn't fit.
>  		 * Look for the CPU with best capacity.
> @@ -8750,6 +8762,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  			best_cap = cpu_cap;
>  			best_cpu = cpu;
>  			best_fits = fits;
> +			best_idle_core = idle_core;
>  		}
>  	}
>  
> @@ -8765,6 +8778,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  	 */
>  	if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
>  		set_idle_cores(target, false);
> +	else if (best_idle_core)
> +		best_cpu = select_idle_core_cpu(best_cpu, cpus);
>  
>  	return best_cpu;
>  }

-- 
Thanks and Regards,
Prateek



* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03  5:51 ` K Prateek Nayak
@ 2026-07-03  9:40   ` Andrea Righi
  2026-07-03 10:00     ` Christian Loehle
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrea Righi @ 2026-07-03  9:40 UTC
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel, Julia Lawall

Hi Prateek,

On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
> 
> On 6/30/2026 8:57 PM, Andrea Righi wrote:
> > select_idle_capacity() scans all logical CPUs also when it is looking
> > for a fully idle SMT core. Two concurrent wakeups can therefore observe
> > the same core as idle, encounter different siblings first, and place one
> > task on each sibling while another core remains unused.
> > 
> > Make every logical CPU of a selected idle core resolve to the same
> > stable CPU representative within the scan's existing affinity and
> > scheduling-domain mask. If the first task is enqueued before the next
> > scan examines the core, that scan rejects the now-busy core. If both
> > scans observe the core as idle, they select the same runqueue even if
> > the first enqueue becomes visible before the second scan finishes,
> > exposing the imbalance to the load balancer.
> > 
> > The symmetric CPU idle selection path is subject to the same race, but
> > normally returns as soon as select_idle_core() finds a fully idle core,
> > reducing the conflict window. The per-CPU capacity scan can retain an
> > idle-core candidate while evaluating other CPUs, giving concurrent
> > wakeups more opportunity to select different siblings of the same SMT
> > core. Therefore, limit the normalization to the asym-capacity path,
> > where this behavior has a measurable impact.
> > 
> > On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> > CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> > core) showed a consistent 23% increase in mean throughput across
> > multiple runs.
> 
> Interesting! This reads like active balance across cores is not aggressive
> enough for this workload and, as a result, stacking somehow helps.
> 
> I would have expected balance within the core would trigger first and that
> > would just lead to the same scenario as both siblings being busy, but I
> guess there is a higher order effect of stacking.

I think the key here is that temporary runqueue stacking is preferable to
consuming both SMT siblings when fully-idle SMT cores are available, more than
having benefits from the stacking itself.

> 
> perf sched stats reports for this workload before and after
> applying your patch may help to see what changes for the load
> balancer to start doing better.

Ack, I'll collect some perf stats and share.

> 
> Could you check if something like this helps:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fc6cd55f9d22..f50f12316dd3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13221,7 +13221,8 @@ imbalanced_active_balance(struct lb_env *env)
>  	 * threads on a system with spare capacity
>  	 */
>  	if ((env->migration_type == migrate_task) &&
> -	    (sd->nr_balance_failed > sd->cache_nice_tries+2))
> +	    ((sd->groups->flags & SD_SHARE_CPUCAPACITY) ||
> +	      sd->nr_balance_failed > sd->cache_nice_tries+2))

I did a quick test and I don't see any significant difference with this applied.
Let's see if the perf stats tell us more.

>  		return 1;
>  
>  	return 0;
> ---
> 
> I'm assuming we have group_has_spare for the destination CPU and the
> busy core appears as group_fully_busy or group_has_spare.
> calculate_imbalance() will take the sibling_imbalance() path since we
> are balancing amongst cores (SD_PREFER_SIBLING domain) and we get
> "migrate_task" with imbalance of 1.
> 
> Then we single down on a rq with a single task on it but that requires
> active balance and need_active_balance() is too slow as a result of
> imbalanced_active_balance() bailout on cache_nice_tries which requires
> at least 3 failures and on a 176 CPUs system, it can take upwards of
> 176 ticks per retry and with 250Hz tick, that time goes into seconds
> which might be too late.
> 
> I remember Julia had similar problem where balancing was taking too
> long and setting very aggressive "min_interval" and "max_interval" for
> load balancing helped her. Maybe you can try that too:
> 
>     # Needed to toggle /sys/kernel/debug/sched/domains/* visible
>     echo Y > /sys/kernel/debug/sched/verbose
>     for i in /sys/kernel/debug/sched/domains/cpu*/domain[1-5]/*_interval; do echo 10 > $i; done
>     echo N > /sys/kernel/debug/sched/verbose
> 
> This will ensure there is one balance every 10 ticks on domains above
> SMT. You can try make it more aggressive to see if that helps too.

Tried this as well (both with the patched and unpatched kernels), also no
measurable difference.

> 
> > 
> > For comparison, DCPerf MediaWiki running at system saturation (with all
> > SMT siblings busy) showed neither a benefit nor a regression: throughput
> > and Nginx request latency remained within measurement error.
> > 
> > Likewise, schbench under partially idle conditions showed no material
> > change in wakeup latency, request latency, or throughput (within 0.1%).
> > Tail wakeup latency was more consistent across runs with this change
> > applied.
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/fair.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d78467ec6ee13..f846fbe7379f4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8647,6 +8647,16 @@ enum asym_fits_state {
> >  	ASYM_IDLE_CORE_BIAS = -3,
> >  };
> >  
> > +/*
> > + * Return a stable CPU representative of @cpu's SMT core within @cpus.
> > + */
> > +static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
> > +{
> > +	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
> > +
> > +	return sibling < nr_cpu_ids ? sibling : cpu;
> > +}
> > +
> >  /*
> >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  	 * collapses to the plain capacity scan.
> >  	 */
> >  	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> > +	bool best_idle_core = false;
> >  	unsigned long task_util, util_min, util_max, best_cap = 0;
> >  	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
> >  	int cpu, best_cpu = -1;
> > @@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  	}
> >  
> >  	for_each_cpu_wrap(cpu, cpus, target) {
> > -		bool preferred_core = !has_idle_core || is_core_idle(cpu);
> > +		bool idle_core = !sched_smt_active() || is_core_idle(cpu);
> > +		bool preferred_core = !has_idle_core || idle_core;
> 
> Do you want to take overhead of is_core_idle() for !has_idle_core too?
> Wouldn't a simple:
> 
>     /* True iff has_idle_core was true and is_core_idle() returned true. */
>     bool idle_core = !has_idle_core ^ preferred_core;
> 
> after computing preferred_core do just fine?

Ah yes, or maybe something like this, which looks a bit more readable:

  bool preferred_core = !has_idle_core || is_core_idle(cpu);
  bool idle_core = has_idle_core && preferred_core;

Thanks,
-Andrea


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03  9:40   ` Andrea Righi
@ 2026-07-03 10:00     ` Christian Loehle
  2026-07-03 14:52       ` Andrea Righi
  2026-07-03 11:20     ` Julia Lawall
  2026-07-03 12:33     ` Andrea Righi
  2 siblings, 1 reply; 11+ messages in thread
From: Christian Loehle @ 2026-07-03 10:00 UTC
  To: Andrea Righi, K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Ricardo Neri, Shrikanth Hegde,
	Felix Abecassis, Joel Fernandes, Phil Auld, linux-kernel,
	Julia Lawall

On 7/3/26 10:40, Andrea Righi wrote:
> Hi Prateek,
> 
> On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
>> Hello Andrea,
>>
>> On 6/30/2026 8:57 PM, Andrea Righi wrote:
>>> select_idle_capacity() scans all logical CPUs also when it is looking
>>> for a fully idle SMT core. Two concurrent wakeups can therefore observe
>>> the same core as idle, encounter different siblings first, and place one
>>> task on each sibling while another core remains unused.
>>>
>>> Make every logical CPU of a selected idle core resolve to the same
>>> stable CPU representative within the scan's existing affinity and
>>> scheduling-domain mask. If the first task is enqueued before the next
>>> scan examines the core, that scan rejects the now-busy core. If both
>>> scans observe the core as idle, they select the same runqueue even if
>>> the first enqueue becomes visible before the second scan finishes,
>>> exposing the imbalance to the load balancer.
>>>
>>> The symmetric CPU idle selection path is subject to the same race, but
>>> normally returns as soon as select_idle_core() finds a fully idle core,
>>> reducing the conflict window. The per-CPU capacity scan can retain an
>>> idle-core candidate while evaluating other CPUs, giving concurrent
>>> wakeups more opportunity to select different siblings of the same SMT
>>> core. Therefore, limit the normalization to the asym-capacity path,
>>> where this behavior has a measurable impact.
>>>
>>> On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
>>> CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
>>> core) showed a consistent 23% increase in mean throughput across
>>> multiple runs.
>>
>> Interesting! This reads like active balance across cores is not aggressive
>> enough for this workload and, as a result, stacking somehow helps.
>>
>> I would have expected balance within the core would trigger first and that
>> would just lead to the same scenario as both siblings being busy, but I
>> guess there is a higher order effect of stacking.
> 
> I think the key here is that temporary runqueue stacking is preferable to
> consuming both SMT siblings when fully-idle SMT cores are available, more than
> having benefits from the stacking itself.
> 
I'm curious now. As someone who is not at all an SMT expert, I find this
super counterintuitive. Am I missing something? How does this happen?
The SMT switch should have far less overhead than a task context switch, no?


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03  9:40   ` Andrea Righi
  2026-07-03 10:00     ` Christian Loehle
@ 2026-07-03 11:20     ` Julia Lawall
  2026-07-03 14:38       ` Andrea Righi
  2026-07-03 12:33     ` Andrea Righi
  2 siblings, 1 reply; 11+ messages in thread
From: Julia Lawall @ 2026-07-03 11:20 UTC
  To: Andrea Righi
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel, Julia Lawall, jean-pierre.lozi



On Fri, 3 Jul 2026, Andrea Righi wrote:

> Hi Prateek,
>
> On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 6/30/2026 8:57 PM, Andrea Righi wrote:
> > > select_idle_capacity() scans all logical CPUs also when it is looking
> > > for a fully idle SMT core. Two concurrent wakeups can therefore observe
> > > the same core as idle, encounter different siblings first, and place one
> > > task on each sibling while another core remains unused.
> > >
> > > Make every logical CPU of a selected idle core resolve to the same
> > > stable CPU representative within the scan's existing affinity and
> > > scheduling-domain mask. If the first task is enqueued before the next
> > > scan examines the core, that scan rejects the now-busy core. If both
> > > scans observe the core as idle, they select the same runqueue even if
> > > the first enqueue becomes visible before the second scan finishes,
> > > exposing the imbalance to the load balancer.
> > >
> > > The symmetric CPU idle selection path is subject to the same race, but
> > > normally returns as soon as select_idle_core() finds a fully idle core,
> > > reducing the conflict window. The per-CPU capacity scan can retain an
> > > idle-core candidate while evaluating other CPUs, giving concurrent
> > > wakeups more opportunity to select different siblings of the same SMT
> > > core. Therefore, limit the normalization to the asym-capacity path,
> > > where this behavior has a measurable impact.
> > >
> > > On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> > > CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> > > core) showed a consistent 23% increase in mean throughput across
> > > multiple runs.
> >
> > Interesting! This reads like active balance across cores is not aggressive
> > enough for this workload and, as a result, stacking somehow helps.
> >
> > I would have expected balance within the core would trigger first and that
> > > would just lead to the same scenario as both siblings being busy, but I
> > guess there is a higher order effect of stacking.
>
> I think the key here is that temporary runqueue stacking is preferable to
> consuming both SMT siblings when fully-idle SMT cores are available, more than
> having benefits from the stacking itself.

Andrea, did you try changing the clock speed?  With ticks every 4ms and an
EEVDF time slice that rounds up to 4ms, task_hot makes it almost
impossible for already-idle CPUs to pull tasks.

julia




>
> >
> > perf sched stats reports for this workload before and after
> > applying your patch may help to see what changes for the load
> > balancer to start doing better.
>
> Ack, I'll collect some perf stats and share.
>
> >
> > Could you check if something like this helps:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fc6cd55f9d22..f50f12316dd3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -13221,7 +13221,8 @@ imbalanced_active_balance(struct lb_env *env)
> >  	 * threads on a system with spare capacity
> >  	 */
> >  	if ((env->migration_type == migrate_task) &&
> > -	    (sd->nr_balance_failed > sd->cache_nice_tries+2))
> > +	    ((sd->groups->flags & SD_SHARE_CPUCAPACITY) ||
> > +	      sd->nr_balance_failed > sd->cache_nice_tries+2))
>
> I did a quick test and I don't see any significant difference with this applied.
> Let's see if the perf stats tell us more.
>
> >  		return 1;
> >
> >  	return 0;
> > ---
> >
> > I'm assuming we have group_has_spare for the destination CPU and the
> > busy core appears as group_fully_busy or group_has_spare.
> > calculate_imbalance() will take the sibling_imbalance() path since we
> > are balancing amongst cores (SD_PREFER_SIBLING domain) and we get
> > "migrate_task" with imbalance of 1.
> >
> > Then we single down on a rq with a single task on it but that requires
> > active balance and need_active_balance() is too slow as a result of
> > imbalanced_active_balance() bailout on cache_nice_tries which requires
> > at least 3 failures and on a 176 CPUs system, it can take upwards of
> > 176 ticks per retry and with 250Hz tick, that time goes into seconds
> > which might be too late.
> >
> > I remember Julia had similar problem where balancing was taking too
> > long and setting very aggressive "min_interval" and "max_interval" for
> > load balancing helped her. Maybe you can try that too:
> >
> >     # Needed to toggle /sys/kernel/debug/sched/domains/* visible
> >     echo Y > /sys/kernel/debug/sched/verbose
> >     for i in /sys/kernel/debug/sched/domains/cpu*/domain[1-5]/*_interval; do echo 10 > $i; done
> >     echo N > /sys/kernel/debug/sched/verbose
> >
> > This will ensure there is one balance every 10 ticks on domains above
> > SMT. You can try make it more aggressive to see if that helps too.
>
> Tried this as well (both with the patched and unpatched kernels), also no
> measurable difference.
>
> >
> > >
> > > For comparison, DCPerf MediaWiki running at system saturation (with all
> > > SMT siblings busy) showed neither a benefit nor a regression: throughput
> > > and Nginx request latency remained within measurement error.
> > >
> > > Likewise, schbench under partially idle conditions showed no material
> > > change in wakeup latency, request latency, or throughput (within 0.1%).
> > > Tail wakeup latency was more consistent across runs with this change
> > > applied.
> > >
> > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > ---
> > >  kernel/sched/fair.c | 19 +++++++++++++++++--
> > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d78467ec6ee13..f846fbe7379f4 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8647,6 +8647,16 @@ enum asym_fits_state {
> > >  	ASYM_IDLE_CORE_BIAS = -3,
> > >  };
> > >
> > > +/*
> > > + * Return a stable CPU representative of @cpu's SMT core within @cpus.
> > > + */
> > > +static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
> > > +{
> > > +	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
> > > +
> > > +	return sibling < nr_cpu_ids ? sibling : cpu;
> > > +}
> > > +
> > >  /*
> > >   * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > >   * the task fits. If no CPU is big enough, but there are idle ones, try to
> > > @@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >  	 * collapses to the plain capacity scan.
> > >  	 */
> > >  	bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> > > +	bool best_idle_core = false;
> > >  	unsigned long task_util, util_min, util_max, best_cap = 0;
> > >  	int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
> > >  	int cpu, best_cpu = -1;
> > > @@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >  	}
> > >
> > >  	for_each_cpu_wrap(cpu, cpus, target) {
> > > -		bool preferred_core = !has_idle_core || is_core_idle(cpu);
> > > +		bool idle_core = !sched_smt_active() || is_core_idle(cpu);
> > > +		bool preferred_core = !has_idle_core || idle_core;
> >
> > Do you want to take overhead of is_core_idle() for !has_idle_core too?
> > Wouldn't a simple:
> >
> >     /* True iff has_idle_core was true and is_core_idle() returned true. */
> >     bool idle_core = !has_idle_core ^ preferred_core;
> >
> > after computing preferred_core do just fine?
>
> Ah yes, or maybe something like this, which looks a bit more readable:
>
>   bool preferred_core = !has_idle_core || is_core_idle(cpu);
>   bool idle_core = has_idle_core && preferred_core;
>
> Thanks,
> -Andrea
>
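
The two spellings of idle_core discussed above can be checked against each
other exhaustively. This is a small userspace sketch, not kernel code: the
helper names are hypothetical, and core_idle stands in for the result of
is_core_idle(cpu), which the kernel only evaluates when has_idle_core is true
because of short-circuiting.

```c
#include <stdbool.h>

/* XOR form: true iff has_idle_core was true and the core is idle. */
static bool idle_core_xor(bool has_idle_core, bool core_idle)
{
	bool preferred_core = !has_idle_core || core_idle;

	return !has_idle_core ^ preferred_core;
}

/* AND form: same condition, written more directly. */
static bool idle_core_and(bool has_idle_core, bool core_idle)
{
	bool preferred_core = !has_idle_core || core_idle;

	return has_idle_core && preferred_core;
}

/* Returns true when both forms agree on all four input combinations. */
static bool forms_agree(void)
{
	for (int h = 0; h <= 1; h++)
		for (int c = 0; c <= 1; c++)
			if (idle_core_xor(h, c) != idle_core_and(h, c))
				return false;
	return true;
}
```

Both forms reduce to has_idle_core && core_idle, so neither adds an
is_core_idle() call on the !has_idle_core path.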


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03  9:40   ` Andrea Righi
  2026-07-03 10:00     ` Christian Loehle
  2026-07-03 11:20     ` Julia Lawall
@ 2026-07-03 12:33     ` Andrea Righi
  2026-07-03 12:51       ` Julia Lawall
  2 siblings, 1 reply; 11+ messages in thread
From: Andrea Righi @ 2026-07-03 12:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel, Julia Lawall

On Fri, Jul 03, 2026 at 11:40:28AM +0200, Andrea Righi wrote:
...
> > > On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> > > CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> > > core) showed a consistent 23% increase in mean throughput across
> > > multiple runs.
> > 
> > Interesting! This reads like active balance across cores is not aggressive
> > enough for this workload and, as a result, stacking somehow helps.
> > 
> > I would have expected balance within the core would trigger first and that
> > would just lead to the same scenario as both siblings busy but I
> > guess there is a higher order effect of stacking.
> 
> I think the key here is that temporary runqueue stacking is preferable to
> consuming both SMT siblings when fully-idle SMT cores are available, more than
> having benefits from the stacking itself.
> 
> > 
> > perf sched stats reports for this workload before and after
> > applying your patch may help to see what changes for the load
> > balancer to start doing better.
> 
> Ack, I'll collect some perf stats and share.
> 

I collected some perf sched stats diff with mainline vs patched kernel, here's a
quick recap of the benchmark results + stats (I can also share all the detailed
stats if you prefer):
                                 mainline    patched
  elapsed jiffies                  17472      13808
  average GFLOP/s                6297.62    8423.60
  sched_yield calls               11.47M      4.47M
  run delay / runtime              0.20%      0.31%
  timeslices                         168        562

  Across SMT, MC and NUMA domains:
    *_lb_gained                       0          0
    alb_pushed                        0          0
    ttwu_move_balance                 0          0

The schedstat comparison doesn't show the load balancer moving any potential
stacked tasks: *_lb_gained, alb_pushed and ttwu_move_balance remain 0 across the
domains. So the gain doesn't come from post-wakeup balancing.

The only clear difference is that the sched_yield() rate (calls per elapsed
jiffy) drops by approximately 51%. This might explain the speedup, but the stats
don't expose the CPU selected by select_idle_capacity(), so they can't directly
prove that the placement was beneficial. I'll collect more stats.

Thanks,
-Andrea


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03 12:33     ` Andrea Righi
@ 2026-07-03 12:51       ` Julia Lawall
  0 siblings, 0 replies; 11+ messages in thread
From: Julia Lawall @ 2026-07-03 12:51 UTC (permalink / raw)
  To: Andrea Righi
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel



> On 3 Jul 2026, at 08:34, Andrea Righi <arighi@nvidia.com> wrote:
> 
> On Fri, Jul 03, 2026 at 11:40:28AM +0200, Andrea Righi wrote:
> ...
>>>> On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
>>>> CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
>>>> core) showed a consistent 23% increase in mean throughput across
>>>> multiple runs.
>>> 
>>> Interesting! This reads like active balance across cores is not aggressive
>>> enough for this workload and, as a result, stacking somehow helps.
>>> 
>>> I would have expected balance within the core would trigger first and that
>>> would just lead to the same scenario as both siblings busy but I
>>> guess there is a higher order effect of stacking.
>> 
>> I think the key here is that temporary runqueue stacking is preferable to
>> consuming both SMT siblings when fully-idle SMT cores are available, more than
>> having benefits from the stacking itself.
>> 
>>> 
>>> perf sched stats reports for this workload before and after
>>> applying your patch may help to see what changes for the load
>>> balancer to start doing better.
>> 
>> Ack, I'll collect some perf stats and share.
>> 
> 
> I collected some perf sched stats diff with mainline vs patched kernel, here's a
> quick recap of the benchmark results + stats (I can also share all the detailed
> stats if you prefer):
>                                 mainline    patched
>  elapsed jiffies                  17472      13808
>  average GFLOP/s                6297.62    8423.60
>  sched_yield calls               11.47M      4.47M
>  run delay / runtime              0.20%      0.31%
>  timeslices                         168        562
> 
>  Across SMT, MC and NUMA domains:
>    *_lb_gained                       0          0
>    alb_pushed                        0          0
>    ttwu_move_balance                 0          0
> 
> The schedstat comparison doesn't show the load balancer moving any potential
> stacked tasks: *_lb_gained, alb_pushed and ttwu_move_balance remain 0 across the
> domains. So the gain doesn't come from post-wakeup balancing.
> 
> The only clear difference is that the sched_yield() rate (calls per elapsed
> jiffy) drops by approximately 51%. This might explain the speedup, but the stats
> don't expose the CPU selected by select_idle_capacity(), so they can't directly
> prove that the placement was beneficial. I'll collect more stats.

Maybe look at a trace with Perfetto?

> 
> Thanks,
> -Andrea



* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03 11:20     ` Julia Lawall
@ 2026-07-03 14:38       ` Andrea Righi
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-07-03 14:38 UTC (permalink / raw)
  To: Julia Lawall
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Christian Loehle,
	Shrikanth Hegde, Felix Abecassis, Joel Fernandes, Phil Auld,
	linux-kernel, jean-pierre.lozi

Hi Julia,

On Fri, Jul 03, 2026 at 07:20:38AM -0400, Julia Lawall wrote:
> On Fri, 3 Jul 2026, Andrea Righi wrote:
> 
> > Hi Prateek,
> >
> > On Fri, Jul 03, 2026 at 11:21:57AM +0530, K Prateek Nayak wrote:
> > > Hello Andrea,
> > >
> > > On 6/30/2026 8:57 PM, Andrea Righi wrote:
> > > > select_idle_capacity() scans all logical CPUs also when it is looking
> > > > for a fully idle SMT core. Two concurrent wakeups can therefore observe
> > > > the same core as idle, encounter different siblings first, and place one
> > > > task on each sibling while another core remains unused.
> > > >
> > > > Make every logical CPU of a selected idle core resolve to the same
> > > > stable CPU representative within the scan's existing affinity and
> > > > scheduling-domain mask. If the first task is enqueued before the next
> > > > scan examines the core, that scan rejects the now-busy core. If both
> > > > scans observe the core as idle, they select the same runqueue even if
> > > > the first enqueue becomes visible before the second scan finishes,
> > > > exposing the imbalance to the load balancer.
> > > >
> > > > The symmetric CPU idle selection path is subject to the same race, but
> > > > normally returns as soon as select_idle_core() finds a fully idle core,
> > > > reducing the conflict window. The per-CPU capacity scan can retain an
> > > > idle-core candidate while evaluating other CPUs, giving concurrent
> > > > wakeups more opportunity to select different siblings of the same SMT
> > > > core. Therefore, limit the normalization to the asym-capacity path,
> > > > where this behavior has a measurable impact.
> > > >
> > > > On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
> > > > CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
> > > > core) showed a consistent 23% increase in mean throughput across
> > > > multiple runs.
> > >
> > > Interesting! This reads like active balance across cores is not aggressive
> > > enough for this workload and, as a result, stacking somehow helps.
> > >
> > > I would have expected balance within the core would trigger first and that
> > > > would just lead to the same scenario as both siblings busy but I
> > > guess there is a higher order effect of stacking.
> >
> > I think the key here is that temporary runqueue stacking is preferable to
> > consuming both SMT siblings when fully-idle SMT cores are available, more than
> > > having benefits from the stacking itself.
> 
> Andrea, did you try changing the clock speed?  With ticks every 4ms and an
> EEVDF time slice that rounds up to 4ms, task_hot makes it almost
> impossible for already-idle CPUs to pull tasks.
> 
> julia

Oh I remember you mentioned this. However, the kernel that I'm using has
CONFIG_HZ_1000=y, so the scheduler tick is 1 ms rather than 4 ms. I tried to
play a bit with different migration_cost_ns settings, but didn't get much
benefit from that.

I think I have a lead, and the observed improvement with this patch may not
be a scheduler/load-balancing effect. In practice, I see different performance
on sibling 0 vs sibling 1: apparently sibling 0 is faster, despite the firmware
advertising identical capacity. So I think my patch is helping mostly due to the
fact that I'm using cpumask_first_and(), more than the aggressive SMT avoidance.

I'm trying to get more details from the hw/firmware. Will keep you updated.

Thanks,
-Andrea


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03 10:00     ` Christian Loehle
@ 2026-07-03 14:52       ` Andrea Righi
  2026-07-03 16:54         ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Righi @ 2026-07-03 14:52 UTC (permalink / raw)
  To: Christian Loehle
  Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Shrikanth Hegde,
	Felix Abecassis, Joel Fernandes, Phil Auld, linux-kernel,
	Julia Lawall

Hi Christian,

On Fri, Jul 03, 2026 at 11:00:23AM +0100, Christian Loehle wrote:
...
> > I think the key here is that temporary runqueue stacking is preferable to
> > consuming both SMT siblings when fully-idle SMT cores are available, more than
> > having benefits from the stacking itself.
> > 
> I'm curious now, as a not at all SMT expert, this is super counterintuitive to me,
> am I missing something? How does this happen?
> The SMT-switch should be way less overhead than the task context-switch, no?

As mentioned in my other email, I found a surprising asymmetry on this machine:
pinning one worker per core to the first SMT siblings gives substantially better
performance than pinning them to the second siblings, despite firmware
advertising identical capacity and frequency for both.

Since this change uses cpumask_first_and() as the stable representative, it also
strongly biases placement toward the faster first siblings. That may explain
much of the observed improvement independently of whether temporary stacking
helps the load balancer.
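
To make the bias concrete, here is a userspace model of the representative
selection, assuming CPUs fit in a 64-bit mask. first_and() only imitates what
cpumask_first_and() does in the patch, and core_representative(),
MODEL_NR_CPUS and the example masks are hypothetical names and values chosen
for illustration:

```c
#include <stdint.h>

#define MODEL_NR_CPUS 64

/* Lowest CPU set in both masks, or MODEL_NR_CPUS if they do not intersect. */
static int first_and(uint64_t a, uint64_t b)
{
	uint64_t m = a & b;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (m & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/*
 * Model of select_idle_core_cpu(): every sibling in smt_mask maps to the
 * lowest-numbered sibling that is also allowed by cpus; fall back to cpu
 * itself when the intersection is empty.
 */
static int core_representative(int cpu, uint64_t smt_mask, uint64_t cpus)
{
	int sibling = first_and(smt_mask, cpus);

	return sibling < MODEL_NR_CPUS ? sibling : cpu;
}
```

With a core whose siblings are CPUs 4 and 5 (smt_mask 0x30) and no affinity
restriction, both siblings resolve to CPU 4, so the lower-numbered sibling
always wins whenever it is in the scan mask. Real sibling numbering depends on
the platform's enumeration order and need not be adjacent.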

I haven't yet established that stacking two tasks on sibling 0 is better than
running one task on each sibling simultaneously. Also, the latter is not really
an SMT “switch”: both threads run concurrently and compete for shared execution
and memory resources, whereas stacking involves normal scheduler time-sharing
and context switches.

Once I figure out exactly why this machine has SMT asymmetry, I'll share the
details. :)

Thanks,
-Andrea


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03 14:52       ` Andrea Righi
@ 2026-07-03 16:54         ` Peter Zijlstra
  2026-07-03 17:07           ` Andrea Righi
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2026-07-03 16:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Christian Loehle, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Shrikanth Hegde,
	Felix Abecassis, Joel Fernandes, Phil Auld, linux-kernel,
	Julia Lawall

On Fri, Jul 03, 2026 at 04:52:17PM +0200, Andrea Righi wrote:

> As mentioned in my other email, I found a surprising asymmetry on this machine:
> pinning one worker per core to the first SMT siblings gives substantially better
> performance than pinning them to the second siblings, despite firmware
> advertising identical capacity and frequency for both.

Cute, that's something that Power7 also had. That's where
SD_ASYM_PACKING originated from.


* Re: [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
  2026-07-03 16:54         ` Peter Zijlstra
@ 2026-07-03 17:07           ` Andrea Righi
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-07-03 17:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Loehle, K Prateek Nayak, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Ricardo Neri, Shrikanth Hegde,
	Felix Abecassis, Joel Fernandes, Phil Auld, linux-kernel,
	Julia Lawall

On Fri, Jul 03, 2026 at 06:54:14PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 03, 2026 at 04:52:17PM +0200, Andrea Righi wrote:
> 
> > As mentioned in my other email, I found a surprising asymmetry on this machine:
> > pinning one worker per core to the first SMT siblings gives substantially better
> > performance than pinning them to the second siblings, despite firmware
> > advertising identical capacity and frequency for both.
> 
> Cute, that's something that Power7 also had. That's where
> SD_ASYM_PACKING originated from.

Yep, I'm actually experimenting with a patch that mimics the Power7 approach and
it seems to work. :) But I'm using a quirk to detect the particular CPU
implementation to set SD_ASYM_PACKING on the SMT domain and assign a higher
arch_asym_cpu_priority() to the first sibling, which is not the best...

So I'm checking with the firmware folks whether they can expose the relative SMT
thread priority explicitly, so that the kernel can discover the asymmetry and
the preferred sibling, instead of relying on CPU type and enumeration order
(considering that there are also multiple SMT configurations that can alter this
asymmetry... it's not just SMT on/off).

That said, we can ignore this patch for now. I'll come up with a better
solution, hopefully.

Thanks,
-Andrea


Thread overview: 11+ messages
2026-06-30 15:27 [PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity Andrea Righi
2026-07-03  5:51 ` K Prateek Nayak
2026-07-03  9:40   ` Andrea Righi
2026-07-03 10:00     ` Christian Loehle
2026-07-03 14:52       ` Andrea Righi
2026-07-03 16:54         ` Peter Zijlstra
2026-07-03 17:07           ` Andrea Righi
2026-07-03 11:20     ` Julia Lawall
2026-07-03 14:38       ` Andrea Righi
2026-07-03 12:33     ` Andrea Righi
2026-07-03 12:51       ` Julia Lawall
