* [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
@ 2026-03-24 0:55 Andrea Righi
2026-03-24 7:39 ` Vincent Guittot
0 siblings, 1 reply; 17+ messages in thread
From: Andrea Righi @ 2026-03-24 0:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, linux-kernel,
Felix Abecassis
On some platforms, the firmware may expose per-CPU performance
differences (e.g., via ACPI CPPC highest_perf) even when the system is
effectively symmetric. These small variations, typically due to silicon
binning, are reflected in arch_scale_cpu_capacity() and end up being
interpreted as real capacity asymmetry.
As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
triggering asymmetry-specific behaviors, even though all CPUs have
comparable performance.
Prevent this by treating CPU capacities within 20% of the maximum value
as equivalent when building the asymmetry topology. This filters out
firmware noise, while preserving correct behavior on real heterogeneous
systems, where capacity differences are significantly larger.
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/topology.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f5552..fe71ea9f3bda7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1432,9 +1432,8 @@ static void free_asym_cap_entry(struct rcu_head *head)
kfree(entry);
}
-static inline void asym_cpu_capacity_update_data(int cpu)
+static inline void asym_cpu_capacity_update_data(int cpu, unsigned long capacity)
{
- unsigned long capacity = arch_scale_cpu_capacity(cpu);
struct asym_cap_data *insert_entry = NULL;
struct asym_cap_data *entry;
@@ -1471,13 +1470,27 @@ static inline void asym_cpu_capacity_update_data(int cpu)
static void asym_cpu_capacity_scan(void)
{
struct asym_cap_data *entry, *next;
+ unsigned long max_cap = 0;
+ unsigned long capacity;
int cpu;
list_for_each_entry(entry, &asym_cap_list, link)
cpumask_clear(cpu_capacity_span(entry));
for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN))
- asym_cpu_capacity_update_data(cpu);
+ max_cap = max(max_cap, arch_scale_cpu_capacity(cpu));
+
+ /*
+ * Treat small capacity differences (< 20% max capacity) as noise,
+ * to prevent enabling SD_ASYM_CPUCAPACITY when it's not really
+ * needed.
+ */
+ for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
+ capacity = arch_scale_cpu_capacity(cpu);
+ if (capacity * 5 >= max_cap * 4)
+ capacity = max_cap;
+ asym_cpu_capacity_update_data(cpu, capacity);
+ }
list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
if (cpumask_empty(cpu_capacity_span(entry))) {
--
2.53.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 0:55 [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise Andrea Righi
@ 2026-03-24 7:39 ` Vincent Guittot
2026-03-24 7:55 ` Christian Loehle
2026-03-24 9:39 ` Andrea Righi
0 siblings, 2 replies; 17+ messages in thread
From: Vincent Guittot @ 2026-03-24 7:39 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, linux-kernel, Felix Abecassis
On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>
> On some platforms, the firmware may expose per-CPU performance
> differences (e.g., via ACPI CPPC highest_perf) even when the system is
> effectively symmetric. These small variations, typically due to silicon
> binning, are reflected in arch_scale_cpu_capacity() and end up being
> interpreted as real capacity asymmetry.
>
> As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
> triggering asymmetry-specific behaviors, even though all CPUs have
> comparable performance.
>
> Prevent this by treating CPU capacities within 20% of the maximum value
20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
871 but we still want to keep them different
Why would 5% not be enough?
> as equivalent when building the asymmetry topology. This filters out
> firmware noise, while preserving correct behavior on real heterogeneous
> systems, where capacity differences are significantly larger.
>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/topology.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c85f5552..fe71ea9f3bda7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1432,9 +1432,8 @@ static void free_asym_cap_entry(struct rcu_head *head)
> kfree(entry);
> }
>
> -static inline void asym_cpu_capacity_update_data(int cpu)
> +static inline void asym_cpu_capacity_update_data(int cpu, unsigned long capacity)
> {
> - unsigned long capacity = arch_scale_cpu_capacity(cpu);
> struct asym_cap_data *insert_entry = NULL;
> struct asym_cap_data *entry;
>
> @@ -1471,13 +1470,27 @@ static inline void asym_cpu_capacity_update_data(int cpu)
> static void asym_cpu_capacity_scan(void)
> {
> struct asym_cap_data *entry, *next;
> + unsigned long max_cap = 0;
> + unsigned long capacity;
> int cpu;
>
> list_for_each_entry(entry, &asym_cap_list, link)
> cpumask_clear(cpu_capacity_span(entry));
>
> for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN))
> - asym_cpu_capacity_update_data(cpu);
> + max_cap = max(max_cap, arch_scale_cpu_capacity(cpu));
> +
> + /*
> + * Treat small capacity differences (< 20% max capacity) as noise,
> + * to prevent enabling SD_ASYM_CPUCAPACITY when it's not really
> + * needed.
> + */
> + for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
> + capacity = arch_scale_cpu_capacity(cpu);
> + if (capacity * 5 >= max_cap * 4)
> + capacity = max_cap;
> + asym_cpu_capacity_update_data(cpu, capacity);
> + }
>
> list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> if (cpumask_empty(cpu_capacity_span(entry))) {
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 7:39 ` Vincent Guittot
@ 2026-03-24 7:55 ` Christian Loehle
2026-03-24 8:08 ` Christian Loehle
2026-03-24 9:39 ` Andrea Righi
1 sibling, 1 reply; 17+ messages in thread
From: Christian Loehle @ 2026-03-24 7:55 UTC (permalink / raw)
To: Vincent Guittot, Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Felix Abecassis
On 3/24/26 07:39, Vincent Guittot wrote:
> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>>
>> On some platforms, the firmware may expose per-CPU performance
>> differences (e.g., via ACPI CPPC highest_perf) even when the system is
>> effectively symmetric. These small variations, typically due to silicon
>> binning, are reflected in arch_scale_cpu_capacity() and end up being
>> interpreted as real capacity asymmetry.
>>
>> As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
>> triggering asymmetry-specific behaviors, even though all CPUs have
>> comparable performance.
>>
>> Prevent this by treating CPU capacities within 20% of the maximum value
>
> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> 871 but we still want to keep them different
>
> Why would 5% not be enough?
I've also used 5%, or rather the existing capacity_greater() macro.
>[snip]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 7:55 ` Christian Loehle
@ 2026-03-24 8:08 ` Christian Loehle
2026-03-24 9:46 ` Andrea Righi
0 siblings, 1 reply; 17+ messages in thread
From: Christian Loehle @ 2026-03-24 8:08 UTC (permalink / raw)
To: Vincent Guittot, Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Felix Abecassis
On 3/24/26 07:55, Christian Loehle wrote:
> On 3/24/26 07:39, Vincent Guittot wrote:
>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>>>
>>> On some platforms, the firmware may expose per-CPU performance
>>> differences (e.g., via ACPI CPPC highest_perf) even when the system is
>>> effectively symmetric. These small variations, typically due to silicon
>>> binning, are reflected in arch_scale_cpu_capacity() and end up being
>>> interpreted as real capacity asymmetry.
>>>
>>> As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
>>> triggering asymmetry-specific behaviors, even though all CPUs have
>>> comparable performance.
>>>
>>> Prevent this by treating CPU capacities within 20% of the maximum value
>>
>> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
>> 871 but we still want to keep them different
>>
>> Why would 5% not be enough?
>
> I've also used 5%, or rather the existing capacity_greater() macro.
Also, given that this patch even mentions this as "noise" one might ask
why the firmware wouldn't force-equalise this.
Anyway let me finally send out those asympacking patches which would make
that issue obsolete because we actually make use of the highest_perf
information from the firmware.
>
>> [snip]
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 7:39 ` Vincent Guittot
2026-03-24 7:55 ` Christian Loehle
@ 2026-03-24 9:39 ` Andrea Righi
2026-03-25 3:30 ` Koba Ko
1 sibling, 1 reply; 17+ messages in thread
From: Andrea Righi @ 2026-03-24 9:39 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, linux-kernel, Felix Abecassis
Hi Vincent,
On Tue, Mar 24, 2026 at 08:39:34AM +0100, Vincent Guittot wrote:
> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On some platforms, the firmware may expose per-CPU performance
> > differences (e.g., via ACPI CPPC highest_perf) even when the system is
> > effectively symmetric. These small variations, typically due to silicon
> > binning, are reflected in arch_scale_cpu_capacity() and end up being
> > interpreted as real capacity asymmetry.
> >
> > As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
> > triggering asymmetry-specific behaviors, even though all CPUs have
> > comparable performance.
> >
> > Prevent this by treating CPU capacities within 20% of the maximum value
>
> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> 871 but we still want to keep them different
>
> Why would 5% not be enough?
Sure, 5% seems a more reasonable margin. I'll just reuse capacity_greater()
as suggested by Christian.
Thanks,
-Andrea
>
>
>
> > as equivalent when building the asymmetry topology. This filters out
> > firmware noise, while preserving correct behavior on real heterogeneous
> > systems, where capacity differences are significantly larger.
> >
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/topology.c | 19 ++++++++++++++++---
> > 1 file changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 061f8c85f5552..fe71ea9f3bda7 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1432,9 +1432,8 @@ static void free_asym_cap_entry(struct rcu_head *head)
> > kfree(entry);
> > }
> >
> > -static inline void asym_cpu_capacity_update_data(int cpu)
> > +static inline void asym_cpu_capacity_update_data(int cpu, unsigned long capacity)
> > {
> > - unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > struct asym_cap_data *insert_entry = NULL;
> > struct asym_cap_data *entry;
> >
> > @@ -1471,13 +1470,27 @@ static inline void asym_cpu_capacity_update_data(int cpu)
> > static void asym_cpu_capacity_scan(void)
> > {
> > struct asym_cap_data *entry, *next;
> > + unsigned long max_cap = 0;
> > + unsigned long capacity;
> > int cpu;
> >
> > list_for_each_entry(entry, &asym_cap_list, link)
> > cpumask_clear(cpu_capacity_span(entry));
> >
> > for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN))
> > - asym_cpu_capacity_update_data(cpu);
> > + max_cap = max(max_cap, arch_scale_cpu_capacity(cpu));
> > +
> > + /*
> > + * Treat small capacity differences (< 20% max capacity) as noise,
> > + * to prevent enabling SD_ASYM_CPUCAPACITY when it's not really
> > + * needed.
> > + */
> > + for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
> > + capacity = arch_scale_cpu_capacity(cpu);
> > + if (capacity * 5 >= max_cap * 4)
> > + capacity = max_cap;
> > + asym_cpu_capacity_update_data(cpu, capacity);
> > + }
> >
> > list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> > if (cpumask_empty(cpu_capacity_span(entry))) {
> > --
> > 2.53.0
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 8:08 ` Christian Loehle
@ 2026-03-24 9:46 ` Andrea Righi
2026-03-24 10:29 ` Dietmar Eggemann
0 siblings, 1 reply; 17+ messages in thread
From: Andrea Righi @ 2026-03-24 9:46 UTC (permalink / raw)
To: Christian Loehle
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
Hi Christian,
On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> On 3/24/26 07:55, Christian Loehle wrote:
> > On 3/24/26 07:39, Vincent Guittot wrote:
> >> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> On some platforms, the firmware may expose per-CPU performance
> >>> differences (e.g., via ACPI CPPC highest_perf) even when the system is
> >>> effectively symmetric. These small variations, typically due to silicon
> >>> binning, are reflected in arch_scale_cpu_capacity() and end up being
> >>> interpreted as real capacity asymmetry.
> >>>
> >>> As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
> >>> triggering asymmetry-specific behaviors, even though all CPUs have
> >>> comparable performance.
> >>>
> >>> Prevent this by treating CPU capacities within 20% of the maximum value
> >>
> >> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> >> 871 but we still want to keep them different
> >>
> >> Why would 5% not be enough?
> >
> > I've also used 5%, or rather the existing capacity_greater() macro.
>
> Also, given that this patch even mentions this as "noise" one might ask
> why the firmware wouldn't force-equalise this.
I think it's reasonable to consider that as "noise" from a scheduler
perspective, but from a hardware/firmware point of view I don't have strong
arguments for equalizing the highest_perf values. In the end, at least in
my case, it all seems compliant with the ACPI/CPPC specs, and suggesting to
equalize them because "the kernel doesn't handle it well" doesn't seem like
a solid motivation...
> Anyway let me finally send out those asympacking patches which would make
> that issue obsolete because we actually make use of the highest_perf
> information from the firmware.
Looking forward to that. :)
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 9:46 ` Andrea Righi
@ 2026-03-24 10:29 ` Dietmar Eggemann
2026-03-24 11:01 ` Andrea Righi
0 siblings, 1 reply; 17+ messages in thread
From: Dietmar Eggemann @ 2026-03-24 10:29 UTC (permalink / raw)
To: Andrea Righi, Christian Loehle
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
linux-kernel, Felix Abecassis
On 24.03.26 10:46, Andrea Righi wrote:
> Hi Christian,
>
> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
>> On 3/24/26 07:55, Christian Loehle wrote:
>>> On 3/24/26 07:39, Vincent Guittot wrote:
>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
[...]
>>>> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
>>>> 871 but we still want to keep them different
>>>>
>>>> Why would 5% not be enough?
>>>
>>> I've also used 5%, or rather the existing capacity_greater() macro.
>>
>> Also, given that this patch even mentions this as "noise" one might ask
>> why the firmware wouldn't force-equalise this.
>
> I think it's reasonable to consider that as "noise" from a scheduler
> perspective, but from a hardware/firmware point of view I don't have strong
> arguments to propose equalizing the highest_perf values. At the end, at
> least in my case, it seems all compliant with the ACPI/CPPC specs and
> suggesting to equalize them because "the kernel doesn't handle it well"
> doesn't seem like a solid motivation...
The first time we observed this on NVIDIA Grace, we wondered whether
there might be functionality outside the task scheduler that makes use
of these slightly heterogeneous CPU capacity values from CPPC—and
whether the dependency on task scheduling was simply an overlooked
phenomenon.
And then there was DCPerf Mediawiki on a 72-CPU system always scoring
better with sched_asym_cpucap_active() = TRUE (mentioned already by
Chris L. in:
https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com)
>> Anyway let me finally send out those asympacking patches which would make
>> that issue obsolete because we actually make use of the highest_perf
>> information from the firmware.
>
> Looking forward to that. :)
>
> Thanks,
> -Andrea
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 10:29 ` Dietmar Eggemann
@ 2026-03-24 11:01 ` Andrea Righi
2026-03-25 9:23 ` Dietmar Eggemann
0 siblings, 1 reply; 17+ messages in thread
From: Andrea Righi @ 2026-03-24 11:01 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
Hi Dietmar,
On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
> On 24.03.26 10:46, Andrea Righi wrote:
> > Hi Christian,
> >
> > On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> >> On 3/24/26 07:55, Christian Loehle wrote:
> >>> On 3/24/26 07:39, Vincent Guittot wrote:
> >>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>
> [...]
>
> >>>> 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> >>>> 871 but we still want to keep them different
> >>>>
> >>>> Why would 5% not be enough?
> >>>
> >>> I've also used 5%, or rather the existing capacity_greater() macro.
> >>
> >> Also, given that this patch even mentions this as "noise" one might ask
> >> why the firmware wouldn't force-equalise this.
> >
> > I think it's reasonable to consider that as "noise" from a scheduler
> > perspective, but from a hardware/firmware point of view I don't have strong
> > arguments to propose equalizing the highest_perf values. At the end, at
> > least in my case, it seems all compliant with the ACPI/CPPC specs and
> > suggesting to equalize them because "the kernel doesn't handle it well"
> > doesn't seem like a solid motivation...
>
> The first time we observed this on NVIDIA Grace, we wondered whether
> there might be functionality outside the task scheduler that makes use
> of these slightly heterogeneous CPU capacity values from CPPC—and
> whether the dependency on task scheduling was simply an overlooked
> phenomenon.
>
> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
> better with sched_asym_cpucap_active() = TRUE (mentioned already by
> Chris L. in:
> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
Yeah, I think Chris' asym-packing approach might be the safest thing to do.
At the same time it would be nice to improve asym-capacity to introduce
some concept of SMT awareness, that was my original attempt with
https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
since we may see similar asym-capacity benefits on Vera (that has SMT,
unlike Grace). What do you think?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 9:39 ` Andrea Righi
@ 2026-03-25 3:30 ` Koba Ko
2026-03-25 12:29 ` Andrea Righi
0 siblings, 1 reply; 17+ messages in thread
From: Koba Ko @ 2026-03-25 3:30 UTC (permalink / raw)
To: Andrea Righi
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, linux-kernel,
Felix Abecassis
On Tue, Mar 24, 2026 at 10:39:41AM +0100, Andrea Righi wrote:
> Hi Vincent,
>
> On Tue, Mar 24, 2026 at 08:39:34AM +0100, Vincent Guittot wrote:
> > On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> > >
> > > On some platforms, the firmware may expose per-CPU performance
> > > differences (e.g., via ACPI CPPC highest_perf) even when the system is
> > > effectively symmetric. These small variations, typically due to silicon
> > > binning, are reflected in arch_scale_cpu_capacity() and end up being
> > > interpreted as real capacity asymmetry.
> > >
> > > As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
> > > triggering asymmetry-specific behaviors, even though all CPUs have
> > > comparable performance.
> > >
> > > Prevent this by treating CPU capacities within 20% of the maximum value
> >
> > 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> > 871 but we still want to keep them different
> >
> > Why would 5% not be enough?
>
> Sure, 5% seems a more reasonable margin. I'll just reuse capacity_greater()
> as suggested by Christian.
>
> Thanks,
> -Andrea
>
How about modifying asym_cpu_capacity_update_data to group all CPUs within 5% capacity difference into the same group?
```
+#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
list_for_each_entry(entry, &asym_cap_list, link) {
- if (capacity == entry->capacity)
+ if (!capacity_greater(capacity, entry->capacity) &&
+ !capacity_greater(entry->capacity, capacity))
```
> >
> >
> >
> > > as equivalent when building the asymmetry topology. This filters out
> > > firmware noise, while preserving correct behavior on real heterogeneous
> > > systems, where capacity differences are significantly larger.
> > >
> > > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > ---
> > > kernel/sched/topology.c | 19 ++++++++++++++++---
> > > 1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index 061f8c85f5552..fe71ea9f3bda7 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -1432,9 +1432,8 @@ static void free_asym_cap_entry(struct rcu_head *head)
> > > kfree(entry);
> > > }
> > >
> > > -static inline void asym_cpu_capacity_update_data(int cpu)
> > > +static inline void asym_cpu_capacity_update_data(int cpu, unsigned long capacity)
> > > {
> > > - unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > > struct asym_cap_data *insert_entry = NULL;
> > > struct asym_cap_data *entry;
> > >
> > > @@ -1471,13 +1470,27 @@ static inline void asym_cpu_capacity_update_data(int cpu)
> > > static void asym_cpu_capacity_scan(void)
> > > {
> > > struct asym_cap_data *entry, *next;
> > > + unsigned long max_cap = 0;
> > > + unsigned long capacity;
> > > int cpu;
> > >
> > > list_for_each_entry(entry, &asym_cap_list, link)
> > > cpumask_clear(cpu_capacity_span(entry));
> > >
> > > for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN))
> > > - asym_cpu_capacity_update_data(cpu);
> > > + max_cap = max(max_cap, arch_scale_cpu_capacity(cpu));
> > > +
> > > + /*
> > > + * Treat small capacity differences (< 20% max capacity) as noise,
> > > + * to prevent enabling SD_ASYM_CPUCAPACITY when it's not really
> > > + * needed.
> > > + */
> > > + for_each_cpu_and(cpu, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN)) {
> > > + capacity = arch_scale_cpu_capacity(cpu);
> > > + if (capacity * 5 >= max_cap * 4)
> > > + capacity = max_cap;
> > > + asym_cpu_capacity_update_data(cpu, capacity);
> > > + }
> > >
> > > list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
> > > if (cpumask_empty(cpu_capacity_span(entry))) {
> > > --
> > > 2.53.0
> > >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-24 11:01 ` Andrea Righi
@ 2026-03-25 9:23 ` Dietmar Eggemann
2026-03-25 9:32 ` Andrea Righi
0 siblings, 1 reply; 17+ messages in thread
From: Dietmar Eggemann @ 2026-03-25 9:23 UTC (permalink / raw)
To: Andrea Righi
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On 24.03.26 12:01, Andrea Righi wrote:
> Hi Dietmar,
>
> On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
>> On 24.03.26 10:46, Andrea Righi wrote:
>>> Hi Christian,
>>>
>>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
>>>> On 3/24/26 07:55, Christian Loehle wrote:
>>>>> On 3/24/26 07:39, Vincent Guittot wrote:
>>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
[...]
>> The first time we observed this on NVIDIA Grace, we wondered whether
>> there might be functionality outside the task scheduler that makes use
>> of these slightly heterogeneous CPU capacity values from CPPC—and
>> whether the dependency on task scheduling was simply an overlooked
>> phenomenon.
>>
>> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
>> better with sched_asym_cpucap_active() = TRUE (mentioned already by
>> Chris L. in:
>> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
>
> Yeah, I think Chris' asym-packing approach might be the safest thing to do.
>
> At the same time it would be nice to improve asym-capacity to introduce
> some concept of SMT awareness, that was my original attempt with
> https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
> since we may see similar asym-capacity benefits on Vera (that has SMT,
> unlike Grace). What do you think?
We never found a good way to specify a CPU capacity in the SMT case (EAS
and energy model included). So definitions like comparing CPU capacity with
utilization, CPU overutilization detection, etc. get more blurry.
But in case you now want to hide these small CPU capacity differences from
asym-cpucap setup you won't run into this 'SD_SHARE_CPUCAPACITY +
SD_ASYM_CPUCAPACITY'.
You still will have small differences in sched group capacities but this
is covered by load-balance.
BTW, you should have seen this on Vera?:
sd_init() [kernel/sched/topology.c]
1720 WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
1721 (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
1722 "CPU capacity asymmetry not supported on SMT\n");
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 9:23 ` Dietmar Eggemann
@ 2026-03-25 9:32 ` Andrea Righi
2026-03-25 11:16 ` Dietmar Eggemann
2026-03-25 12:48 ` Phil Auld
0 siblings, 2 replies; 17+ messages in thread
From: Andrea Righi @ 2026-03-25 9:32 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
> On 24.03.26 12:01, Andrea Righi wrote:
> > Hi Dietmar,
> >
> > On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
> >> On 24.03.26 10:46, Andrea Righi wrote:
> >>> Hi Christian,
> >>>
> >>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> >>>> On 3/24/26 07:55, Christian Loehle wrote:
> >>>>> On 3/24/26 07:39, Vincent Guittot wrote:
> >>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>
> [...]
>
> >> The first time we observed this on NVIDIA Grace, we wondered whether
> >> there might be functionality outside the task scheduler that makes use
> >> of these slightly heterogeneous CPU capacity values from CPPC—and
> >> whether the dependency on task scheduling was simply an overlooked
> >> phenomenon.
> >>
> >> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
> >> better with sched_asym_cpucap_active() = TRUE (mentioned already by
> >> Chris L. in:
> >> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
> >
> > Yeah, I think Chris' asym-packing approach might be the safest thing to do.
> >
> > At the same time it would be nice to improve asym-capacity to introduce
> > some concept of SMT awareness, that was my original attempt with
> > https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
> > since we may see similar asym-capacity benefits on Vera (that has SMT,
> > unlike Grace). What do you think?
>
> We never found a good way to specify a CPU capacity in the SMT case (EAS
> and energy model included). So comparing CPU capacity w/ utilization, CPU
> overutilization detection etc. definitions get more blurry.
Hm... so should we just avoid calling select_idle_capacity() when SMT is
enabled to prevent waking up tasks on both SMT siblings when there are
fully-idle SMT cores?
>
> But in case you now want to hide these small CPU capacity differences from
> asym-cpucap setup you won't run into this 'SD_SHARE_CPUCAPACITY +
> SD_ASYM_CPUCAPACITY'.
>
> You still will have small differences in sched group capacities but this
> is covered by load-balance.
>
> BTW, you should have seen this on Vera?:
>
> sd_init() [kernel/sched/topology.c]
>
> 1720 WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
> 1721 (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
> 1722 "CPU capacity asymmetry not supported on SMT\n");
Yep, I've seen that. :)
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 9:32 ` Andrea Righi
@ 2026-03-25 11:16 ` Dietmar Eggemann
2026-03-25 12:25 ` Andrea Righi
2026-03-25 12:48 ` Phil Auld
1 sibling, 1 reply; 17+ messages in thread
From: Dietmar Eggemann @ 2026-03-25 11:16 UTC (permalink / raw)
To: Andrea Righi
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On 25.03.26 10:32, Andrea Righi wrote:
> On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
>> On 24.03.26 12:01, Andrea Righi wrote:
>>> Hi Dietmar,
>>>
>>> On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
>>>> On 24.03.26 10:46, Andrea Righi wrote:
>>>>> Hi Christian,
>>>>>
>>>>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
>>>>>> On 3/24/26 07:55, Christian Loehle wrote:
>>>>>>> On 3/24/26 07:39, Vincent Guittot wrote:
>>>>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>>
>> [...]
>>
>>>> The first time we observed this on NVIDIA Grace, we wondered whether
>>>> there might be functionality outside the task scheduler that makes use
>>>> of these slightly heterogeneous CPU capacity values from CPPC—and
>>>> whether the dependency on task scheduling was simply an overlooked
>>>> phenomenon.
>>>>
>>>> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
>>>> better with sched_asym_cpucap_active() = TRUE (mentioned already by
>>>> Chris L. in:
>>>> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
>>>
>>> Yeah, I think Chris' asym-packing approach might be the safest thing to do.
>>>
>>> At the same time it would be nice to improve asym-capacity to introduce
>>> some concept of SMT awareness, that was my original attempt with
>>> https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
>>> since we may see similar asym-capacity benefits on Vera (that has SMT,
>>> unlike Grace). What do you think?
>>
>> We never found a good way to specify a CPU capacity in the SMT case (EAS
>> and energy model included). So comparing CPU capacity w/ utilization, CPU
>> overutilization detection etc. definitions get more blurry.
>
> Hm... so should we just avoid calling select_idle_capacity() when SMT is
> enabled to prevent waking up tasks on both SMT siblings when there are
> fully-idle SMT cores?
Yeah, pretty much. So prefer (2) over (1).
IMHO, we do have a similar issue here. Can we say that a logical CPU is idle
if its SMT sibling isn't? But at least we don't have to use any CPU cap/util
comparison there.
select_idle_sibling()
8132 if (sched_smt_active()) {
8133 has_idle_core = test_idle_cores(target);
8134
8135 if (!has_idle_core && cpus_share_cache(prev, target)) { <-- (1)
8136 i = select_idle_smt(p, sd, prev);
8137 if ((unsigned int)i < nr_cpumask_bits)
8138 return i;
8139 }
8140 }
8141
8142 i = select_idle_cpu(p, sd, has_idle_core, target); <-- (2a)
8143 if ((unsigned)i < nr_cpumask_bits)
8144 return i;
select_idle_cpu()
7926 for_each_cpu_wrap(cpu, cpus, target + 1) {
7927 if (has_idle_core) {
7928 i = select_idle_core(p, cpu, cpus, &idle_cpu); <-- (2b)
7929 if ((unsigned int)i < nr_cpumask_bits)
7930 return i;
7931
7932 } else {
7933 if (--nr <= 0)
7934 return -1;
7935 idle_cpu = __select_idle_cpu(cpu, p);
7936 if ((unsigned int)idle_cpu < nr_cpumask_bits)
7937 break;
7938 }
7939 }
[...]
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 11:16 ` Dietmar Eggemann
@ 2026-03-25 12:25 ` Andrea Righi
2026-03-25 15:26 ` Dietmar Eggemann
0 siblings, 1 reply; 17+ messages in thread
From: Andrea Righi @ 2026-03-25 12:25 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On Wed, Mar 25, 2026 at 12:16:59PM +0100, Dietmar Eggemann wrote:
> On 25.03.26 10:32, Andrea Righi wrote:
> > On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
> >> On 24.03.26 12:01, Andrea Righi wrote:
> >>> Hi Dietmar,
> >>>
> >>> On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
> >>>> On 24.03.26 10:46, Andrea Righi wrote:
> >>>>> Hi Christian,
> >>>>>
> >>>>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> >>>>>> On 3/24/26 07:55, Christian Loehle wrote:
> >>>>>>> On 3/24/26 07:39, Vincent Guittot wrote:
> >>>>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> >>
> >> [...]
> >>
> >>>> The first time we observed this on NVIDIA Grace, we wondered whether
> >>>> there might be functionality outside the task scheduler that makes use
> >>>> of these slightly heterogeneous CPU capacity values from CPPC—and
> >>>> whether the dependency on task scheduling was simply an overlooked
> >>>> phenomenon.
> >>>>
> >>>> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
> >>>> better with sched_asym_cpucap_active() = TRUE (mentioned already by
> >>>> Chris L. in:
> >>>> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
> >>>
> >>> Yeah, I think Chris' asym-packing approach might be the safest thing to do.
> >>>
> >>> At the same time it would be nice to improve asym-capacity to introduce
> >>> some concept of SMT awareness, that was my original attempt with
> >>> https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
> >>> since we may see similar asym-capacity benefits on Vera (that has SMT,
> >>> unlike Grace). What do you think?
> >>
> >> We never found a good way to specify a CPU capacity in the SMT case (EAS
> >> and energy model included). So comparing CPU capacity w/ utilization, CPU
> >> overutilization detection etc. definitions get more blurry.
> >
> > Hm... so should we just avoid calling select_idle_capacity() when SMT is
> > enabled to prevent waking up tasks on both SMT siblings when there are
> > fully-idle SMT cores?
>
> Yeah, pretty much. So prefer (2) over (1).
>
> IMHO, we do have a similar issue here. Can we say that a logical CPU is idle
> if its SMT sibling isn't? But at least we don't have to use any CPU cap/util
> comparison there.
>
> select_idle_sibling()
>
> 8132 if (sched_smt_active()) {
> 8133 has_idle_core = test_idle_cores(target);
> 8134
> 8135 if (!has_idle_core && cpus_share_cache(prev, target)) { <-- (1)
> 8136 i = select_idle_smt(p, sd, prev);
> 8137 if ((unsigned int)i < nr_cpumask_bits)
> 8138 return i;
> 8139 }
> 8140 }
> 8141
> 8142 i = select_idle_cpu(p, sd, has_idle_core, target); <-- (2a)
> 8143 if ((unsigned)i < nr_cpumask_bits)
> 8144 return i;
>
> select_idle_cpu()
>
> 7926 for_each_cpu_wrap(cpu, cpus, target + 1) {
> 7927 if (has_idle_core) {
> 7928 i = select_idle_core(p, cpu, cpus, &idle_cpu); <-- (2b)
> 7929 if ((unsigned int)i < nr_cpumask_bits)
> 7930 return i;
> 7931
> 7932 } else {
> 7933 if (--nr <= 0)
> 7934 return -1;
> 7935 idle_cpu = __select_idle_cpu(cpu, p);
> 7936 if ((unsigned int)idle_cpu < nr_cpumask_bits)
> 7937 break;
> 7938 }
> 7939 }
Exactly, we already prefer fully-idle cores over partially-idle cores with
asym-capacity disabled, but in that case the idle selection logic stays in
a world of idle bits, without cap/util math, so it's a bit easier. And it's
probably fine also when we have both asym-capacity + SMT (at least it seems
better than what we have now, ignoring the SMT part).
Essentially having something like the following (which already gives better
performance on Vera):
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d57c02e82f3a1..534634f813fca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8086,7 +8086,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* For asymmetric CPU capacity systems, our domain of interest is
* sd_asym_cpucapacity rather than sd_llc.
*/
- if (sched_asym_cpucap_active()) {
+ if (sched_asym_cpucap_active() && !sched_smt_active()) {
sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, target));
/*
* On an asymmetric CPU capacity system where an exclusive
Thanks,
-Andrea
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 3:30 ` Koba Ko
@ 2026-03-25 12:29 ` Andrea Righi
0 siblings, 0 replies; 17+ messages in thread
From: Andrea Righi @ 2026-03-25 12:29 UTC (permalink / raw)
To: Koba Ko
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, linux-kernel,
Felix Abecassis
Hi Koba,
On Wed, Mar 25, 2026 at 11:30:48AM +0800, Koba Ko wrote:
> On Tue, Mar 24, 2026 at 10:39:41AM +0100, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Tue, Mar 24, 2026 at 08:39:34AM +0100, Vincent Guittot wrote:
> > > On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> > > >
> > > > On some platforms, the firmware may expose per-CPU performance
> > > > differences (e.g., via ACPI CPPC highest_perf) even when the system is
> > > > effectively symmetric. These small variations, typically due to silicon
> > > > binning, are reflected in arch_scale_cpu_capacity() and end up being
> > > > interpreted as real capacity asymmetry.
> > > >
> > > > As a result, the scheduler incorrectly enables SD_ASYM_CPUCAPACITY,
> > > > triggering asymmetry-specific behaviors, even though all CPUs have
> > > > comparable performance.
> > > >
> > > > Prevent this by treating CPU capacities within 20% of the maximum value
> > >
> > > 20% is a bit high, my snapdragon rb5 has a mid CPU with a capacity of
> > > 871 but we still want to keep them different
> > >
> > > Why would 5% not be enough?
> >
> > Sure, 5% seems a more reasonable margin. I'll just reuse capacity_greater()
> > as suggested by Christian.
> >
> > Thanks,
> > -Andrea
> >
>
> How about modifying asym_cpu_capacity_update_data to group all CPUs within 5% capacity difference into the same group?
> ```
> +#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> list_for_each_entry(entry, &asym_cap_list, link) {
> - if (capacity == entry->capacity)
> + if (!capacity_greater(capacity, entry->capacity) &&
> + !capacity_greater(entry->capacity, capacity))
Yeah, makes sense, I like this better than mine. But there's still the
concern of potentially regressing other systems, nullifying the small
asym-capacity benefits (as Chris mentioned here:
https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com).
Thanks,
-Andrea
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 9:32 ` Andrea Righi
2026-03-25 11:16 ` Dietmar Eggemann
@ 2026-03-25 12:48 ` Phil Auld
1 sibling, 0 replies; 17+ messages in thread
From: Phil Auld @ 2026-03-25 12:48 UTC (permalink / raw)
To: Andrea Righi
Cc: Dietmar Eggemann, Christian Loehle, Vincent Guittot, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, linux-kernel, Felix Abecassis
On Wed, Mar 25, 2026 at 10:32:28AM +0100 Andrea Righi wrote:
> On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
> > On 24.03.26 12:01, Andrea Righi wrote:
> > > Hi Dietmar,
> > >
> > > On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
> > >> On 24.03.26 10:46, Andrea Righi wrote:
> > >>> Hi Christian,
> > >>>
> > >>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> > >>>> On 3/24/26 07:55, Christian Loehle wrote:
> > >>>>> On 3/24/26 07:39, Vincent Guittot wrote:
> > >>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > [...]
> >
> > >> The first time we observed this on NVIDIA Grace, we wondered whether
> > >> there might be functionality outside the task scheduler that makes use
> > >> of these slightly heterogeneous CPU capacity values from CPPC—and
> > >> whether the dependency on task scheduling was simply an overlooked
> > >> phenomenon.
> > >>
> > >> And then there was DCPerf Mediawiki on 72 CPUs system always scoring
> > >> better with sched_asym_cpucap_active() = TRUE (mentioned already by
> > >> Chris L. in:
> > >> https://lore.kernel.org/r/15ffdeb3-a0f3-4b88-92c0-17ffb03b0574@arm.com
> > >
> > > Yeah, I think Chris' asym-packing approach might be the safest thing to do.
> > >
> > > At the same time it would be nice to improve asym-capacity to introduce
> > > some concept of SMT awareness, that was my original attempt with
> > > https://lore.kernel.org/all/20260318092214.130908-1-arighi@nvidia.com,
> > > since we may see similar asym-capacity benefits on Vera (that has SMT,
> > > unlike Grace). What do you think?
> >
> > We never found a good way to specify a CPU capacity in the SMT case (EAS
> > and energy model included). So comparing CPU capacity w/ utilization, CPU
> > overutilization detection etc. definitions get more blurry.
>
> Hm... so should we just avoid calling select_idle_capacity() when SMT is
> enabled to prevent waking up tasks on both SMT siblings when there are
> fully-idle SMT cores?
>
That might be a good idea. Especially if it's general and not tied to
EAS/ASYM. I'm getting some requests for something like that.
Cheers,
Phil
> >
> > But in case you now want to hide these small CPU capacity differences from
> > asym-cpucap setup you won't run into this 'SD_SHARE_CPUCAPACITY +
> > SD_ASYM_CPUCAPACITY'.
> >
> > You still will have small differences in sched group capacities but this
> > is covered by load-balance.
> >
> > BTW, you should have seen this on Vera?:
> >
> > sd_init() [kernel/sched/topology.c]
> >
> > 1720 WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
> > 1721 (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY),
> > 1722 "CPU capacity asymmetry not supported on SMT\n");
>
> Yep, I've seen that. :)
>
> Thanks,
> -Andrea
>
--
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 12:25 ` Andrea Righi
@ 2026-03-25 15:26 ` Dietmar Eggemann
2026-03-25 16:50 ` Andrea Righi
0 siblings, 1 reply; 17+ messages in thread
From: Dietmar Eggemann @ 2026-03-25 15:26 UTC (permalink / raw)
To: Andrea Righi
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On 25.03.26 13:25, Andrea Righi wrote:
> On Wed, Mar 25, 2026 at 12:16:59PM +0100, Dietmar Eggemann wrote:
>> On 25.03.26 10:32, Andrea Righi wrote:
>>> On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
>>>> On 24.03.26 12:01, Andrea Righi wrote:
>>>>> Hi Dietmar,
>>>>>
>>>>> On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
>>>>>> On 24.03.26 10:46, Andrea Righi wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
>>>>>>>> On 3/24/26 07:55, Christian Loehle wrote:
>>>>>>>>> On 3/24/26 07:39, Vincent Guittot wrote:
>>>>>>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
[...]
> Exactly, we already prefer fully-idle cores over partially-idle cores with
> asym-capacity disabled, but in that case the idle selection logic stays in
> a world of idle bits, without cap/util math, so it's a bit easier. And it's
> probably fine also when we have both asym-capacity + SMT (at least it seems
> better than what we have now, ignoring the SMT part).
>
> Essentially having somethig like the following (which already gives better
> performance on Vera):
>
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d57c02e82f3a1..534634f813fca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8086,7 +8086,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * For asymmetric CPU capacity systems, our domain of interest is
> * sd_asym_cpucapacity rather than sd_llc.
> */
> - if (sched_asym_cpucap_active()) {
> + if (sched_asym_cpucap_active() && !sched_smt_active()) {
> sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, target));
> /*
> * On an asymmetric CPU capacity system where an exclusive
Ah, I thought we were talking !sched_asym_cpucap_active() case, either
by letting CPPC return the same value for all CPUs or by introducing
this 20%/5% threshold into asym_cpu_capacity_scan().
ASYM_CPUCAP + SHARE_CPUCAP vs SHARE_CPUCAP would still behave slightly
differently because of asym_fits_cpu() in all those early bailout
conditions (1) in sis().
select_idle_sibling()
if (choose_idle_cpu(target, p) &&
asym_fits_cpu(task_util, util_min, util_max, target)) <-- (1)
return target;
...
And you would still have misfit_task load balance enabled.
Those subtle differences may influence behavior compared to a simpler
homogeneous CPU capacity model, but it’s unclear whether they justify
introducing yet another variant alongside the existing homogeneous and
fully heterogeneous (non-SMT) approaches.
IMHO, we should only consider allowing this if there is clear evidence
of significant benefits across a representative range of benchmarks and
workloads.
[...]
* Re: [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise
2026-03-25 15:26 ` Dietmar Eggemann
@ 2026-03-25 16:50 ` Andrea Righi
0 siblings, 0 replies; 17+ messages in thread
From: Andrea Righi @ 2026-03-25 16:50 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Christian Loehle, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, linux-kernel, Felix Abecassis
On Wed, Mar 25, 2026 at 04:26:44PM +0100, Dietmar Eggemann wrote:
> On 25.03.26 13:25, Andrea Righi wrote:
> > On Wed, Mar 25, 2026 at 12:16:59PM +0100, Dietmar Eggemann wrote:
> >> On 25.03.26 10:32, Andrea Righi wrote:
> >>> On Wed, Mar 25, 2026 at 10:23:09AM +0100, Dietmar Eggemann wrote:
> >>>> On 24.03.26 12:01, Andrea Righi wrote:
> >>>>> Hi Dietmar,
> >>>>>
> >>>>> On Tue, Mar 24, 2026 at 11:29:24AM +0100, Dietmar Eggemann wrote:
> >>>>>> On 24.03.26 10:46, Andrea Righi wrote:
> >>>>>>> Hi Christian,
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2026 at 08:08:22AM +0000, Christian Loehle wrote:
> >>>>>>>> On 3/24/26 07:55, Christian Loehle wrote:
> >>>>>>>>> On 3/24/26 07:39, Vincent Guittot wrote:
> >>>>>>>>>> On Tue, 24 Mar 2026 at 01:55, Andrea Righi <arighi@nvidia.com> wrote:
>
> [...]
>
> > Exactly, we already prefer fully-idle cores over partially-idle cores with
> > asym-capacity disabled, but in that case the idle selection logic stays in
> > a world of idle bits, without cap/util math, so it's a bit easier. And it's
> > probably fine also when we have both asym-capacity + SMT (at least it seems
> > better than what we have now, ignoring the SMT part).
> >
> > Essentially having somethig like the following (which already gives better
> > performance on Vera):
> >
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d57c02e82f3a1..534634f813fca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8086,7 +8086,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > * For asymmetric CPU capacity systems, our domain of interest is
> > * sd_asym_cpucapacity rather than sd_llc.
> > */
> > - if (sched_asym_cpucap_active()) {
> > + if (sched_asym_cpucap_active() && !sched_smt_active()) {
> > sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, target));
> > /*
> > * On an asymmetric CPU capacity system where an exclusive
>
> Ah, I thought we were talking !sched_asym_cpucap_active() case, either
> by letting CPPC return the same value for all CPUs or by introducing
> this 20%/5% threshold into asym_cpu_capacity_scan().
Sure, we can also equalize capacity via CPPC, but I thought we were worried
about potential regressions with other systems that don't have SMT and may
actually benefit from the asym-capacity logic.
Moreover, if any other platform with SMT ends up enabling asym-capacity by
slightly exceeding the 5% margin, we may face the same issue again.
>
> ASYM_CPUCAP + SHARE_CPUCAP vs SHARE_CPUCAP would still behave slightly
> differently because of asym_fits_cpu() in all those early bailout
> conditions (1) in sis().
>
> select_idle_sibling()
>
> if (choose_idle_cpu(target, p) &&
> asym_fits_cpu(task_util, util_min, util_max, target)) <-- (1)
> return target;
>
> ...
Ah yes, this also needs to be changed...
>
> And you would still have misfit_task load balance enabled.
Correct, in fact to get the optimal performance on Vera with asym-capacity
enabled, I also need to fix the misfit logic to prioritize fully-idle SMT
cores. Same with find_new_ilb() and potentially other places. With these I
get almost 2x improvement in some cases, which is pretty big.
But I get similar results also disabling asym-capacity via the 5%
threshold.
>
> Those subtle differences may influence behavior compared to a simpler
> homogeneous CPU capacity model, but it’s unclear whether they justify
> introducing yet another variant alongside the existing homogeneous and
> fully heterogeneous (non-SMT) approaches.
>
> IMHO, we should only consider allowing this if there is clear evidence
> of significant benefits across a representative range of benchmarks and
> workloads.
Totally agree. But there's still the fact that select_idle_capacity() is
not compatible with SMT, so it should be avoided when SMT is enabled, one
way or another.
Thanks,
-Andrea
end of thread, other threads:[~2026-03-25 16:50 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-24 0:55 [PATCH] sched/topology: Avoid spurious asymmetry from CPU capacity noise Andrea Righi
2026-03-24 7:39 ` Vincent Guittot
2026-03-24 7:55 ` Christian Loehle
2026-03-24 8:08 ` Christian Loehle
2026-03-24 9:46 ` Andrea Righi
2026-03-24 10:29 ` Dietmar Eggemann
2026-03-24 11:01 ` Andrea Righi
2026-03-25 9:23 ` Dietmar Eggemann
2026-03-25 9:32 ` Andrea Righi
2026-03-25 11:16 ` Dietmar Eggemann
2026-03-25 12:25 ` Andrea Righi
2026-03-25 15:26 ` Dietmar Eggemann
2026-03-25 16:50 ` Andrea Righi
2026-03-25 12:48 ` Phil Auld
2026-03-24 9:39 ` Andrea Righi
2026-03-25 3:30 ` Koba Ko
2026-03-25 12:29 ` Andrea Righi