* [PATCH v4] sched/fair: Use sched_domain_span() for topology_span_sane()
@ 2025-07-09 16:19 K Prateek Nayak
2025-07-10 15:09 ` Valentin Schneider
2025-07-20 10:41 ` Leon Romanovsky
0 siblings, 2 replies; 4+ messages in thread
From: K Prateek Nayak @ 2025-07-09 16:19 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Valentin Schneider, Leon Romanovsky, linux-kernel
Cc: Steve Wahl, Borislav Petkov, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, K Prateek Nayak
Leon noted a topology_span_sane() warning in their guest deployment
starting from v6.16-rc1 [1]. The debugging that followed pointed to the
tl->mask() for the NODE domain being incorrectly resolved to that of the
highest NUMA domain.
tl->mask() for NODE is set to the sd_numa_mask() which depends on the
global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
set to the "tl->numa_level" during tl traversal in build_sched_domains()
calling sd_init() but was not reset before topology_span_sane().
Since "sched_domains_curr_level" still reflected the stale value left
behind by build_sched_domains(), topology_span_sane() for the NODE domain
trips when the span of the last NUMA domain overlaps.
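To illustrate the failure mode, here is a minimal userspace sketch (not
kernel code). The names numa_masks, curr_level, numa_mask() and
build_pass() are hypothetical stand-ins for the NUMA mask table,
"sched_domains_curr_level", sd_numa_mask() and the tl traversal, and the
4-CPU, 2-level layout is an assumption chosen for brevity:

```c
#include <stdint.h>

#define NR_CPUS   4
#define NR_LEVELS 2

/*
 * Hypothetical mask table, one bit per CPU:
 * level 0 = CPUs of the local node, level 1 = all CPUs.
 */
static const uint32_t numa_masks[NR_LEVELS][NR_CPUS] = {
	{ 0x3, 0x3, 0xc, 0xc },
	{ 0xf, 0xf, 0xf, 0xf },
};

/* Stand-in for the global "sched_domains_curr_level". */
static int curr_level;

/* Stand-in for sd_numa_mask(): the result depends on hidden global state. */
static uint32_t numa_mask(int cpu)
{
	return numa_masks[curr_level][cpu];
}

/* Stand-in for the build pass: leaves curr_level at the last level visited. */
static void build_pass(void)
{
	for (int level = 0; level < NR_LEVELS; level++)
		curr_level = level;
}
```

After build_pass(), numa_mask(0) returns the level 1 mask (0xf) even if
the caller meant to ask for the node-local mask (0x3). Only resetting
curr_level to 0 first gives the intended answer, which is the hidden
dependency this patch avoids.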
Instead of replicating the "sched_domains_curr_level" hack, Valentin
suggested using the spans from the sched_domain objects constructed
during build_sched_domains() which can also catch overlaps when the
domain spans are fixed up by build_sched_domain().
Since build_sched_domain() is skipped when tl->mask() of a child domain
already covers the entire cpumap, skip the domains that have an empty
span.
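The resulting check can be sketched in userspace as follows. This is a
simplified model under stated assumptions, not the kernel implementation:
a uint32_t bitmask replaces struct cpumask (so at most 32 CPUs), span[cpu]
stands for the span of the sched_domain built for that CPU at one
topology level, and first_cpu() and spans_sane() are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Lowest CPU set in span, or nr_cpus if the span is empty. */
static int first_cpu(uint32_t span, int nr_cpus)
{
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (span & (1u << cpu))
			return cpu;
	return nr_cpus;
}

/*
 * Spans at a non-overlapping level must be pairwise equal or disjoint.
 * nr_cpus must not exceed 32 in this model.
 */
static bool spans_sane(const uint32_t *span, int nr_cpus)
{
	uint32_t covered = 0, id_seen = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		/* lowest bit set in the span is used as a unique id */
		int id = first_cpu(span[cpu], nr_cpus);

		/* empty span: this level is not used for this CPU */
		if (id >= nr_cpus)
			continue;

		if (id_seen & (1u << id)) {
			/* id already seen, the spans must be identical */
			if (span[id] != span[cpu])
				return false;
		} else {
			/* new id, the span must not touch covered CPUs */
			if (span[cpu] & covered)
				return false;

			covered |= span[cpu];
			id_seen |= 1u << id;
		}
	}
	return true;
}
```

For example, the spans { 0x3, 0x3, 0xc, 0xc } pass, { 0x3, 0x3, 0x6, 0xc }
fail because 0x6 partially overlaps 0x3, and { 0x3, 0x3, 0, 0 } pass
because the empty spans are skipped.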
The original warning was reproducible on the following NUMA topology
reported by Leon:
$ sudo numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: 2927 MB
node 0 free: 1603 MB
node 1 cpus: 2 3
node 1 size: 3023 MB
node 1 free: 3008 MB
node 2 cpus: 4 5
node 2 size: 3023 MB
node 2 free: 3007 MB
node 3 cpus: 6 7
node 3 size: 3023 MB
node 3 free: 3002 MB
node 4 cpus: 8 9
node 4 size: 3022 MB
node 4 free: 2718 MB
node distances:
node 0 1 2 3 4
0: 10 39 38 37 36
1: 39 10 38 37 36
2: 38 38 10 37 36
3: 37 37 37 10 36
4: 36 36 36 36 10
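The distance table above can be turned into per-level node masks with a
small userspace sketch. It assumes, as a simplification, that a NUMA level
groups every node within a given distance of the local node.
nodes_within() and partially_overlaps() are hypothetical helpers, not
kernel functions:

```c
#include <stdint.h>

enum { NR_NODES = 5 };

/* Node distances copied from the numactl output; row = source node. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{ 10, 39, 38, 37, 36 },
	{ 39, 10, 38, 37, 36 },
	{ 38, 38, 10, 37, 36 },
	{ 37, 37, 37, 10, 36 },
	{ 36, 36, 36, 36, 10 },
};

/* Bitmask of nodes within max_dist of node (bit n = node n). */
static uint32_t nodes_within(int node, int max_dist)
{
	uint32_t mask = 0;

	for (int n = 0; n < NR_NODES; n++)
		if (node_distance[node][n] <= max_dist)
			mask |= 1u << n;
	return mask;
}

/* Non-zero if the two masks intersect without being equal. */
static int partially_overlaps(uint32_t a, uint32_t b)
{
	return (a & b) != 0 && a != b;
}
```

With a cutoff of 36, node 0 resolves to nodes {0, 4} and node 1 to nodes
{1, 4}. The two masks intersect without being equal, which is allowed for
a NUMA level but is what topology_span_sane() rejects for NODE.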
The above topology can be mimicked using the following QEMU command,
which was used to reproduce the warning and test the fix:
sudo qemu-system-x86_64 -enable-kvm -cpu host \
-m 20G -smp cpus=10,sockets=10 -machine q35 \
-object memory-backend-ram,size=4G,id=m0 \
-object memory-backend-ram,size=4G,id=m1 \
-object memory-backend-ram,size=4G,id=m2 \
-object memory-backend-ram,size=4G,id=m3 \
-object memory-backend-ram,size=4G,id=m4 \
-numa node,cpus=0-1,memdev=m0,nodeid=0 \
-numa node,cpus=2-3,memdev=m1,nodeid=1 \
-numa node,cpus=4-5,memdev=m2,nodeid=2 \
-numa node,cpus=6-7,memdev=m3,nodeid=3 \
-numa node,cpus=8-9,memdev=m4,nodeid=4 \
-numa dist,src=0,dst=1,val=39 \
-numa dist,src=0,dst=2,val=38 \
-numa dist,src=0,dst=3,val=37 \
-numa dist,src=0,dst=4,val=36 \
-numa dist,src=1,dst=0,val=39 \
-numa dist,src=1,dst=2,val=38 \
-numa dist,src=1,dst=3,val=37 \
-numa dist,src=1,dst=4,val=36 \
-numa dist,src=2,dst=0,val=38 \
-numa dist,src=2,dst=1,val=38 \
-numa dist,src=2,dst=3,val=37 \
-numa dist,src=2,dst=4,val=36 \
-numa dist,src=3,dst=0,val=37 \
-numa dist,src=3,dst=1,val=37 \
-numa dist,src=3,dst=2,val=37 \
-numa dist,src=3,dst=4,val=36 \
-numa dist,src=4,dst=0,val=36 \
-numa dist,src=4,dst=1,val=36 \
-numa dist,src=4,dst=2,val=36 \
-numa dist,src=4,dst=3,val=36 \
...
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Reported-by: Leon Romanovsky <leon@kernel.org>
Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Changes are based on tip:sched/urgent at commit fc975cfb3639
("sched/deadline: Fix dl_server runtime calculation formula")
Changelog v3..v4:
o Use empty span to detect sd objects that haven't been initialized
instead of using "sd->private" (Valentin).
v3: https://lore.kernel.org/lkml/20250707105302.11029-1-kprateek.nayak@amd.com/
---
kernel/sched/topology.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..37b310116d19 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2403,6 +2403,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 	id_seen = sched_domains_tmpmask2;

 	for_each_sd_topology(tl) {
+		struct sd_data *sdd = &tl->data;

 		/* NUMA levels are allowed to overlap */
 		if (tl->flags & SDTL_OVERLAP)
@@ -2418,22 +2419,33 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 		 * breaks the linking done for an earlier span.
 		 */
 		for_each_cpu(cpu, cpu_map) {
-			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
+			struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
+			struct cpumask *sd_span = sched_domain_span(sd);
 			int id;

 			/* lowest bit set in this mask is used as a unique id */
-			id = cpumask_first(tl_cpu_mask);
+			id = cpumask_first(sd_span);
+
+			/*
+			 * Span can be empty if that topology level won't be
+			 * used for this CPU, i.e. a lower level already fully
+			 * describes the topology and build_sched_domain()
+			 * stopped there.
+			 */
+			if (id >= nr_cpu_ids)
+				continue;

 			if (cpumask_test_cpu(id, id_seen)) {
-				/* First CPU has already been seen, ensure identical spans */
-				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
+				/* First CPU has already been seen, ensure identical sd spans */
+				sd = *per_cpu_ptr(sdd->sd, id);
+				if (!cpumask_equal(sched_domain_span(sd), sd_span))
 					return false;
 			} else {
 				/* First CPU hasn't been seen before, ensure it's a completely new span */
-				if (cpumask_intersects(tl_cpu_mask, covered))
+				if (cpumask_intersects(sd_span, covered))
 					return false;

-				cpumask_or(covered, covered, tl_cpu_mask);
+				cpumask_or(covered, covered, sd_span);
 				cpumask_set_cpu(id, id_seen);
 			}
 		}
base-commit: fc975cfb36393db1db517fbbe366e550bcdcff14
--
2.34.1
* Re: [PATCH v4] sched/fair: Use sched_domain_span() for topology_span_sane()
From: Valentin Schneider @ 2025-07-10 15:09 UTC
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Leon Romanovsky, linux-kernel
Cc: Steve Wahl, Borislav Petkov, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, K Prateek Nayak
On 09/07/25 16:19, K Prateek Nayak wrote:
> Suggested-by: Valentin Schneider <vschneid@redhat.com>
> Reported-by: Leon Romanovsky <leon@kernel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
> Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
> Tested-by: Valentin Schneider <vschneid@redhat.com>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changes are based on tip:sched/urgent at commit fc975cfb3639
> ("sched/deadline: Fix dl_server runtime calculation formula")
>
> Changelog v3..v4:
>
> o Use empty span to detect sd objects that haven't been initialized
> instead of using "sd->private" (Valentin).
>
LGTM, thanks!
* Re: [PATCH v4] sched/fair: Use sched_domain_span() for topology_span_sane()
From: Leon Romanovsky @ 2025-07-20 10:41 UTC
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Valentin Schneider, linux-kernel, Steve Wahl, Borislav Petkov,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman
On Wed, Jul 09, 2025 at 04:19:17PM +0000, K Prateek Nayak wrote:
> Leon noted a topology_span_sane() warning in their guest deployment
> starting from v6.16-rc1 [1]. Debug that followed pointed to the
> tl->mask() for the NODE domain being incorrectly resolved to that of the
> highest NUMA domain.
> [...]
Was this patch picked?
Thanks,
Tested-by: Leon Romanovsky <leon@kernel.org>
* Re: [PATCH v4] sched/fair: Use sched_domain_span() for topology_span_sane()
From: K Prateek Nayak @ 2025-07-21 4:28 UTC
To: Leon Romanovsky
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Valentin Schneider, linux-kernel, Steve Wahl, Borislav Petkov,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman
Hello Leon,
On 7/20/2025 4:11 PM, Leon Romanovsky wrote:
> On Wed, Jul 09, 2025 at 04:19:17PM +0000, K Prateek Nayak wrote:
>> Leon noted a topology_span_sane() warning in their guest deployment
>> starting from v6.16-rc1 [1]. Debug that followed pointed to the
>> tl->mask() for the NODE domain being incorrectly resolved to that of the
>> highest NUMA domain.
>> [...]
>
> Was this patch picked?
Not yet. I think Peter was planning to pick it up as v6.17 material.
P.S. The latest version v5 can be found at
https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/
It is basically v4 but rebased on top of tip:sched/core to resolve
conflicts with a recent cleanup.
>
> Thanks,
> Tested-by: Leon Romanovsky <leon@kernel.org>
Thanks a ton!
--
Thanks and Regards,
Prateek