From: Leon Romanovsky <leon@kernel.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Valentin Schneider <vschneid@redhat.com>,
linux-kernel@vger.kernel.org, Steve Wahl <steve.wahl@hpe.com>,
Borislav Petkov <bp@alien8.de>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH v4] sched/fair: Use sched_domain_span() for topology_span_sane()
Date: Sun, 20 Jul 2025 13:41:36 +0300
Message-ID: <20250720104136.GI402218@unreal>
In-Reply-To: <20250709161917.14298-1-kprateek.nayak@amd.com>

On Wed, Jul 09, 2025 at 04:19:17PM +0000, K Prateek Nayak wrote:
> Leon noted a topology_span_sane() warning in their guest deployment
> starting from v6.16-rc1 [1]. Debug that followed pointed to the
> tl->mask() for the NODE domain being incorrectly resolved to that of the
> highest NUMA domain.
>
> tl->mask() for NODE is set to the sd_numa_mask() which depends on the
> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
> set to the "tl->numa_level" during tl traversal in build_sched_domains()
> calling sd_init() but was not reset before topology_span_sane().
>
> Since "tl->numa_level" still reflected the old value from
> build_sched_domains(), topology_span_sane() for the NODE domain trips
> when the span of the last NUMA domain overlaps.
>
> Instead of replicating the "sched_domains_curr_level" hack, Valentin
> suggested using the spans from the sched_domain objects constructed
> during build_sched_domains() which can also catch overlaps when the
> domain spans are fixed up by build_sched_domain().
>
> Since build_sched_domain() is skipped when tl->mask() of a child domain
> already covers the entire cpumap, skip the domains that have an empty
> span.
>
> The original warning was reproducible on the following NUMA topology
> reported by Leon:
>
> $ sudo numactl -H
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: 2927 MB
> node 0 free: 1603 MB
> node 1 cpus: 2 3
> node 1 size: 3023 MB
> node 1 free: 3008 MB
> node 2 cpus: 4 5
> node 2 size: 3023 MB
> node 2 free: 3007 MB
> node 3 cpus: 6 7
> node 3 size: 3023 MB
> node 3 free: 3002 MB
> node 4 cpus: 8 9
> node 4 size: 3022 MB
> node 4 free: 2718 MB
> node distances:
> node 0 1 2 3 4
> 0: 10 39 38 37 36
> 1: 39 10 38 37 36
> 2: 38 38 10 37 36
> 3: 37 37 37 10 36
> 4: 36 36 36 36 10
>
> The above topology can be mimicked using the following QEMU cmd that was
> used to reproduce the warning and test the fix:
>
> sudo qemu-system-x86_64 -enable-kvm -cpu host \
> -m 20G -smp cpus=10,sockets=10 -machine q35 \
> -object memory-backend-ram,size=4G,id=m0 \
> -object memory-backend-ram,size=4G,id=m1 \
> -object memory-backend-ram,size=4G,id=m2 \
> -object memory-backend-ram,size=4G,id=m3 \
> -object memory-backend-ram,size=4G,id=m4 \
> -numa node,cpus=0-1,memdev=m0,nodeid=0 \
> -numa node,cpus=2-3,memdev=m1,nodeid=1 \
> -numa node,cpus=4-5,memdev=m2,nodeid=2 \
> -numa node,cpus=6-7,memdev=m3,nodeid=3 \
> -numa node,cpus=8-9,memdev=m4,nodeid=4 \
> -numa dist,src=0,dst=1,val=39 \
> -numa dist,src=0,dst=2,val=38 \
> -numa dist,src=0,dst=3,val=37 \
> -numa dist,src=0,dst=4,val=36 \
> -numa dist,src=1,dst=0,val=39 \
> -numa dist,src=1,dst=2,val=38 \
> -numa dist,src=1,dst=3,val=37 \
> -numa dist,src=1,dst=4,val=36 \
> -numa dist,src=2,dst=0,val=38 \
> -numa dist,src=2,dst=1,val=38 \
> -numa dist,src=2,dst=3,val=37 \
> -numa dist,src=2,dst=4,val=36 \
> -numa dist,src=3,dst=0,val=37 \
> -numa dist,src=3,dst=1,val=37 \
> -numa dist,src=3,dst=2,val=37 \
> -numa dist,src=3,dst=4,val=36 \
> -numa dist,src=4,dst=0,val=36 \
> -numa dist,src=4,dst=1,val=36 \
> -numa dist,src=4,dst=2,val=36 \
> -numa dist,src=4,dst=3,val=36 \
> ...
>
> Suggested-by: Valentin Schneider <vschneid@redhat.com>
> Reported-by: Leon Romanovsky <leon@kernel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
> Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
> Tested-by: Valentin Schneider <vschneid@redhat.com>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Changes are based on tip:sched/urgent at commit fc975cfb3639
> ("sched/deadline: Fix dl_server runtime calculation formula")
Was this patch picked up?
Thanks,
Tested-by: Leon Romanovsky <leon@kernel.org>
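
For readers following the thread, the failure mode described in the quoted
commit message can be sketched with a toy model. This is not kernel code:
every toy_* name is hypothetical, CPUs are single bits in a 32-bit word, and
the level-1 mask (own node plus the last node) is an assumption chosen only
to make a partial overlap visible. It shows a mask callback that depends on
a global level, a build pass that leaves that global at its last value, and
why checking the spans recorded during the build avoids the stale lookup.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model, NOT kernel code: 10 CPUs, 2 per node, one bit per CPU.
 * All toy_* names are hypothetical stand-ins for illustration.
 */
#define TOY_NR_CPUS   10
#define TOY_NR_LEVELS 2	/* 0: NODE (own node), 1: a NUMA level */

/* Stands in for the global "sched_domains_curr_level". */
static int toy_curr_level;

/* Mask callback whose result depends on the global level. */
static uint32_t toy_level_mask(int cpu)
{
	uint32_t own_node = 0x3u << (cpu & ~1);	/* both CPUs of this node */

	if (toy_curr_level == 0)
		return own_node;
	/* Assumed level-1 shape: own node plus the last node (CPUs 8-9). */
	return own_node | 0x300u;
}

/* Spans recorded during the build, one per level and CPU. */
static uint32_t toy_span[TOY_NR_LEVELS][TOY_NR_CPUS];

/* Build pass: walks the levels and leaves the global at the last one. */
static void toy_build(void)
{
	for (int level = 0; level < TOY_NR_LEVELS; level++) {
		toy_curr_level = level;
		for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
			toy_span[level][cpu] = toy_level_mask(cpu);
	}
}

/* Sanity rule: two spans at one level must be identical or disjoint. */
static bool toy_spans_sane(uint32_t a, uint32_t b)
{
	return a == b || (a & b) == 0;
}
```

After toy_build(), calling toy_level_mask() for the NODE level still uses
level 1, so CPUs 0 and 2 get masks that share CPUs 8-9 without being equal
and toy_spans_sane() rejects them. The recorded toy_span[0][] entries for
the same CPUs are disjoint and pass the check.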