From: Gavin Shan <gshan@redhat.com>
To: Igor Mammedov <imammedo@redhat.com>
Cc: "Daniel Henrique Barboza" <dbarboza@ventanamicro.com>,
qemu-arm@nongnu.org, qemu-devel@nongnu.org,
qemu-riscv@nongnu.org, rad@semihalf.com,
peter.maydell@linaro.org, quic_llindhol@quicinc.com,
eduardo@habkost.net, marcel.apfelbaum@gmail.com,
philmd@linaro.org, wangyanan55@huawei.com, palmer@dabbelt.com,
alistair.francis@wdc.com, bin.meng@windriver.com,
thuth@redhat.com, lvivier@redhat.com, pbonzini@redhat.com,
yihyu@redhat.com, shan.gavin@gmail.com,
"Michal Prívozník" <mprivozn@redhat.com>
Subject: Re: [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
Date: Sat, 25 Feb 2023 11:05:42 +1100 [thread overview]
Message-ID: <7638e643-7a05-808c-ec47-c29db1c5d75f@redhat.com> (raw)
In-Reply-To: <20230224152035.6edfa832@imammedo.users.ipa.redhat.com>
On 2/25/23 1:20 AM, Igor Mammedov wrote:
> On Fri, 24 Feb 2023 21:16:39 +1100
> Gavin Shan <gshan@redhat.com> wrote:
>
>> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote:
>>> On 2/24/23 04:09, Gavin Shan wrote:
>>>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
>>>>> On 2/23/23 05:13, Gavin Shan wrote:
>>>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>>>>>> used to populate the CPU topology in the Linux guest. It's required that
>>>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>>>>>> scheduling domain can't be sorted out, as the following warning message
>>>>>> indicates. To avoid the unexpected confusion, this series attempts to
>>>>>> rejects such kind of insane configurations.
>>>>>>
>>>>>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>>>>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
>>>>>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
>>>>>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
>>>>>
>>>>>
>>>>> And why is this a QEMU problem? This doesn't hurt ACPI.
>>>>>
>>>>> Also, this restriction impacts breaks ARM guests in the wild that are running
>>>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to
>>>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>>>>>
>>>>
>>>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5
>>>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and
>>>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple
>>>> CPUs in different clusters and sockets have been associated with one NUMA node.
>>>>
>>>> If I'm correct, the configuration isn't sensible in a baremetal environment and
>>>> same Linux kernel is supposed to work well for baremetal and virtualized machine.
>>>> So I think QEMU needs to emulate the topology as much as we can to match with the
>>>> baremetal environment. It's the reason why I think it's a QEMU problem even it
>>>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com>
>>>> in another thread, we may need to gurantee that the CPUs in one cluster can't be
>>>> split to multiple NUMA nodes, which matches with the baremetal environment, as I
>>>> can understand.
>>>>
>>>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will
>>>> definitely break the configurations running in the wild.
>>>
>>>
>>> What about a warning? If the user attempts to use an exotic NUMA configuration
>>> like the one you mentioned we can print something like:
>>>
>>> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave"
>>>
>>> This would inform the user what might be going wrong in case Linux is crashing/error
>>> out on them and then the user is free to fix their topology (or the kernel). And
>>> at the same time we wouldn't break existing stuff that happens to be working.
>>>
>>>
>>
>> Yes, I think a warning message is more appropriate, so that users can fix their
>> irregular configurations and the existing configurations aren't disconnected.
>> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before
>> I'm going to change the code and post next revision.
>
> Honestly you (and libvirt as far as I recall) are using legacy options
> to assign cpus to numa nodes.
> With '-numa node,nodeid=0,cpus=0-1' you can't really be sure what/where
> in topology those cpus are.
> What you can do is to use '-numa cpu,...' option to assign socket/core/...
> to numa node, ex:
> "-numa cpu,node-id=1,socket-id=0 " or
> "-numa cpu,node-id=0,socket-id=1,core-id=0 " or
> "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0
> to get your desired mapping.
>
> The problem that's so far was stopping the later adoption by libvirt (Michal CCed)
> is that values used by it are machine specific and to do it properly, for a concrete
> '-M x -smp ...' at least for the first time qemu should be started with-
> -preconfig option and then user should query possible cpus for those values
> and assign them to numa nodes via QMP.
>
The issue isn't related to the legacy or modern way to configure CPU-to-NUMA association.
Note that qtests/numa-test also use the legacy way. For example, the issue can be triggered
with the following command lines where the modern configuration is used:
/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
-accel kvm -machine virt,gic-version=host \
-cpu host -smp 6,sockets=2,clusters=1,cores=3,threads=1 \
-m 768M,slots=16,maxmem=64G \
-object memory-backend-ram,id=mem0,size=256M \
-object memory-backend-ram,id=mem1,size=256M \
-object memory-backend-ram,id=mem2,size=256M \
-numa node,nodeid=0,memdev=mem0 \
-numa node,nodeid=1,memdev=mem1 \
-numa node,nodeid=2,memdev=mem2 \
-numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=0,thread-id=0 \
-numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=1,thread-id=0 \
-numa cpu,node-id=1,socket-id=0,cluster-id=0,core-id=2,thread-id=0 \
-numa cpu,node-id=1,socket-id=1,cluster-id=0,core-id=0,thread-id=0 \
-numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=1,thread-id=0 \
-numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=2,thread-id=0
:
:
[ 0.022260] =============== sched_init_domains ===================
[ 0.022263] CPU-0: 36,56,0,-1 thread=0 core=0-2 cluster=0-2 llc=0
[ 0.022267] CPU-1: 36,56,1,-1 thread=1 core=0-2 cluster=0-2 llc=1
[ 0.022269] CPU-2: 36,56,2,-1 thread=2 core=0-2 cluster=0-2 llc=2
[ 0.022270] CPU-3: 136,156,3,-1 thread=3 core=3-5 cluster=3-5 llc=3
[ 0.022272] CPU-4: 136,156,4,-1 thread=4 core=3-5 cluster=3-5 llc=4
[ 0.022273] CPU-5: 136,156,5,-1 thread=5 core=3-5 cluster=3-5 llc=5
[ 0.022275] CPU-0: SMT=0 CLUSTER=0 MULTICORE=0-2 CPUMASK=0-1 0-1
[ 0.022277] CPU-1: SMT=1 CLUSTER=1 MULTICORE=0-2 CPUMASK=0-1 0-1
[ 0.022279] CPU-2: SMT=2 CLUSTER=0-2 MULTICORE=2-3 CPUMASK=2-3 2-3
[ 0.022281] CPU-3: SMT=3 CLUSTER=3-5 MULTICORE=2-3 CPUMASK=2-3 2-3
[ 0.022283] CPU-4: SMT=4 CLUSTER=4 MULTICORE=3-5 CPUMASK=4-5 4-5
[ 0.022284] CPU-5: SMT=5 CLUSTER=5 MULTICORE=3-5 CPUMASK=4-5 4-5
[ 0.022314] build_sched_domains: CPU-0 level-SMT
[ 0.022317] build_sched_domains: CPU-0 level-CLS
[ 0.022318] topology_span_sane: CPU-0 0 vs CPU-2 0-2
[ 0.022322] ------------[ cut here ]------------
[ 0.022323] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2275 build_sched_domains+0x135c/0x1660
:
> btw: on x86 we also allow 'insane' configuration incl. those that do not
> exist on baremetal and do not warn about it anyone (i.e. it's user's
> responsibility to provide topology that specific guest OS could handle,
> aka it's not QEMU business but upper layers). (I do occasionally try
> introduce stricter checks in that are, but that gets push back more
> often that not).
>
> I'd do check only in case of a specific board where mapping is fixed
> in specs of emulated machine, otherwise I wouldn't bother.
>
Right, x86 can handle the irregular configuration well and we didn't find
any triggered warning messages there. As the subject indicates, it's a
ARM64 or riscv specific issue. Unfortunately, I failed to find the requirement,
socket-to-NUMA-node or cluster-to-NUMA-node 1:1 mapping from specs. However, it's not
sensible to split CPUs in one cluster to multiple NUMA nodes in a (ARM64 or RISCv)
baremetal environment. QEMU needs to emulate the environment close to the baremetal
machine if we can.
In summary, a warning message when multiple CPUs in one cluster are split to
multiple NUMA nodes, as Daniel suggested, sounds reasonable to me. Please let me
know your thoughts, Igor :)
Thanks,
Gavin
prev parent reply other threads:[~2023-02-25 0:06 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-23 8:13 [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Gavin Shan
2023-02-23 8:13 ` [PATCH v2 1/4] qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 Gavin Shan
2023-02-23 8:13 ` [PATCH v2 2/4] numa: Validate socket and NUMA node boundary if required Gavin Shan
2023-02-23 9:05 ` Philippe Mathieu-Daudé
2023-02-23 10:27 ` Gavin Shan
2023-02-23 8:14 ` [PATCH v2 3/4] hw/arm: Validate socket and NUMA node boundary Gavin Shan
2023-02-23 8:14 ` [PATCH v2 4/4] hw/riscv: " Gavin Shan
2023-02-23 12:25 ` [PATCH v2 0/4] NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines Andrew Jones
2023-02-24 7:20 ` Gavin Shan
2023-02-23 12:57 ` Daniel P. Berrangé
2023-02-24 5:47 ` Gavin Shan
2023-02-23 13:18 ` Daniel Henrique Barboza
2023-02-24 7:09 ` Gavin Shan
2023-02-24 9:26 ` Daniel Henrique Barboza
2023-02-24 10:16 ` Gavin Shan
2023-02-24 10:39 ` Andrew Jones
2023-02-24 14:20 ` Igor Mammedov
2023-02-25 0:05 ` Gavin Shan [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7638e643-7a05-808c-ec47-c29db1c5d75f@redhat.com \
--to=gshan@redhat.com \
--cc=alistair.francis@wdc.com \
--cc=bin.meng@windriver.com \
--cc=dbarboza@ventanamicro.com \
--cc=eduardo@habkost.net \
--cc=imammedo@redhat.com \
--cc=lvivier@redhat.com \
--cc=marcel.apfelbaum@gmail.com \
--cc=mprivozn@redhat.com \
--cc=palmer@dabbelt.com \
--cc=pbonzini@redhat.com \
--cc=peter.maydell@linaro.org \
--cc=philmd@linaro.org \
--cc=qemu-arm@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=qemu-riscv@nongnu.org \
--cc=quic_llindhol@quicinc.com \
--cc=rad@semihalf.com \
--cc=shan.gavin@gmail.com \
--cc=thuth@redhat.com \
--cc=wangyanan55@huawei.com \
--cc=yihyu@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).