From: Daniel Henrique Barboza <danielhb413@gmail.com>
To: "Cédric Le Goater" <clg@kaod.org>,
"David Gibson" <david@gibson.dropbear.id.au>
Cc: Laurent Vivier <lvivier@redhat.com>,
Thomas Huth <thuth@redhat.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
qemu-devel@nongnu.org, groug@kaod.org, qemu-ppc@nongnu.org,
Igor Mammedov <imammedo@redhat.com>
Subject: Re: [PATCH 1/2] spapr: number of SMP sockets must be equal to NUMA nodes
Date: Mon, 29 Mar 2021 15:32:37 -0300 [thread overview]
Message-ID: <91e406bf-c9c6-0734-1f69-081d3633332b@gmail.com> (raw)
In-Reply-To: <9870aaba-9921-5c5d-113c-5be6cd098cf2@kaod.org>
On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> On 3/29/21 6:20 AM, David Gibson wrote:
>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>
>>>>>
>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when
>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is
>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are
>>>>>>> equivalent.
>>>>>>>
>>>>>>> Further discussions in the kernel mailing list [2] shown that the
>>>>>>> powerpc kernel always considered that the number of sockets were equal
>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense,
>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The
>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were
>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node
>>>>>>> as in [1], happened to be correctly represented in the kernel by
>>>>>>> accident during all these years.
>>>>>>>
>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same
>>>>>>> time, this freedom can't result in unrealistic hardware representations
>>>>>>> being emulated. If the real hardware and the pseries kernel don't
>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we.
>>>>>>>
>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the
>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>> condition.
>>>>>>
>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
>>>>>> for users to match these up, for no real gain. I'm also concerned
>>>>>> that this will require follow on changes in libvirt to not make this a
>>>>>> really cryptic and irritating point of failure.
>>>>>
>>>>> Haven't though about required Libvirt changes, although I can say that there
>>>>> will be some amount to be mande and it will probably annoy existing users
>>>>> (everyone that has a multiple socket per NUMA node topology).
>>>>>
>>>>> There is not much we can do from the QEMU layer aside from what I've proposed
>>>>> here. The other alternative is to keep interacting with the kernel folks to
>>>>> see if there is a way to keep our use case untouched.
>>>>
>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
>>>> replies from Cédric to my objections and mpe's. Even with sockets
>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>> NUMA nodes makes sense.
>>>
>>> I did a couple of replies in different email threads but maybe not
>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>
>> I think I saw some of those, but maybe not all.
>>
>>> Shouldn't we get rid of the socket concept, die also, under pseries
>>> since they don't exist under PAPR ? We only have numa nodes, cores,
>>> threads AFAICT.
>>
>> Theoretically, yes. I'm not sure it's really practical, though, since
>> AFAICT, both qemu and the kernel have the notion of sockets (though
>> not dies) built into generic code.
>
> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> and PPC Linux only has a NUMA node id, on pseries and powernv.
>
>> It does mean that one possible approach here - maybe the best one - is
>> to simply declare that sockets are meaningless under, so we simply
>> don't expect what the guest kernel reports to match what's given to
>> qemu.
>>
>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>> but it is likely to surprise and confuse people.
>>
>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ?
>>> There are a couple of places where Linux checks for the underlying
>>> hypervisor already.
>>>
>>>>> This also means that
>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where
>>>>> we inform cores per socket information to the kernel.
>>>>
>>>> Well.. unless we can find some other sensible way to convey that
>>>> information. I haven't given up hope for that yet.
>>>
>>> Well, we could start by fixing the value in QEMU. It is broken
>>> today.
>>
>> Fixing what value, exactly?
>
> The value of the "ibm,chip-id" since we are keeping the property under
> QEMU.
David, I believe this has to do with the discussing we had last Friday.
I mentioned that the ibm,chip-id property is being calculated in a way that
promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
e.g.:
-smp 4,cores=4,maxcpus=8,threads=1 \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
$ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
ibm,pft-size = <0x00 0x19>;
ibm,chip-id = <0x00>;
--
ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
ibm,pft-size = <0x00 0x19>;
ibm,chip-id = <0x00>;
--
ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
ibm,pft-size = <0x00 0x19>;
ibm,chip-id = <0x00>;
--
ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
ibm,pft-size = <0x00 0x19>;
ibm,chip-id = <0x00>;
We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a different
NUMA node than 0-1. This would mean that the same socket would belong to
different NUMA nodes at the same time.
I believe this is what Cedric wants to be addressed. Given that the property is
called after the OPAL property ibm,chip-id, the kernel expects that the property
will have the same semantics as in OPAL.
Thanks,
DHB
>
>>> This is all coming from some work we did last year to evaluate our HW
>>> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
>>> We saw some real problems because Linux did not have a clear view of the
>>> topology. See the figures here :
>>>
>>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
>>>
>>> The node id is a key parameter for system resource management, memory
>>> allocation, interrupt affinity, etc. Linux scales much better if used
>>> correctly.
>>
>> Well, sure. And we have all the ibm,associativity stuff to convey the
>> node ids to the guest (which has its own problems, but not that are
>> relevant here). What's throwing me is why getting node IDs correct
>> has anything to do with socket numbers.
>
>
next prev parent reply other threads:[~2021-03-29 18:35 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-19 18:34 [PATCH 0/2] pseries: SMP sockets must match NUMA nodes Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 1/2] spapr: number of SMP sockets must be equal to " Daniel Henrique Barboza
2021-03-23 1:03 ` David Gibson
2021-03-23 17:21 ` Daniel Henrique Barboza
2021-03-25 2:10 ` David Gibson
2021-03-25 8:56 ` Cédric Le Goater
2021-03-25 10:15 ` Daniel Henrique Barboza
2021-03-29 4:20 ` David Gibson
2021-03-29 15:32 ` Cédric Le Goater
2021-03-29 18:32 ` Daniel Henrique Barboza [this message]
2021-03-29 23:55 ` Igor Mammedov
2021-03-31 0:57 ` David Gibson
2021-03-31 4:58 ` Michael Ellerman
2021-03-31 15:22 ` Cédric Le Goater
2021-04-01 2:53 ` David Gibson
2021-03-31 15:18 ` Cédric Le Goater
2021-03-31 17:29 ` Daniel Henrique Barboza
2021-03-31 17:40 ` Cédric Le Goater
2021-04-01 2:59 ` David Gibson
2021-04-01 9:21 ` Cédric Le Goater
2021-03-29 23:51 ` Igor Mammedov
2021-03-30 21:33 ` Daniel Henrique Barboza
2021-03-19 18:34 ` [PATCH 2/2] spapr.c: remove 'ibm,chip-id' from DT Daniel Henrique Barboza
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=91e406bf-c9c6-0734-1f69-081d3633332b@gmail.com \
--to=danielhb413@gmail.com \
--cc=clg@kaod.org \
--cc=david@gibson.dropbear.id.au \
--cc=groug@kaod.org \
--cc=imammedo@redhat.com \
--cc=lvivier@redhat.com \
--cc=mpe@ellerman.id.au \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
--cc=srikar@linux.vnet.ibm.com \
--cc=thuth@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).