From: David Hildenbrand <david@redhat.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: peter.maydell@linaro.org, Andrew Jones <drjones@redhat.com>,
Gavin Shan <gshan@redhat.com>,
ehabkost@redhat.com, richard.henderson@linaro.org,
alison.schofield@intel.com, qemu-devel@nongnu.org,
qemu-arm@nongnu.org, shan.gavin@gmail.com,
Igor Mammedov <imammedo@redhat.com>,
Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Wed, 17 Nov 2021 19:08:28 +0100
Message-ID: <8576e0e8-aa06-1c05-9849-806c2bce4141@redhat.com>
In-Reply-To: <20211117143015.00002e0a@Huawei.com>
On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
>
>>>>
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).
>>>
>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT,
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.
>>>
>
Hi Jonathan,
> The naming of the define is unhelpful. GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity. So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
>
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present. As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there. When it is hotplugged later the address is looked up in SRAT to identify
> the NUMA node.
In virtualized environments we use the SRAT only to indicate the hotpluggable
region (-> indicate the maximum possible PFN to the guest OS); the actual
present-memory+PXM assignment is not done via the SRAT. I think we differ quite
a lot here from actual hardware.
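
For reference, a single SRAT memory affinity entry can only describe one
range->PXM mapping plus a couple of flags; roughly the ACPICA layout (sketched
from memory, so double-check the details):

struct acpi_srat_mem_affinity {
    struct acpi_subtable_header header;  /* type 1, length 40 */
    u32 proximity_domain;                /* the PXM this range belongs to */
    u16 reserved;
    u64 base_address;                    /* start of the range */
    u64 length;                          /* size of the range */
    u32 reserved1;
    u32 flags;                           /* ACPI_SRAT_MEM_ENABLED,
                                            ACPI_SRAT_MEM_HOT_PLUGGABLE, ... */
    u64 reserved2;
};
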
>
> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
>
> For the CXL type 3 option, the current proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
>
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort than true dynamic node creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
>
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
>
> That would make all this stuff discoverable via PCI config space rather than ACPI.
> CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
>
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes. Long term it may be the most
> flexible option.
I think for virtio-mem it's actually a bit simpler:
a) The user defined an empty node on the QEMU cmdline.
b) The user assigned a virtio-mem device to that node, either when
coldplugging or hotplugging the device.
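
For example, such a setup could look like this (just a sketch from memory, the
exact option spelling might be off):

  qemu-system-x86_64 \
      -m 4G,maxmem=12G -smp 4 \
      -object memory-backend-ram,id=mem0,size=4G \
      -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
      -numa node,nodeid=1 \
      -object memory-backend-ram,id=vmem0,size=8G \
      -device virtio-mem-pci,id=vm0,memdev=vmem0,node=1,requested-size=0

Node 1 has no CPUs and no boot memory -- it only ever gets populated once the
virtio-mem device actually provides memory.
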
So we don't actually "hotplug" a new node; the (possible) node is already known
to QEMU right when starting up. It's just a matter of exposing that fact to the
guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
It seems to boil down to an ACPI limitation.
Conceptually, virtio-mem on an empty node in QEMU is not that different from
hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
an empty node. But I guess it all just doesn't work with QEMU as of now.
In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
build_srat_memory(table_data, machine->device_memory->base,
hotpluggable_address_space_size, nb_numa_nodes - 1,
MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says something different". From both values we
could -- when running under QEMU -- conclude the maximum possible PFN and the
maximum possible node. But the latter is not what Linux does: it simply maps
the PXM indicated in that memory affinity entry to a NUMA node
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
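
Simplified, what Linux does per memory affinity entry is roughly this (a sketch
from memory, not the exact code):

static int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
        u64 start = ma->base_address;
        u64 end = start + ma->length;
        int pxm = ma->proximity_domain;
        int node;

        if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
                return 0;

        /* Map the PXM from this entry to a node; nothing is derived about
         * how many (possible) nodes might show up later. */
        node = acpi_map_pxm_to_node(pxm);
        numa_add_memblk(node, start, end);
        node_set(node, numa_nodes_parsed);

        if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
                memblock_mark_hotplug(start, ma->length);

        return 0;
}
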
I do wonder if we could simply expose the same hotpluggable range via multiple nodes:
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
* providing _PXM method if necessary.
*/
if (hotpluggable_address_space_size) {
+ /*
+ * For the guest to "know" about possible nodes, we'll indicate the
+ * same hotpluggable region to all empty nodes.
+ */
+ for (i = 0; i < nb_numa_nodes - 1; i++) {
+ if (machine->numa_state->nodes[i].node_mem > 0) {
+ continue;
+ }
+ build_srat_memory(table_data, machine->device_memory->base,
+ hotpluggable_address_space_size, i,
+ MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+ }
+ /*
+ * Historically, we always indicated all hotpluggable memory to the
+ * last node -- if it was empty or not.
+ */
build_srat_memory(table_data, machine->device_memory->base,
hotpluggable_address_space_size, nb_numa_nodes - 1,
MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
Of course, this won't make CPU hotplug to empty nodes happy if memory hotplug
is not enabled for a VM. I did not check in detail whether that is valid
according to ACPI -- Linux might eat it (did not try yet, though).
--
Thanks,
David / dhildenb