From: David Hildenbrand <david@redhat.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: peter.maydell@linaro.org, Andrew Jones <drjones@redhat.com>,
	Gavin Shan <gshan@redhat.com>,
	ehabkost@redhat.com, richard.henderson@linaro.org,
	alison.schofield@intel.com, qemu-devel@nongnu.org,
	qemu-arm@nongnu.org, shan.gavin@gmail.com,
	Igor Mammedov <imammedo@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>
Subject: Re: [PATCH v2] hw/arm/virt: Expose empty NUMA nodes through ACPI
Date: Wed, 17 Nov 2021 19:08:28 +0100
Message-ID: <8576e0e8-aa06-1c05-9849-806c2bce4141@redhat.com>
In-Reply-To: <20211117143015.00002e0a@Huawei.com>

On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>>>>
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).  
>>>
>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT,
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.
>>>   
> 

Hi Jonathan,

> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity.  So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
> 
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present.  As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there.  When it is hotplugged later the address is looked up in SRAT to identify
> the NUMA node.

In virtualized environments, we use the SRAT only to indicate the hotpluggable
region (-> indicate the maximum possible PFN to the guest OS); the assignment of
actually present memory to PXMs is not done via SRAT. I think we differ quite a
lot here from actual hardware.

> 
> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
> 
> For the CXL type 3 option, currently proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> 
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort than true dynamic node creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
> 
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> 
> That would make all this stuff discoverable via PCI config space rather than ACPI.
> CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
> 
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes.  Long term it may be the most
> flexible option.


I think for virtio-mem it's actually a bit simpler:

a) The user defined an empty node on the QEMU cmdline.
b) The user assigned a virtio-mem device to a node, either when 
   coldplugging or hotplugging the device.

So we don't actually "hotplug" a new node; the (possible) node is already known
to QEMU right when starting up. It's just a matter of exposing that fact to the
guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
It seems to boil down to an ACPI limitation.
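
For reference, a setup combining a) and b) could look something like the
following on the QEMU cmdline -- an untested sketch with made-up ids and sizes,
where node 1 stays empty and is only ever populated via the virtio-mem device:

    -m 4G,maxmem=12G \
    -object memory-backend-ram,id=m0,size=4G \
    -numa node,nodeid=0,memdev=m0 \
    -numa node,nodeid=1 \
    -object memory-backend-ram,id=vm0,size=8G \
    -device virtio-mem-pci,id=vmem0,memdev=vm0,node=1,requested-size=0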

Conceptually, virtio-mem on an empty node in QEMU is not that different from
hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
an empty node. But I guess it all just doesn't work with QEMU as of now.


In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via

	build_srat_memory(table_data, machine->device_memory->base,
			  hotpluggable_address_space_size, nb_numa_nodes - 1,
			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says otherwise". From both values we
can -- when under QEMU -- conclude the maximum possible PFN and the maximum
possible node. But the latter is not what Linux does: it simply maps the PXM
indicated in the memory entry to a NUMA node
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
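
Paraphrasing from memory (a rough sketch, not a verbatim copy of the kernel
code, with "ma" being the SRAT memory affinity entry): the PXM of each memory
entry is mapped to a node as-is, and no "maximum possible node" is derived:

    /* rough sketch of drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init() */
    if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
        return;
    pxm = ma->proximity_domain;
    node = acpi_map_pxm_to_node(pxm);
    numa_add_memblk(node, start, start + ma->length);
    if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
        memblock_mark_hotplug(start, ma->length);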


I do wonder if we could simply expose the same hotpluggable range via multiple nodes:

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotpluggable_address_space_size) {
+        /*
+         * For the guest to "know" about possible nodes, we'll indicate the
+         * same hotpluggable region to all empty nodes.
+         */
+        for (i = 0; i < nb_numa_nodes - 1; i++) {
+            if (machine->numa_state->nodes[i].node_mem > 0) {
+                continue;
+            }
+            build_srat_memory(table_data, machine->device_memory->base,
+                              hotpluggable_address_space_size, i,
+                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+        }
+        /*
+         * Historically, we always indicated all hotpluggable memory to the
+         * last node -- whether it was empty or not.
+         */
         build_srat_memory(table_data, machine->device_memory->base,
                           hotpluggable_address_space_size, nb_numa_nodes - 1,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);


Of course, this won't make CPU hotplug to empty nodes happy if we don't have
memory hotplug enabled for a VM. I did not check in detail whether that is valid
according to ACPI -- Linux might eat it (did not try yet, though).


-- 
Thanks,

David / dhildenb



