From: "Cédric Le Goater" <clg@redhat.com>
To: ankita@nvidia.com, jgg@nvidia.com, marcel.apfelbaum@gmail.com,
philmd@linaro.org, wangyanan55@huawei.com,
alex.williamson@redhat.com, pbonzini@redhat.com,
shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
armbru@redhat.com, david@redhat.com, gshan@redhat.com,
Jonathan.Cameron@huawei.com
Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
mochs@nvidia.com, dnigam@nvidia.com, udhoke@nvidia.com,
qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI
Date: Mon, 11 Mar 2024 11:39:11 +0100
Message-ID: <2f86279d-3c0e-447b-97ae-f4257b84ad71@redhat.com>
In-Reply-To: <20240308145525.10886-1-ankita@nvidia.com>
On 3/8/24 15:55, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> There are upcoming devices which allow the CPU to access their memory in a
> cache-coherent manner. It is sensible to expose such memory as NUMA nodes separate
> from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
> called Generic Initiator Affinity Structure [1] to allow an association
> between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
> heterogeneous processors and accelerators, GPUs, and I/O devices with
> integrated compute or DMA engines).
>
> While a single node per device may cover several use cases, it is
> insufficient for full utilization of the NVIDIA GPU MIG
> (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> GPU device resources (including device memory) into several (up to 8)
> isolated instances. Each memory partition requires a dedicated NUMA
> node to operate. The partitions are not fixed and they can be created/deleted
> at runtime.
>
> Linux does not provide a means to dynamically create/destroy NUMA nodes,
> and implementing such a feature is expected to be non-trivial. The nodes
> that the OS discovers at boot time while parsing SRAT remain fixed. So we
> utilize the GI Affinity structures, which allow association between nodes
> and devices. Multiple GI structures per device/BDF are possible, allowing
> creation of multiple nodes in the VM by exposing a unique PXM in each of these
> structures.
>
> Implement the mechanism to build the GI affinity structures, as QEMU currently
> does not. Introduce a new acpi-generic-initiator object to allow the host admin
> to link a device with an associated NUMA node. QEMU maintains this association
> and uses this object to build the requisite GI Affinity Structure.
>
> When multiple NUMA nodes are associated with a device, one
> acpi-generic-initiator object must be created per node, each representing
> a unique device:node association.
>
> The following is one of the decoded GI affinity structures in the VM ACPI SRAT.
> [0C8h 0200 1] Subtable Type : 05 [Generic Initiator Affinity]
> [0C9h 0201 1] Length : 20
>
> [0CAh 0202 1] Reserved1 : 00
> [0CBh 0203 1] Device Handle Type : 01
> [0CCh 0204 4] Proximity Domain : 00000007
> [0D0h 0208 16] Device Handle : 00 00 20 00 00 00 00 00 00 00 00
> 00 00 00 00 00
> [0E0h 0224 4] Flags (decoded below) : 00000001
> Enabled : 1
> [0E4h 0228 4] Reserved2 : 00000000
>
> [0E8h 0232 1] Subtable Type : 05 [Generic Initiator Affinity]
> [0E9h 0233 1] Length : 20
>
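As a reading aid for the dump above, here is a minimal sketch that unpacks
the same 32 bytes by field. The byte string is transcribed from the decoded
entry; the helper name parse_gi and the dict keys are illustrative only and
are not part of the series. The Device Handle is kept as opaque bytes rather
than split into segment/BDF.

```python
import struct

# Field layout taken from the decoded dump (little-endian, no padding):
#   type(1) length(1) reserved1(1) handle_type(1) proximity_domain(4)
#   device_handle(16) flags(4) reserved2(4)  -> 32 bytes in total
GI_FORMAT = "<BBBBI16sII"


def parse_gi(raw):
    """Unpack one Generic Initiator Affinity Structure into named fields."""
    (sub_type, length, _reserved1, handle_type, pxm,
     handle, flags, _reserved2) = struct.unpack(GI_FORMAT, raw)
    return {
        "type": sub_type,
        "length": length,
        "device_handle_type": handle_type,
        "proximity_domain": pxm,
        "device_handle": handle,        # left opaque on purpose
        "enabled": bool(flags & 0x1),   # bit 0 of Flags
    }


# Bytes transcribed from the dump above (offsets 0C8h..0E7h).
raw = bytes.fromhex(
    "05 20 00 01"        # type 5, length 0x20, reserved1, handle type 1
    " 07 00 00 00"       # proximity domain 7
    " 00 00 20 00"       # device handle, first 4 bytes
    " 00 00 00 00"       # device handle, remaining 12 bytes
    " 00 00 00 00"
    " 00 00 00 00"
    " 01 00 00 00"       # flags: Enabled
    " 00 00 00 00"       # reserved2
)

gi = parse_gi(raw)
print(gi)
```

Note that the Length field of 0x20 matches struct.calcsize(GI_FORMAT).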
> On Grace Hopper systems, an admin will create a range of 8 nodes and associate
> them with the device using the acpi-generic-initiator object. While a
> configuration of fewer than 8 nodes per device is allowed, such a configuration
> will prevent full utilization of the feature. This setting is
> applicable to all Grace Hopper systems. The following is an example of
> the QEMU command line arguments to create 8 nodes and link them to the device
> 'dev0':
>
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
>
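Since the arguments above follow a regular pattern, they can be generated
rather than typed by hand. A minimal sketch, assuming only the option
spellings shown in the example; the helper name gi_args is hypothetical, and
the -device line is deliberately left out because it is host-specific:

```python
def gi_args(dev_id, first_node, count):
    """Build the -numa and -object arguments for `count` consecutive nodes.

    One acpi-generic-initiator object is emitted per node, so each object
    describes exactly one device:node association.
    """
    nodes = range(first_node, first_node + count)
    args = []
    for node in nodes:
        args += ["-numa", f"node,nodeid={node}"]
    for i, node in enumerate(nodes):
        args += [
            "-object",
            f"acpi-generic-initiator,id=gi{i},pci-dev={dev_id},node={node}",
        ]
    return args


# Reproduces the 8-node layout from the example (nodes 2..9, device 'dev0').
args = gi_args("dev0", first_node=2, count=8)
print(" ".join(args))
```

The device referenced by pci-dev still has to be defined separately with its
own -device argument, as in the example.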
> The performance benefits can be realized by providing the NUMA node distances
> appropriately (through libvirt tags or QEMU params). The admin can get the
> distances among nodes in hardware using `numactl -H`.
>
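One possible way to carry host distances over is to parse the "node
distances" table printed by `numactl -H` and turn it into QEMU
`-numa dist,src=,dst=,val=` arguments. The sketch below is only an
illustration: the sample text and its values are made up, the helper names
are hypothetical, and it assumes guest node ids equal host node ids, which
an admin would normally have to remap.

```python
def parse_distances(text):
    """Return {(src, dst): distance} from the 'node distances' table."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.startswith("node distances:"))
    # Header row looks like "node   0   1"; drop the leading word.
    cols = [int(n) for n in lines[start + 1].split()[1:]]
    dist = {}
    for line in lines[start + 2:]:
        if ":" not in line:
            break
        src, vals = line.split(":")
        for dst, val in zip(cols, vals.split()):
            dist[(int(src), dst)] = int(val)
    return dist


def dist_args(dist):
    """Build -numa dist arguments, skipping the node-to-itself entries."""
    args = []
    for (src, dst), val in sorted(dist.items()):
        if src != dst:
            args += ["-numa", f"dist,src={src},dst={dst},val={val}"]
    return args


# Made-up sample in the shape of `numactl -H` output (values are not real).
SAMPLE = """\
node distances:
node   0   1
  0:  10  40
  1:  40  10
"""

dist = parse_distances(SAMPLE)
print(" ".join(dist_args(dist)))
```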
> This series goes along with the recently added vfio-pci variant driver [3].
>
> Applied over v8.2.2
> base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
>
> [1] ACPI Spec 6.3, Section 5.2.16.6
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [3]
>
> Link for v8:
> Link: https://lore.kernel.org/all/20240306123317.4691-1-ankita@nvidia.com/
v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
though.
Michal, Igor, Ani,
Did you have time to take a look?
Thanks
C.
> v8 -> v9
> - Removed unused included headers based on Jonathan's suggestion.
> - Collected Reviewed-by from Jonathan.
> - Added acpi-generic-initiator support for i386
> - Moved HMAT change from patch 1/2 to 2/3.
> - Fixed nits.
>
> v7 -> v8
> - Replaced the code to collect the acpi-generic-initiator objects
> with the code to use recursive helper object_child_foreach_recursive
> based on suggestion from Jonathan Cameron.
> - Added sanity check for the node id passed to the
> acpi-generic-initiator object.
> - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> - Fixed nits pointed out by Marcus and Jonathan.
> - Collected Marcus' Acked-by.
> - Rebased to v8.2.2.
>
> v6 -> v7
> - Updated code and the commit message to make acpi-generic-initiator
> define a 1:1 relationship between device and node based on
> Jonathan Cameron's suggestion.
> - Updated commit message to include the decoded GI entry in the SRAT.
> - Rebased to v8.2.1.
>
> v5 -> v6
> - Updated commit message for the [1/2] and the cover letter.
> - Updated the acpi-generic-initiator object comment description for
> clarity on the input host-nodes.
> - Rebased to v8.2.0-rc4.
>
> v4 -> v5
> - Removed acpi-dev option until full support.
> - The NUMA nodes are saved as bitmap instead of uint16List.
> - Replaced asserts with exit calls.
> - Addressed other miscellaneous comments.
>
> v3 -> v4
> - changed the ':' delimited way to a uint16 array to communicate the
> nodes associated with the device.
> - added asserts to handle invalid inputs.
> - addressed other miscellaneous v3 comments.
>
> v2 -> v3
> - changed param to accept a ':' delimited list of NUMA nodes, instead
> of a range.
> - Removed nvidia-acpi-generic-initiator object.
> - Addressed miscellaneous comments in v2.
>
> v1 -> v2
> - Removed dependency on sysfs to communicate the feature with variant module.
> - Use GI Affinity SRAT structure instead of Memory Affinity.
> - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> structure is used instead.
> - New objects introduced to establish link between device and nodes.
>
> Ankit Agrawal (3):
> qom: new object to associate device to NUMA node
> hw/acpi: Implement the SRAT GI affinity structure
> hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
>
> hw/acpi/acpi_generic_initiator.c | 148 +++++++++++++++++++++++
> hw/acpi/hmat.c | 2 +-
> hw/acpi/meson.build | 1 +
> hw/arm/virt-acpi-build.c | 3 +
> hw/core/numa.c | 3 +-
> hw/i386/acpi-build.c | 3 +
> include/hw/acpi/acpi_generic_initiator.h | 47 +++++++
> include/sysemu/numa.h | 1 +
> qapi/qom.json | 17 +++
> 9 files changed, 223 insertions(+), 2 deletions(-)
> create mode 100644 hw/acpi/acpi_generic_initiator.c
> create mode 100644 include/hw/acpi/acpi_generic_initiator.h
>
Thread overview: 7+ messages
2024-03-08 14:55 [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI ankita
2024-03-08 14:55 ` [PATCH v9 1/3] qom: new object to associate device to NUMA node ankita
2024-03-08 14:55 ` [PATCH v9 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
2024-03-08 14:55 ` [PATCH v9 3/3] hw/i386/acpi-build: Add support for SRAT Generic Initiator structures ankita
2024-03-11 15:19 ` Jonathan Cameron via
2024-03-11 10:39 ` Cédric Le Goater [this message]
2024-03-11 15:45 ` [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI Michael S. Tsirkin