From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Cédric Le Goater" <clg@redhat.com>
Cc: ankita@nvidia.com, jgg@nvidia.com, marcel.apfelbaum@gmail.com,
philmd@linaro.org, wangyanan55@huawei.com,
alex.williamson@redhat.com, pbonzini@redhat.com,
shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
imammedo@redhat.com, eblake@redhat.com, armbru@redhat.com,
david@redhat.com, gshan@redhat.com, Jonathan.Cameron@huawei.com,
aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
mochs@nvidia.com, dnigam@nvidia.com, udhoke@nvidia.com,
qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI
Date: Mon, 11 Mar 2024 11:45:57 -0400
Message-ID: <20240311114543-mutt-send-email-mst@kernel.org>
In-Reply-To: <2f86279d-3c0e-447b-97ae-f4257b84ad71@redhat.com>
On Mon, Mar 11, 2024 at 11:39:11AM +0100, Cédric Le Goater wrote:
> On 3/8/24 15:55, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> >
> > There are upcoming devices which allow the CPU to access their memory
> > cache-coherently. It is sensible to expose such memory as NUMA nodes separate
> > from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
> > called Generic Initiator Affinity Structure [1] to allow an association
> > between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
> > heterogeneous processors and accelerators, GPUs, and I/O devices with
> > integrated compute or DMA engines).
> >
> > While a single node per device may cover several use cases, it is
> > insufficient for full utilization of the NVIDIA GPU MIG
> > (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> > GPU device resources (including device memory) into several (up to 8)
> > isolated instances. Each memory partition requires a dedicated NUMA
> > node to operate. The partitions are not fixed and they can be created/deleted
> > at runtime.
> >
> > Linux does not provide a means to dynamically create/destroy NUMA nodes,
> > and implementing such a feature is expected to be non-trivial. The nodes
> > that the OS discovers at boot time while parsing the SRAT remain fixed. So we
> > utilize the GI Affinity structures, which allow association between nodes
> > and devices. Multiple GI structures per device/BDF are possible, allowing
> > creation of multiple nodes in the VM by exposing a unique PXM in each of these
> > structures.
> >
> > Implement the mechanism to build the GI affinity structures, as QEMU currently
> > does not. Introduce a new acpi-generic-initiator object to allow the host admin
> > to link a device with an associated NUMA node. QEMU maintains this association
> > and uses this object to build the requisite GI Affinity Structure.
> >
> > When multiple NUMA nodes are associated with a device, one
> > acpi-generic-initiator object must be created per node, each representing
> > a unique device:node association.
> >
> > The following is one of the decoded GI affinity structures in the VM's ACPI SRAT.
> > [0C8h 0200 1] Subtable Type : 05 [Generic Initiator Affinity]
> > [0C9h 0201 1] Length : 20
> >
> > [0CAh 0202 1] Reserved1 : 00
> > [0CBh 0203 1] Device Handle Type : 01
> > [0CCh 0204 4] Proximity Domain : 00000007
> > [0D0h 0208 16] Device Handle : 00 00 20 00 00 00 00 00 00 00 00
> > 00 00 00 00 00
> > [0E0h 0224 4] Flags (decoded below) : 00000001
> > Enabled : 1
> > [0E4h 0228 4] Reserved2 : 00000000
> >
> > [0E8h 0232 1] Subtable Type : 05 [Generic Initiator Affinity]
> > [0E9h 0233 1] Length : 20
> >
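As a rough illustration of the 32-byte (0x20) layout in the decoded entry above, here is a minimal Python sketch that packs one such structure. The field order and sizes are read off the dump; the function name `pack_gi_affinity` is hypothetical, and the BDF byte order simply reproduces the bytes shown in the dump (a little-endian 16-bit value), which is an assumption and not a statement about what the ACPI spec mandates.

```python
import struct

GI_TYPE = 5            # Subtable Type: Generic Initiator Affinity
GI_LENGTH = 0x20       # Length field from the dump (32 bytes)
HANDLE_TYPE_PCI = 1    # Device Handle Type shown in the dump
FLAG_ENABLED = 1       # Flags bit 0: Enabled


def pack_gi_affinity(pxm, segment, bus, dev, fn):
    """Pack one GI affinity entry (hypothetical helper, illustration only)."""
    # Assumption: BDF is emitted as a little-endian 16-bit value, which
    # reproduces the "00 00 20 00" prefix seen in the dump for 00:04.0.
    bdf = (bus << 8) | (dev << 3) | fn
    # 16-byte device handle: segment (2), BDF (2), reserved (12).
    handle = struct.pack("<HH12x", segment, bdf)
    # type, length, reserved1, handle type, PXM, handle, flags, reserved2
    return struct.pack("<BBBBI16sII", GI_TYPE, GI_LENGTH, 0,
                       HANDLE_TYPE_PCI, pxm, handle, FLAG_ENABLED, 0)


entry = pack_gi_affinity(pxm=7, segment=0, bus=0, dev=4, fn=0)
print(len(entry), entry.hex(" "))
```

The call at the end uses the values from the dump (PXM 7, device at addr=04.0 on bus 0).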
> > On Grace Hopper systems, an admin will create a range of 8 nodes and associate
> > them with the device using the acpi-generic-initiator object. While a
> > configuration of fewer than 8 nodes per device is allowed, such a configuration
> > will prevent utilization of the feature to the fullest. This setting is
> > applicable to all Grace Hopper systems. The following is an example of
> > the QEMU command line arguments to create 8 nodes and link them to the device
> > 'dev0':
> >
> > -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> > -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> > -numa node,nodeid=8 -numa node,nodeid=9 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
> >
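Since the arguments quoted above follow a regular pattern (one -numa node and one acpi-generic-initiator object per node), they could be generated by a small script. The sketch below is only an illustration: the helper name `gi_args` and its parameters are hypothetical, and only the option strings are taken from the example above.

```python
def gi_args(dev_id, first_node, count):
    """Build the -numa/-object argument list for one device (illustrative)."""
    args = []
    # One NUMA node per partition, numbered consecutively from first_node.
    for i in range(count):
        args += ["-numa", f"node,nodeid={first_node + i}"]
    # One acpi-generic-initiator object per device:node association.
    for i in range(count):
        args += ["-object",
                 f"acpi-generic-initiator,id=gi{i},"
                 f"pci-dev={dev_id},node={first_node + i}"]
    return args


# Reproduces the 8-node layout from the example (nodes 2..9 for 'dev0').
print(" \\\n".join(" ".join(pair) for pair in
                   zip(*[iter(gi_args("dev0", 2, 8))] * 2)))
```

The -device line for the vfio-pci device itself is not generated here and still has to be supplied separately.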
> > The performance benefits can be realized by providing the NUMA node distances
> > appropriately (through libvirt tags or Qemu params). The admin can get the
> > distance among nodes in hardware using `numactl -H`.
> >
> > This series goes along with the recently added vfio-pci variant driver [3].
> >
> > Applied over v8.2.2
> > base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
> >
> > [1] ACPI Spec 6.3, Section 5.2.16.6
> > Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> > Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [3]
> >
> > Link for v8:
> > Link: https://lore.kernel.org/all/20240306123317.4691-1-ankita@nvidia.com/
>
> v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
> though.
>
> Michal, Igor, Ani,
>
> Did you have time to take a look ?
>
> Thanks
>
> C.
I tagged it already.
>
>
> > v8 -> v9
> > - Removed unused included headers based on Jonathan's suggestion.
> > - Collected Reviewed-by from Jonathan.
> > - Added acpi-generic-initiator support for i386
> > - Moved HMAT change from patch 1/2 to 2/3.
> > - Fixed nits.
> >
> > v7 -> v8
> > - Replaced the code to collect the acpi-generic-initiator objects
> > with the code to use recursive helper object_child_foreach_recursive
> > based on suggestion from Jonathan Cameron.
> > - Added sanity check for the node id passed to the
> > acpi-generic-initiator object.
> > - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> > - Fixed nits pointed out by Marcus and Jonathan.
> > - Collected Marcus' Acked-by.
> > - Rebased to v8.2.2.
> >
> > v6 -> v7
> > - Updated code and the commit message to make acpi-generic-initiator
> > define a 1:1 relationship between device and node based on
> > Jonathan Cameron's suggestion.
> > - Updated commit message to include the decoded GI entry in the SRAT.
> > - Rebased to v8.2.1.
> >
> > v5 -> v6
> > - Updated commit message for the [1/2] and the cover letter.
> > - Updated the acpi-generic-initiator object comment description for
> > clarity on the input host-nodes.
> > - Rebased to v8.2.0-rc4.
> >
> > v4 -> v5
> > - Removed acpi-dev option until full support.
> > - The NUMA nodes are saved as bitmap instead of uint16List.
> > - Replaced asserts with exit calls.
> > - Addressed other miscellaneous comments.
> >
> > v3 -> v4
> > - changed the ':' delimited way to a uint16 array to communicate the
> > nodes associated with the device.
> > - added asserts to handle invalid inputs.
> > - addressed other miscellaneous v3 comments.
> >
> > v2 -> v3
> > - changed param to accept a ':' delimited list of NUMA nodes, instead
> > of a range.
> > - Removed nvidia-acpi-generic-initiator object.
> > - Addressed miscellaneous comments in v2.
> >
> > v1 -> v2
> > - Removed dependency on sysfs to communicate the feature with variant module.
> > - Use GI Affinity SRAT structure instead of Memory Affinity.
> > - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> > structure is used instead.
> > - New objects introduced to establish link between device and nodes.
> >
> > Ankit Agrawal (3):
> > qom: new object to associate device to NUMA node
> > hw/acpi: Implement the SRAT GI affinity structure
> > hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
> >
> > hw/acpi/acpi_generic_initiator.c | 148 +++++++++++++++++++++++
> > hw/acpi/hmat.c | 2 +-
> > hw/acpi/meson.build | 1 +
> > hw/arm/virt-acpi-build.c | 3 +
> > hw/core/numa.c | 3 +-
> > hw/i386/acpi-build.c | 3 +
> > include/hw/acpi/acpi_generic_initiator.h | 47 +++++++
> > include/sysemu/numa.h | 1 +
> > qapi/qom.json | 17 +++
> > 9 files changed, 223 insertions(+), 2 deletions(-)
> > create mode 100644 hw/acpi/acpi_generic_initiator.c
> > create mode 100644 include/hw/acpi/acpi_generic_initiator.h
> >