Re: [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI

From: "Cédric Le Goater" <clg@redhat.com>
To: ankita@nvidia.com, jgg@nvidia.com, marcel.apfelbaum@gmail.com,
	philmd@linaro.org, wangyanan55@huawei.com,
	alex.williamson@redhat.com, pbonzini@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Jonathan.Cameron@huawei.com
Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	mochs@nvidia.com, dnigam@nvidia.com, udhoke@nvidia.com,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI
Date: Mon, 11 Mar 2024 11:39:11 +0100
Message-ID: <2f86279d-3c0e-447b-97ae-f4257b84ad71@redhat.com>
In-Reply-To: <20240308145525.10886-1-ankita@nvidia.com>

On 3/8/24 15:55, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> There are upcoming devices which allow the CPU to access their memory
> cache coherently. It is sensible to expose such memory to the OS as NUMA
> nodes separate from the sysmem node. The ACPI spec provides a scheme in
> SRAT called the Generic Initiator Affinity Structure [1] to allow an
> association between a Proximity Domain (PXM) and a Generic Initiator (GI)
> (e.g. heterogeneous processors and accelerators, GPUs, and I/O devices
> with integrated compute or DMA engines).
> 
> While a single node per device may cover several use cases, it is
> insufficient for full utilization of the NVIDIA GPU MIG
> (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> GPU device resources (including device memory) into several (up to 8)
> isolated instances. Each memory partition requires a dedicated NUMA
> node to operate. The partitions are not fixed and can be created/deleted
> at runtime.
> 
> Linux does not provide a means to dynamically create/destroy NUMA nodes,
> and implementing such a feature is expected to be non-trivial. The nodes
> that the OS discovers at boot time while parsing the SRAT remain fixed. So
> we utilize the GI Affinity structures, which allow association between
> nodes and devices. Multiple GI structures per device/BDF are possible,
> allowing creation of multiple nodes in the VM by exposing a unique PXM in
> each of these structures.
> 
> Implement the mechanism to build the GI affinity structures, as QEMU
> currently does not. Introduce a new acpi-generic-initiator object to allow
> the host admin to link a device with an associated NUMA node. QEMU maintains
> this association and uses this object to build the requisite GI Affinity
> Structure.
> 
> When multiple NUMA nodes are associated with a device, an equal number of
> acpi-generic-initiator objects must be created, each representing
> a unique device:node association.
> 
> The following is one decoded GI affinity structure from the VM's ACPI SRAT.
> [0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
> [0C9h 0201   1]                       Length : 20
> 
> [0CAh 0202   1]                    Reserved1 : 00
> [0CBh 0203   1]           Device Handle Type : 01
> [0CCh 0204   4]             Proximity Domain : 00000007
> [0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00
> [0E0h 0224   4]        Flags (decoded below) : 00000001
>                                       Enabled : 1
> [0E4h 0228   4]                    Reserved2 : 00000000
> 
> [0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
> [0E9h 0233   1]                       Length : 20
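
For readers who want to check such an entry by hand, here is a minimal
sketch in Python that unpacks one 32-byte GI affinity entry. The field
order and sizes are taken only from the decoded dump above (1, 1, 1, 1, 4,
16, 4 and 4 bytes); the name parse_gi_affinity and the sample bytes are
illustrative and not part of the series. The Device Handle is kept as raw
bytes and deliberately not decoded further.

```python
import struct

# Field sizes as shown in the dump: 1+1+1+1+4+16+4+4 = 32 bytes (0x20).
# ACPI tables are little-endian, hence the "<" prefix (no padding).
GI_AFFINITY = struct.Struct("<BBBBI16sII")


def parse_gi_affinity(raw: bytes) -> dict:
    """Unpack one Generic Initiator Affinity entry into named fields."""
    (subtype, length, _reserved1, handle_type,
     pxm, handle, flags, _reserved2) = GI_AFFINITY.unpack(raw)
    return {
        "subtable_type": subtype,
        "length": length,
        "device_handle_type": handle_type,
        "proximity_domain": pxm,
        "device_handle": handle,      # raw 16 bytes, left uninterpreted
        "enabled": bool(flags & 1),   # bit 0 is shown as "Enabled" in the dump
    }


# Sample bytes rebuilt by hand from the decoded values in the dump.
sample = (
    bytes([0x05, 0x20, 0x00, 0x01])           # type, length, reserved1, handle type
    + (7).to_bytes(4, "little")               # Proximity Domain : 00000007
    + bytes.fromhex("00002000" + "00" * 12)   # Device Handle (16 bytes)
    + (1).to_bytes(4, "little")               # Flags : 00000001
    + bytes(4)                                # Reserved2
)

entry = parse_gi_affinity(sample)
print(entry)
```
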
> 
> On Grace Hopper systems, an admin will create a range of 8 nodes and associate
> them with the device using the acpi-generic-initiator object. While a
> configuration of fewer than 8 nodes per device is allowed, such a configuration
> will prevent utilization of the feature to the fullest. This setting is
> applicable to all the Grace+Hopper systems. The following is an example of
> the QEMU command line arguments to create 8 nodes and link them to the device
> 'dev0':
> 
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
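
Since the eight -numa/-object pairs above follow a fixed pattern, they can
be generated rather than typed. A minimal sketch in Python, assuming only
the option syntax shown in the cover letter; the helper name gi_args and
its parameters are hypothetical, and the -device line still has to be
supplied separately.

```python
def gi_args(dev_id: str, first_node: int, count: int = 8) -> list[str]:
    """Build the -numa and -object arguments linking `count` nodes to one device."""
    nodes = range(first_node, first_node + count)
    args: list[str] = []
    # One empty NUMA node per partition.
    for node in nodes:
        args += ["-numa", f"node,nodeid={node}"]
    # One acpi-generic-initiator object per device:node association.
    for i, node in enumerate(nodes):
        args += [
            "-object",
            f"acpi-generic-initiator,id=gi{i},pci-dev={dev_id},node={node}",
        ]
    return args


# Reproduces the example above: nodes 2..9 linked to 'dev0'.
args = gi_args("dev0", first_node=2)
print(" ".join(args))
```
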
> 
> The performance benefits can be realized by providing the NUMA node distances
> appropriately (through libvirt tags or QEMU params). The admin can get the
> distances among nodes in hardware using `numactl -H`.
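
On the QEMU side, distances are passed with -numa dist,src=..,dst=..,val=..
A minimal sketch in Python that turns a table of distances into those
arguments. The helper name dist_args and the distance value 40 are
placeholders for illustration, not measurements from real hardware; the
10..255 range check follows the usual SLIT convention (10 is local) and is
an assumption about what is sensible to pass.

```python
def dist_args(distances: dict[tuple[int, int], int]) -> list[str]:
    """Build '-numa dist' arguments from {(src, dst): distance} entries."""
    args: list[str] = []
    for (src, dst), val in sorted(distances.items()):
        # Assumed valid range, following the SLIT convention.
        if not 10 <= val <= 255:
            raise ValueError(f"distance {val} for {src}->{dst} out of range")
        args += ["-numa", f"dist,src={src},dst={dst},val={val}"]
    return args


# Placeholder values: distance between sysmem node 0 and device nodes 2 and 3.
example = dist_args({(0, 2): 40, (0, 3): 40})
print(" ".join(example))
```
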
> 
> This series goes along with the recently added vfio-pci variant driver [3].
> 
> Applied over v8.2.2
> base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
> 
> [1] ACPI Spec 6.3, Section 5.2.16.6
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [3]
> 
> Link for v8:
> Link: https://lore.kernel.org/all/20240306123317.4691-1-ankita@nvidia.com/

v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
though.

Michael, Igor, Ani,

Did you have time to take a look?

Thanks

C.



> v8 -> v9
> - Removed unused included headers based on Jonathan's suggestion.
> - Collected Reviewed-by from Jonathan.
> - Added acpi-generic-initiator support for i386
> - Moved HMAT change from patch 1/2 to 2/3.
> - Fixed nits.
> 
> v7 -> v8
> - Replaced the code to collect the acpi-generic-initiator objects
>    with the code to use recursive helper object_child_foreach_recursive
>    based on suggestion from Jonathan Cameron.
> - Added sanity check for the node id passed to the
>    acpi-generic-initiator object.
> - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> - Fixed nits pointed out by Markus and Jonathan.
> - Collected Markus' Acked-by.
> - Rebased to v8.2.2.
> 
> v6 -> v7
> - Updated code and the commit message to make acpi-generic-initiator
>    define a 1:1 relationship between device and node based on
>    Jonathan Cameron's suggestion.
> - Updated commit message to include the decoded GI entry in the SRAT.
> - Rebased to v8.2.1.
> 
> v5 -> v6
> - Updated commit message for the [1/2] and the cover letter.
> - Updated the acpi-generic-initiator object comment description for
>    clarity on the input host-nodes.
> - Rebased to v8.2.0-rc4.
> 
> v4 -> v5
> - Removed acpi-dev option until full support.
> - The NUMA nodes are saved as bitmap instead of uint16List.
> - Replaced asserts with exit calls.
> - Addressed other miscellaneous comments.
> 
> v3 -> v4
> - changed the ':' delimited way to a uint16 array to communicate the
> nodes associated with the device.
> - added asserts to handle invalid inputs.
> - addressed other miscellaneous v3 comments.
> 
> v2 -> v3
> - changed param to accept a ':' delimited list of NUMA nodes, instead
> of a range.
> - Removed nvidia-acpi-generic-initiator object.
> - Addressed miscellaneous comments in v2.
> 
> v1 -> v2
> - Removed dependency on sysfs to communicate the feature with variant module.
> - Use GI Affinity SRAT structure instead of Memory Affinity.
> - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> structure is used instead.
> - New objects introduced to establish link between device and nodes.
> 
> Ankit Agrawal (3):
>    qom: new object to associate device to NUMA node
>    hw/acpi: Implement the SRAT GI affinity structure
>    hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
> 
>   hw/acpi/acpi_generic_initiator.c         | 148 +++++++++++++++++++++++
>   hw/acpi/hmat.c                           |   2 +-
>   hw/acpi/meson.build                      |   1 +
>   hw/arm/virt-acpi-build.c                 |   3 +
>   hw/core/numa.c                           |   3 +-
>   hw/i386/acpi-build.c                     |   3 +
>   include/hw/acpi/acpi_generic_initiator.h |  47 +++++++
>   include/sysemu/numa.h                    |   1 +
>   qapi/qom.json                            |  17 +++
>   9 files changed, 223 insertions(+), 2 deletions(-)
>   create mode 100644 hw/acpi/acpi_generic_initiator.c
>   create mode 100644 include/hw/acpi/acpi_generic_initiator.h
>
