qemu-devel.nongnu.org archive mirror
From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Cédric Le Goater" <clg@redhat.com>
Cc: ankita@nvidia.com, jgg@nvidia.com, marcel.apfelbaum@gmail.com,
	philmd@linaro.org, wangyanan55@huawei.com,
	alex.williamson@redhat.com, pbonzini@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, eblake@redhat.com, armbru@redhat.com,
	david@redhat.com, gshan@redhat.com, Jonathan.Cameron@huawei.com,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	mochs@nvidia.com, dnigam@nvidia.com, udhoke@nvidia.com,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI
Date: Mon, 11 Mar 2024 11:45:57 -0400
Message-ID: <20240311114543-mutt-send-email-mst@kernel.org>
In-Reply-To: <2f86279d-3c0e-447b-97ae-f4257b84ad71@redhat.com>

On Mon, Mar 11, 2024 at 11:39:11AM +0100, Cédric Le Goater wrote:
> On 3/8/24 15:55, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > There are upcoming devices which allow the CPU to access their memory
> > cache coherently. It is sensible to expose such memory as NUMA nodes separate
> > from the sysmem node to the OS. The ACPI spec provides a scheme in SRAT
> > called Generic Initiator Affinity Structure [1] to allow an association
> > between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
> > heterogeneous processors and accelerators, GPUs, and I/O devices with
> > integrated compute or DMA engines).
> > 
> > While a single node per device may cover several use cases, it is
> > insufficient for full utilization of the NVIDIA GPU MIG
> > (Multi-Instance GPU) [2] feature. The feature allows partitioning of the
> > GPU device resources (including device memory) into several (up to 8)
> > isolated instances. Each memory partition requires a dedicated NUMA
> > node to operate. The partitions are not fixed and they can be created/deleted
> > at runtime.
> > 
> > Linux does not provide a means to dynamically create/destroy NUMA nodes
> > and implementing such a feature is expected to be non-trivial. The nodes
> > that the OS discovers at boot time while parsing the SRAT remain fixed. So we
> > utilize the GI Affinity structures, which allow association between nodes
> > and devices. Multiple GI structures per device/BDF are possible, allowing
> > creation of multiple nodes in the VM by exposing a unique PXM in each of these
> > structures.
> > 
> > Implement the mechanism to build the GI affinity structures, as QEMU currently
> > does not. Introduce a new acpi-generic-initiator object to allow the host admin
> > to link a device with an associated NUMA node. QEMU maintains this association
> > and uses this object to build the requisite GI Affinity Structure.
> > 
> > When multiple NUMA nodes are associated with a device, it is required to
> > create that many acpi-generic-initiator objects, each representing
> > a unique device:node association.
> > 
> > The following is one decoded GI affinity structure from the VM ACPI SRAT.
> > [0C8h 0200   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0C9h 0201   1]                       Length : 20
> > 
> > [0CAh 0202   1]                    Reserved1 : 00
> > [0CBh 0203   1]           Device Handle Type : 01
> > [0CCh 0204   4]             Proximity Domain : 00000007
> > [0D0h 0208  16]                Device Handle : 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00
> > [0E0h 0224   4]        Flags (decoded below) : 00000001
> >                                       Enabled : 1
> > [0E4h 0228   4]                    Reserved2 : 00000000
> > 
> > [0E8h 0232   1]                Subtable Type : 05 [Generic Initiator Affinity]
> > [0E9h 0233   1]                       Length : 20
> > 
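For readers following the layout, here is a minimal sketch (not taken from the series) of how the 32-byte entry decoded above could be packed. The function name `pack_gi_affinity` is hypothetical. The field sizes come straight from the dump; the device handle encoding (a 16-bit segment followed by a 16-bit BDF, both little-endian) is an assumption inferred from the `00 00 20 00` bytes shown for addr=04.0 and should be checked against the ACPI spec before relying on it.

```python
import struct

# Hypothetical helper: builds one SRAT Generic Initiator Affinity entry
# with the field sizes shown in the decoded dump (total length 0x20).
def pack_gi_affinity(pxm, segment, bus, dev, fn, enabled=True):
    # Assumption: BDF is stored as one little-endian 16-bit value,
    # which reproduces the "00 00 20 00" handle bytes in the dump.
    bdf = (bus << 8) | (dev << 3) | fn
    handle = struct.pack("<HH12x", segment, bdf)  # 16 bytes, rest reserved
    return struct.pack(
        "<BBBBI16sII",
        5,                      # Subtable Type: Generic Initiator Affinity
        32,                     # Length (0x20)
        0,                      # Reserved1
        1,                      # Device Handle Type: 01, as in the dump
        pxm,                    # Proximity Domain
        handle,                 # Device Handle
        1 if enabled else 0,    # Flags: bit 0 = Enabled
        0,                      # Reserved2
    )

entry = pack_gi_affinity(pxm=7, segment=0, bus=0, dev=4, fn=0)
print(entry.hex(" "))
```

With the values from the dump (PXM 7, device at 00:04.0) the output is 32 bytes long and its handle bytes match the decoded entry.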
> > On Grace Hopper systems, an admin will create a range of 8 nodes and associate
> > them with the device using the acpi-generic-initiator object. While a
> > configuration of fewer than 8 nodes per device is allowed, such a configuration
> > will prevent utilization of the feature to the fullest. This setting is
> > applicable to all the Grace+Hopper systems. The following is an example of
> > the QEMU command line arguments to create 8 nodes and link them to the device
> > 'dev0':
> > 
> > -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> > -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> > -numa node,nodeid=8 -numa node,nodeid=9 \
> > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> > -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
> > -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
> > -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
> > -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
> > -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
> > -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
> > -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9
> > 
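Since the eight -numa/-object pairs above differ only in their index, they can be generated rather than typed. The following is a minimal sketch under that assumption; `gi_node_args` is a hypothetical helper name, and it only emits the option strings that appear in the example above (the -device argument is left to the caller).

```python
# Hypothetical helper: generates the repetitive -numa and -object
# arguments from the example, one node and one GI object per index.
def gi_node_args(dev_id, first_node, count):
    numa_args = []
    gi_args = []
    for i in range(count):
        node = first_node + i
        numa_args += ["-numa", f"node,nodeid={node}"]
        gi_args += [
            "-object",
            f"acpi-generic-initiator,id=gi{i},pci-dev={dev_id},node={node}",
        ]
    return numa_args, gi_args

numa_args, gi_args = gi_node_args("dev0", first_node=2, count=8)
print(" ".join(numa_args))
print(" ".join(gi_args))
```

The two lists are returned separately so that the caller can place the -device argument between them, as in the example.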
> > The performance benefits can be realized by providing appropriate NUMA node
> > distances (through libvirt tags or QEMU params). The admin can get the
> > distance among nodes in hardware using `numactl -H`.
> > 
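As an illustration of passing distances through QEMU params, here is a minimal sketch that turns a table of node distances into -numa dist arguments. The helper name `dist_args` is hypothetical, and the distance values in the usage line are placeholders, not measurements; real values would come from `numactl -H` on the host.

```python
# Hypothetical helper: builds "-numa dist" arguments from a mapping of
# (source node, destination node) -> distance.
def dist_args(distances):
    args = []
    for (src, dst), val in sorted(distances.items()):
        if src == dst:
            continue  # skip self-distances; only cross-node entries are emitted
        args += ["-numa", f"dist,src={src},dst={dst},val={val}"]
    return args

# Placeholder distances for illustration only.
print(" ".join(dist_args({(0, 2): 80, (2, 0): 80})))
```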
> > This series goes along with the recently added vfio-pci variant driver [3].
> > 
> > Applied over v8.2.2
> > base commit: 11aa0b1ff115b86160c4d37e7c37e6a6b13b77ea
> > 
> > [1] ACPI Spec 6.3, Section 5.2.16.6
> > Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu [2]
> > Link: https://lore.kernel.org/all/20240220115055.23546-1-ankita@nvidia.com/ [3]
> > 
> > Link for v8:
> > Link: https://lore.kernel.org/all/20240306123317.4691-1-ankita@nvidia.com/
> 
> v9 looks ready for QEMU 9.0. An Ack from the ACPI supporters is missing
> though.
> 
> Michal, Igor, Ani,
> 
> Did you have time to take a look ?
> 
> Thanks
> 
> C.

I tagged it already.

> 
> 
> > v8 -> v9
> > - Removed unused included headers based on Jonathan's suggestion.
> > - Collected Reviewed-by from Jonathan.
> > - Added acpi-generic-initiator support for i386
> > - Moved HMAT change from patch 1/2 to 2/3.
> > - Fixed nits.
> > 
> > v7 -> v8
> > - Replaced the code to collect the acpi-generic-initiator objects
> >    with the code to use recursive helper object_child_foreach_recursive
> >    based on suggestion from Jonathan Cameron.
> > - Added sanity check for the node id passed to the
> >    acpi-generic-initiator object.
> > - Added change to use GI as HMAT initiator as per Jonathan's suggestion.
> > - Fixed nits pointed out by Marcus and Jonathan.
> > - Collected Marcus' Acked-by.
> > - Rebased to v8.2.2.
> > 
> > v6 -> v7
> > - Updated code and the commit message to make acpi-generic-initiator
> >    define a 1:1 relationship between device and node based on
> >    Jonathan Cameron's suggestion.
> > - Updated commit message to include the decoded GI entry in the SRAT.
> > - Rebased to v8.2.1.
> > 
> > v5 -> v6
> > - Updated commit message for the [1/2] and the cover letter.
> > - Updated the acpi-generic-initiator object comment description for
> >    clarity on the input host-nodes.
> > - Rebased to v8.2.0-rc4.
> > 
> > v4 -> v5
> > - Removed acpi-dev option until full support.
> > - The NUMA nodes are saved as bitmap instead of uint16List.
> > - Replaced asserts with exit calls.
> > - Addressed other miscellaneous comments.
> > 
> > v3 -> v4
> > - changed the ':' delimited way to a uint16 array to communicate the
> > nodes associated with the device.
> > - added asserts to handle invalid inputs.
> > - addressed other miscellaneous v3 comments.
> > 
> > v2 -> v3
> > - changed param to accept a ':' delimited list of NUMA nodes, instead
> > of a range.
> > - Removed nvidia-acpi-generic-initiator object.
> > - Addressed miscellaneous comments in v2.
> > 
> > v1 -> v2
> > - Removed dependency on sysfs to communicate the feature with variant module.
> > - Use GI Affinity SRAT structure instead of Memory Affinity.
> > - No DSDT entries needed to communicate the PXM for the device. SRAT GI
> > structure is used instead.
> > - New objects introduced to establish link between device and nodes.
> > 
> > Ankit Agrawal (3):
> >    qom: new object to associate device to NUMA node
> >    hw/acpi: Implement the SRAT GI affinity structure
> >    hw/i386/acpi-build: Add support for SRAT Generic Initiator structures
> > 
> >   hw/acpi/acpi_generic_initiator.c         | 148 +++++++++++++++++++++++
> >   hw/acpi/hmat.c                           |   2 +-
> >   hw/acpi/meson.build                      |   1 +
> >   hw/arm/virt-acpi-build.c                 |   3 +
> >   hw/core/numa.c                           |   3 +-
> >   hw/i386/acpi-build.c                     |   3 +
> >   include/hw/acpi/acpi_generic_initiator.h |  47 +++++++
> >   include/sysemu/numa.h                    |   1 +
> >   qapi/qom.json                            |  17 +++
> >   9 files changed, 223 insertions(+), 2 deletions(-)
> >   create mode 100644 hw/acpi/acpi_generic_initiator.c
> >   create mode 100644 include/hw/acpi/acpi_generic_initiator.h
> > 



Thread overview: 7+ messages
2024-03-08 14:55 [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI ankita
2024-03-08 14:55 ` [PATCH v9 1/3] qom: new object to associate device to NUMA node ankita
2024-03-08 14:55 ` [PATCH v9 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
2024-03-08 14:55 ` [PATCH v9 3/3] hw/i386/acpi-build: Add support for SRAT Generic Initiator structures ankita
2024-03-11 15:19   ` Jonathan Cameron via
2024-03-11 10:39 ` [PATCH v9 0/3] acpi: report NUMA nodes for device memory using GI Cédric Le Goater
2024-03-11 15:45   ` Michael S. Tsirkin [this message]
