qemu-devel.nongnu.org archive mirror
* [PATCH v6 0/2] acpi: report numa nodes for device memory using GI
@ 2023-12-25  4:56 ankita
  2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: ankita @ 2023-12-25  4:56 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

Upcoming devices allow the CPU to access their memory cache coherently.
It is sensible to expose such memory to the OS as NUMA nodes separate from
the sysmem node. The ACPI spec provides a structure in the SRAT, the
Generic Initiator Affinity Structure [1], that associates a Proximity
Domain (PXM) with a Generic Initiator (GI) (e.g. heterogeneous processors
and accelerators, GPUs, and I/O devices with integrated compute or DMA
engines).

While a single node per device may cover several use cases, it is
insufficient to fully utilize the NVIDIA GPU MIG (Multi-Instance GPU) [2]
feature. The feature allows partitioning the GPU device resources
(including device memory) into several (up to 8) isolated instances. Each
memory partition requires a dedicated NUMA node to operate. The partitions
are not fixed; they can be created and deleted at runtime.

Linux does not provide a means to dynamically create or destroy NUMA nodes,
and implementing such a feature is expected to be non-trivial. The nodes
that the OS discovers at boot time while parsing the SRAT remain fixed. So
we use the GI Affinity Structures, which allow association between nodes
and devices. Multiple GI structures per device/BDF are possible, allowing
creation of multiple nodes in the VM by exposing a unique PXM in each of
these structures.

Implement the mechanism to build the GI Affinity Structures, as QEMU
currently does not. Introduce a new acpi-generic-initiator object that
associates a set of nodes with a device. During SRAT creation, all such
objects are identified and used to add the GI Affinity Structures.
Currently, only PCI devices are supported. On a multi-device system, each
device supporting the feature needs a unique acpi-generic-initiator object
with its own set of NUMA nodes associated with it.

The admin creates a range of 8 nodes and associates it with the device
using the acpi-generic-initiator object. While a configuration of fewer
than 8 nodes per device is allowed, it prevents full utilization of the
feature. This setting applies to all Grace+Hopper systems. The following
is an example of the QEMU command line arguments to create 8 nodes and
link them to the device 'dev0':

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
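
As a rough illustration of what host-nodes=2-9 amounts to, the sketch below
expands a range string into a node bitmap, one bit per NUMA node; each set
bit would then correspond to one GI Affinity Structure. This is a standalone
sketch, not the QEMU code: in the series the list is parsed by the QAPI
visitor, and SKETCH_MAX_NODES, NodeMask and the node_mask_* helpers are
hypothetical names chosen here (128 is assumed to mirror QEMU's MAX_NODES).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_NODES 128  /* assumption: mirrors QEMU's MAX_NODES */

/* Hypothetical bitmap type: bit N set means NUMA node N is in the set. */
typedef struct {
    uint64_t bits[SKETCH_MAX_NODES / 64];
} NodeMask;

/* Parse "A-B" or a single "A" into mask. Returns 0 on success, -1 on error. */
static int node_mask_from_range(const char *s, NodeMask *mask)
{
    char *end;
    unsigned long lo, hi;

    memset(mask, 0, sizeof(*mask));

    lo = strtoul(s, &end, 10);
    if (end == s) {
        return -1;                      /* no digits at all */
    }
    hi = lo;
    if (*end == '-') {
        const char *p = end + 1;
        hi = strtoul(p, &end, 10);
        if (end == p) {
            return -1;                  /* "A-" with no upper bound */
        }
    }
    if (*end != '\0' || lo > hi || hi >= SKETCH_MAX_NODES) {
        return -1;                      /* trailing junk or out of range */
    }
    for (unsigned long n = lo; n <= hi; n++) {
        mask->bits[n / 64] |= UINT64_C(1) << (n % 64);
    }
    return 0;
}

/* Return 1 if the given node is in the mask, 0 otherwise. */
static int node_mask_test(const NodeMask *mask, unsigned node)
{
    return node < SKETCH_MAX_NODES &&
           (((mask->bits[node / 64] >> (node % 64)) & 1) != 0);
}

/* Count the nodes in the mask, i.e. the number of GI structures implied. */
static unsigned node_mask_count(const NodeMask *mask)
{
    unsigned count = 0;

    for (unsigned n = 0; n < SKETCH_MAX_NODES; n++) {
        count += node_mask_test(mask, n);
    }
    return count;
}
```

With the range from the example above, node_mask_from_range("2-9", &m)
succeeds and node_mask_count(&m) reports 8 nodes, matching the 8 MIG
instances the cover letter targets.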

The performance benefits can be realized by providing appropriate NUMA node
distances (through libvirt tags or QEMU params). The admin can get the
distances among nodes on the hardware using `numactl -H`.

This series goes along with the vfio-pci variant driver [3] under review.

Applied over v8.2.0-rc4.

[1] ACPI Spec 6.3, Section 5.2.16.6
[2] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
[3] https://lore.kernel.org/all/20231212184613.3237-1-ankita@nvidia.com/

Link for v5:
https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/

v5 -> v6
- Updated commit message for the [1/2] and the cover letter.
- Updated the acpi-generic-initiator object comment description for
  clarity on the input host-nodes.
- Rebased to v8.2.0-rc4.

v4 -> v5
- Removed acpi-dev option until full support.
- The numa nodes are saved as bitmap instead of uint16List.
- Replaced asserts to exit calls.
- Addressed other miscellaneous comments.

v3 -> v4
- changed the ':' delimited way to a uint16 array to communicate the
nodes associated with the device.
- added asserts to handle invalid inputs.
- addressed other miscellaneous v3 comments.

v2 -> v3
- changed param to accept a ':' delimited list of numa nodes, instead
of a range.
- Removed nvidia-acpi-generic-initiator object.
- Addressed miscellaneous comments in v2.

v1 -> v2
- Removed dependency on sysfs to communicate the feature with variant module.
- Use GI Affinity SRAT structure instead of Memory Affinity.
- No DSDT entries needed to communicate the PXM for the device. SRAT GI
structure is used instead.
- New objects introduced to establish link between device and nodes.

Ankit Agrawal (2):
  qom: new object to associate device to numa node
  hw/acpi: Implement the SRAT GI affinity structure

 hw/acpi/acpi-generic-initiator.c         | 169 +++++++++++++++++++++++
 hw/acpi/meson.build                      |   1 +
 hw/arm/virt-acpi-build.c                 |   3 +
 include/hw/acpi/acpi-generic-initiator.h |  53 +++++++
 qapi/qom.json                            |  17 +++
 5 files changed, 243 insertions(+)
 create mode 100644 hw/acpi/acpi-generic-initiator.c
 create mode 100644 include/hw/acpi/acpi-generic-initiator.h

-- 
2.34.1




* [PATCH v6 1/2] qom: new object to associate device to numa node
  2023-12-25  4:56 [PATCH v6 0/2] acpi: report numa nodes for device memory using GI ankita
@ 2023-12-25  4:56 ` ankita
  2024-01-02 12:58   ` Jonathan Cameron via
  2024-01-08 12:09   ` Markus Armbruster
  2023-12-25  4:56 ` [PATCH v6 2/2] hw/acpi: Implement the SRAT GI affinity structure ankita
  2024-01-02 12:31 ` [PATCH v6 0/2] acpi: report numa nodes for device memory using GI Jonathan Cameron via
  2 siblings, 2 replies; 26+ messages in thread
From: ankita @ 2023-12-25  4:56 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
partitioning the GPU device resources (including device memory) into
several (up to 8) isolated instances. Each memory partition needs a
dedicated NUMA node to operate. The partitions are not fixed; they can be
created and deleted at runtime.

Unfortunately, Linux does not provide a means to dynamically create or
destroy NUMA nodes, and implementing such a feature is not expected to be
trivial. The nodes that the OS discovers at boot time while parsing the
SRAT remain fixed. So we use the Generic Initiator Affinity Structures,
which allow association between nodes and devices. Multiple GI structures
per BDF are possible, allowing creation of multiple nodes by exposing a
unique PXM in each of these structures.

Introduce a new acpi-generic-initiator object to allow the host admin to
provide the device and the corresponding NUMA nodes. QEMU maintains this
association and uses this object to build the requisite GI Affinity
Structures. On a multi-device system, each device supporting the feature
needs a unique acpi-generic-initiator object with its own set of NUMA
nodes associated with it.

An admin can provide the range of nodes through the uint16 array host-nodes
and link it to a device by providing the device id. Currently, only PCI
devices are supported. The following sample creates 8 nodes per PCI device
for a VM with 2 PCI devices and links them to the respective PCI device
using acpi-generic-initiator objects:

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
-numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
-numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \

-numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
-numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
-numa node,nodeid=16 -numa node,nodeid=17 \
-device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
-object acpi-generic-initiator,id=gi1,pci-dev=dev1,host-nodes=10-17 \
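
The two objects above use non-overlapping ranges (2-9 and 10-17). The sketch
below shows one way a configuration could be checked for that property. It
rests on an assumption: that "its own set of NUMA nodes" means the sets must
not overlap. The patch itself does not perform this check, and NodeSet,
node_set_add_range and node_sets_overlap are hypothetical names used only
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 128                  /* assumption: QEMU's MAX_NODES */
#define SKETCH_WORDS (SKETCH_MAX_NODES / 64)

/* Hypothetical per-object node set: bit N set means node N is claimed. */
typedef struct {
    uint64_t bits[SKETCH_WORDS];
} NodeSet;

/* Add the inclusive range [first, last]; nodes beyond the limit are ignored. */
static void node_set_add_range(NodeSet *s, unsigned first, unsigned last)
{
    for (unsigned n = first; n <= last && n < SKETCH_MAX_NODES; n++) {
        s->bits[n / 64] |= UINT64_C(1) << (n % 64);
    }
}

/* Return true if the two sets share at least one node. */
static bool node_sets_overlap(const NodeSet *a, const NodeSet *b)
{
    for (unsigned w = 0; w < SKETCH_WORDS; w++) {
        if (a->bits[w] & b->bits[w]) {
            return true;
        }
    }
    return false;
}
```

For the sample above, the sets for gi0 (2-9) and gi1 (10-17) do not overlap,
whereas a hypothetical third object claiming 9-12 would overlap with both.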

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 70 ++++++++++++++++++++++++
 hw/acpi/meson.build                      |  1 +
 include/hw/acpi/acpi-generic-initiator.h | 27 +++++++++
 qapi/qom.json                            | 17 ++++++
 4 files changed, 115 insertions(+)
 create mode 100644 hw/acpi/acpi-generic-initiator.c
 create mode 100644 include/hw/acpi/acpi-generic-initiator.h

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
new file mode 100644
index 0000000000..e05e28e962
--- /dev/null
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "qemu/osdep.h"
+#include "hw/acpi/acpi-generic-initiator.h"
+#include "hw/pci/pci_device.h"
+#include "qapi/error.h"
+#include "qapi/qapi-builtin-visit.h"
+#include "qapi/visitor.h"
+#include "qemu/error-report.h"
+
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
+                   ACPI_GENERIC_INITIATOR, OBJECT,
+                   { TYPE_USER_CREATABLE },
+                   { NULL })
+
+OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
+
+static void acpi_generic_initiator_init(Object *obj)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    bitmap_zero(gi->host_nodes, MAX_NODES);
+    gi->pci_dev = NULL;
+}
+
+static void acpi_generic_initiator_finalize(Object *obj)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+    g_free(gi->pci_dev);
+}
+
+static void acpi_generic_initiator_set_pci_device(Object *obj, const char *val,
+                                                  Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+    gi->pci_dev = g_strdup(val);
+}
+
+static void
+acpi_generic_initiator_set_host_nodes(Object *obj, Visitor *v, const char *name,
+                                      void *opaque, Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint16List *l = NULL, *host_nodes = NULL;
+
+    visit_type_uint16List(v, name, &host_nodes, errp);
+
+    for (l = host_nodes; l; l = l->next) {
+        if (l->value >= MAX_NODES) {
+            error_setg(errp, "Invalid host-nodes value: %d", l->value);
+            break;
+        } else {
+            bitmap_set(gi->host_nodes, l->value, 1);
+        }
+    }
+
+    qapi_free_uint16List(host_nodes);
+}
+
+static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+{
+    object_class_property_add_str(oc, "pci-dev", NULL,
+        acpi_generic_initiator_set_pci_device);
+    object_class_property_add(oc, "host-nodes", "int", NULL,
+        acpi_generic_initiator_set_host_nodes, NULL, NULL);
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index fc1b952379..2268589519 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -1,5 +1,6 @@
 acpi_ss = ss.source_set()
 acpi_ss.add(files(
+  'acpi-generic-initiator.c',
   'acpi_interface.c',
   'aml-build.c',
   'bios-linker-loader.c',
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
new file mode 100644
index 0000000000..9643b81951
--- /dev/null
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -0,0 +1,27 @@
+#ifndef ACPI_GENERIC_INITIATOR_H
+#define ACPI_GENERIC_INITIATOR_H
+
+#include "hw/mem/pc-dimm.h"
+#include "hw/acpi/bios-linker-loader.h"
+#include "hw/acpi/aml-build.h"
+#include "sysemu/numa.h"
+#include "qemu/uuid.h"
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+
+#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+
+typedef struct AcpiGenericInitiator {
+    /* private */
+    Object parent;
+
+    /* public */
+    char *pci_dev;
+    DECLARE_BITMAP(host_nodes, MAX_NODES);
+} AcpiGenericInitiator;
+
+typedef struct AcpiGenericInitiatorClass {
+        ObjectClass parent_class;
+} AcpiGenericInitiatorClass;
+
+#endif
diff --git a/qapi/qom.json b/qapi/qom.json
index c53ef978ff..7b33d4a53c 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -794,6 +794,21 @@
 { 'struct': 'VfioUserServerProperties',
   'data': { 'socket': 'SocketAddress', 'device': 'str' } }
 
+##
+# @AcpiGenericInitiatorProperties:
+#
+# Properties for acpi-generic-initiator objects.
+#
+# @pci-dev: PCI device ID to be associated with the node
+#
+# @host-nodes: numa node list associated with the PCI device.
+#
+# Since: 9.0
+##
+{ 'struct': 'AcpiGenericInitiatorProperties',
+  'data': { 'pci-dev': 'str',
+            'host-nodes': ['uint16'] } }
+
 ##
 # @RngProperties:
 #
@@ -911,6 +926,7 @@
 ##
 { 'enum': 'ObjectType',
   'data': [
+    'acpi-generic-initiator',
     'authz-list',
     'authz-listfile',
     'authz-pam',
@@ -981,6 +997,7 @@
             'id': 'str' },
   'discriminator': 'qom-type',
   'data': {
+      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
       'authz-list':                 'AuthZListProperties',
       'authz-listfile':             'AuthZListFileProperties',
       'authz-pam':                  'AuthZPAMProperties',
-- 
2.34.1




* [PATCH v6 2/2] hw/acpi: Implement the SRAT GI affinity structure
  2023-12-25  4:56 [PATCH v6 0/2] acpi: report numa nodes for device memory using GI ankita
  2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
@ 2023-12-25  4:56 ` ankita
  2024-01-02 12:31 ` [PATCH v6 0/2] acpi: report numa nodes for device memory using GI Jonathan Cameron via
  2 siblings, 0 replies; 26+ messages in thread
From: ankita @ 2023-12-25  4:56 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

The ACPI spec provides a scheme to associate "Generic Initiators" [1]
(e.g. heterogeneous processors and accelerators, GPUs, and I/O devices with
integrated compute or DMA engines) with Proximity Domains. This is
achieved using the Generic Initiator Affinity Structure in the SRAT. During
boot, the Linux kernel parses the ACPI SRAT to determine the PXM IDs and
creates a NUMA node for each unique PXM ID encountered. QEMU currently does
not implement these structures while building the SRAT.

Add GI structures while building the VM ACPI SRAT. The association between
devices and nodes is stored in acpi-generic-initiator objects. Look up all
such objects and use them to build these structures.

The structure needs a PCI device handle [2] that consists of the device BDF.
The vfio-pci device corresponding to the acpi-generic-initiator object is
located to determine the BDF.
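
To make the layout concrete, here is a standalone sketch of packing a BDF
and serializing one 32-byte Generic Initiator Affinity Structure with a PCI
device handle. It follows the field order used by the patch below (Type 5,
Length 32, Device Handle Type 1, PXM, 16-byte device handle, Flags) and
little-endian encoding, but it writes into a plain byte buffer instead of
using QEMU's build_append_int_noprefix(). The names pci_bdf,
gi_affinity_encode and GI_* are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GI_STRUCT_LEN    32          /* total structure length in bytes */
#define GI_FLAG_ENABLED  (1u << 0)   /* Flags bit 0: structure is enabled */

/* Pack bus/slot/function into the 16-bit BDF used by the PCI device handle. */
static uint16_t pci_bdf(uint8_t bus, uint8_t slot, uint8_t func)
{
    return (uint16_t)((bus << 8) | ((slot & 0x1f) << 3) | (func & 0x07));
}

static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    put_le16(p, (uint16_t)(v & 0xffff));
    put_le16(p + 2, (uint16_t)(v >> 16));
}

/* Serialize one GI Affinity Structure into out; returns the bytes written. */
static size_t gi_affinity_encode(uint8_t out[GI_STRUCT_LEN], uint32_t pxm,
                                 uint16_t segment, uint16_t bdf)
{
    memset(out, 0, GI_STRUCT_LEN);     /* reserved fields stay zero */
    out[0] = 5;                        /* Type: Generic Initiator */
    out[1] = GI_STRUCT_LEN;            /* Length */
    out[3] = 1;                        /* Device Handle Type: PCI */
    put_le32(out + 4, pxm);            /* Proximity Domain */
    put_le16(out + 8, segment);        /* Device handle: PCI segment */
    put_le16(out + 10, bdf);           /* Device handle: BDF */
    /* bytes 12..23: remainder of the 16-byte device handle, reserved */
    put_le32(out + 24, GI_FLAG_ENABLED);
    /* bytes 28..31: reserved */
    return GI_STRUCT_LEN;
}
```

Assuming a device at bus 0, slot 4, function 0 (as addr=04.0 on the root
bus would give), pci_bdf(0, 4, 0) yields 0x20, and one such structure would
be emitted per node in host-nodes, differing only in the Proximity Domain.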

[1] ACPI Spec 6.3, Section 5.2.16.6
[2] ACPI Spec 6.3, Table 5.80

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 99 ++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c                 |  3 +
 include/hw/acpi/acpi-generic-initiator.h | 26 +++++++
 3 files changed, 128 insertions(+)

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
index e05e28e962..fa5235f2bb 100644
--- a/hw/acpi/acpi-generic-initiator.c
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -68,3 +68,102 @@ static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
     object_class_property_add(oc, "host-nodes", "int", NULL,
         acpi_generic_initiator_set_host_nodes, NULL, NULL);
 }
+
+static int acpi_generic_initiator_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_ACPI_GENERIC_INITIATOR)) {
+        *list = g_slist_append(*list, ACPI_GENERIC_INITIATOR(obj));
+    }
+
+    object_child_foreach(obj, acpi_generic_initiator_list, opaque);
+    return 0;
+}
+
+/*
+ * Identify Generic Initiator objects and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *acpi_generic_initiator_get_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(object_get_root(),
+                         acpi_generic_initiator_list, &list);
+    return list;
+}
+
+/*
+ * ACPI 6.3:
+ * Table 5-78 Generic Initiator Affinity Structure
+ */
+static void
+build_srat_generic_pci_initiator_affinity(GArray *table_data, int node,
+                                          PCIDeviceHandle *handle)
+{
+    uint8_t index;
+
+    build_append_int_noprefix(table_data, 5, 1);  /* Type */
+    build_append_int_noprefix(table_data, 32, 1); /* Length */
+    build_append_int_noprefix(table_data, 0, 1);  /* Reserved */
+    build_append_int_noprefix(table_data, 1, 1);  /* Device Handle Type: PCI */
+    build_append_int_noprefix(table_data, node, 4);  /* Proximity Domain */
+
+    /* Device Handle - PCI */
+    build_append_int_noprefix(table_data, handle->segment, 2);
+    build_append_int_noprefix(table_data, handle->bdf, 2);
+    for (index = 0; index < 12; index++) {
+        build_append_int_noprefix(table_data, 0, 1);
+    }
+
+    build_append_int_noprefix(table_data, GEN_AFFINITY_ENABLED, 4); /* Flags */
+    build_append_int_noprefix(table_data, 0, 4);     /* Reserved */
+}
+
+void build_srat_generic_pci_initiator(GArray *table_data)
+{
+    GSList *gi_list, *list = acpi_generic_initiator_get_list();
+    AcpiGenericInitiator *gi;
+
+    for (gi_list = list; gi_list; gi_list = gi_list->next) {
+        Object *o;
+        uint16_t node;
+        PCIDevice *pci_dev;
+        bool node_specified = false;
+
+        gi = gi_list->data;
+
+        o = object_resolve_path_type(gi->pci_dev, TYPE_PCI_DEVICE, NULL);
+        if (!o) {
+            error_printf("Specified device must be a PCI device.\n");
+            exit(1);
+        }
+        pci_dev = PCI_DEVICE(o);
+
+        for (node = 0; (node = find_next_bit(gi->host_nodes,
+                             MAX_NODES, node)) != MAX_NODES; node++)
+        {
+            PCIDeviceHandle dev_handle;
+            dev_handle.segment = 0;
+            dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)),
+                                                       pci_dev->devfn);
+            build_srat_generic_pci_initiator_affinity(table_data,
+                                                      node, &dev_handle);
+            node_specified = true;
+        }
+
+        if (!node_specified) {
+            error_report("Generic Initiator device 0:%x:%x.%x has no associated"
+                         " NUMA node.", pci_bus_num(pci_get_bus(pci_dev)),
+                         PCI_SLOT(pci_dev->devfn), PCI_FUNC(pci_dev->devfn));
+            error_printf("Specify NUMA node with -host-nodes option.\n");
+            exit(1);
+        }
+    }
+
+    g_slist_free(list);
+}
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 8bc35a483c..00d77327e0 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -58,6 +58,7 @@
 #include "migration/vmstate.h"
 #include "hw/acpi/ghes.h"
 #include "hw/acpi/viot.h"
+#include "hw/acpi/acpi-generic-initiator.h"
 
 #define ARM_SPI_BASE 32
 
@@ -558,6 +559,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         }
     }
 
+    build_srat_generic_pci_initiator(table_data);
+
     if (ms->nvdimms_state->is_enabled) {
         nvdimm_build_srat(table_data);
     }
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
index 9643b81951..76efd5d3f0 100644
--- a/include/hw/acpi/acpi-generic-initiator.h
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -24,4 +24,30 @@ typedef struct AcpiGenericInitiatorClass {
         ObjectClass parent_class;
 } AcpiGenericInitiatorClass;
 
+/*
+ * ACPI 6.3:
+ * Table 5-81 Flags – Generic Initiator Affinity Structure
+ */
+typedef enum {
+    GEN_AFFINITY_ENABLED = (1 << 0), /*
+                                      * If clear, the OSPM ignores the contents
+                                      * of the Generic Initiator/Port Affinity
+                                      * Structure. This allows system firmware
+                                      * to populate the SRAT with a static
+                                      * number of structures, but only enable
+                                      * them as necessary.
+                                      */
+} GenericAffinityFlags;
+
+/*
+ * ACPI 6.3:
+ * Table 5-80 Device Handle - PCI
+ */
+typedef struct PCIDeviceHandle {
+    uint16_t segment;
+    uint16_t bdf;
+} PCIDeviceHandle;
+
+void build_srat_generic_pci_initiator(GArray *table_data);
+
 #endif
-- 
2.34.1




* Re: [PATCH v6 0/2] acpi: report numa nodes for device memory using GI
  2023-12-25  4:56 [PATCH v6 0/2] acpi: report numa nodes for device memory using GI ankita
  2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
  2023-12-25  4:56 ` [PATCH v6 2/2] hw/acpi: Implement the SRAT GI affinity structure ankita
@ 2024-01-02 12:31 ` Jonathan Cameron via
  2024-01-04  3:05   ` Ankit Agrawal
  2 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron via @ 2024-01-02 12:31 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Mon, 25 Dec 2023 10:26:01 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Upcoming devices allow the CPU to access their memory cache coherently.
> It is sensible to expose such memory to the OS as NUMA nodes separate from
> the sysmem node. The ACPI spec provides a structure in the SRAT, the
> Generic Initiator Affinity Structure [1], that associates a Proximity
> Domain (PXM) with a Generic Initiator (GI) (e.g. heterogeneous processors
> and accelerators, GPUs, and I/O devices with integrated compute or DMA
> engines).
> 
> While a single node per device may cover several use cases, it is
> insufficient to fully utilize the NVIDIA GPU MIG (Multi-Instance GPU) [2]
> feature. The feature allows partitioning the GPU device resources
> (including device memory) into several (up to 8) isolated instances. Each
> memory partition requires a dedicated NUMA node to operate. The partitions
> are not fixed; they can be created and deleted at runtime.
> 
> Linux does not provide a means to dynamically create or destroy NUMA nodes,
> and implementing such a feature is expected to be non-trivial. The nodes
> that the OS discovers at boot time while parsing the SRAT remain fixed. So
> we use the GI Affinity Structures, which allow association between nodes
> and devices. Multiple GI structures per device/BDF are possible, allowing
> creation of multiple nodes in the VM by exposing a unique PXM in each of
> these structures.
> 
> Implement the mechanism to build the GI Affinity Structures, as QEMU
> currently does not. Introduce a new acpi-generic-initiator object that
> associates a set of nodes with a device. During SRAT creation, all such
> objects are identified and used to add the GI Affinity Structures.
> Currently, only PCI devices are supported. On a multi-device system, each
> device supporting the feature needs a unique acpi-generic-initiator object
> with its own set of NUMA nodes associated with it.
> 
> The admin creates a range of 8 nodes and associates it with the device
> using the acpi-generic-initiator object. While a configuration of fewer
> than 8 nodes per device is allowed, it prevents full utilization of the
> feature. This setting applies to all Grace+Hopper systems. The following
> is an example of the QEMU command line arguments to create 8 nodes and
> link them to the device 'dev0':
> 
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
> 

I'd find it helpful to see the resulting chunk of SRAT for these examples
(disassembled) in this cover letter and the patches (where there are more examples).




* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
@ 2024-01-02 12:58   ` Jonathan Cameron via
  2024-01-04  3:36     ` Ankit Agrawal
  2024-01-08 12:09   ` Markus Armbruster
  1 sibling, 1 reply; 26+ messages in thread
From: Jonathan Cameron via @ 2024-01-02 12:58 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Mon, 25 Dec 2023 10:26:02 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
> partitioning the GPU device resources (including device memory) into
> several (up to 8) isolated instances. Each memory partition needs a
> dedicated NUMA node to operate. The partitions are not fixed; they can be
> created and deleted at runtime.
> 
> Unfortunately, Linux does not provide a means to dynamically create or
> destroy NUMA nodes, and implementing such a feature is not expected to be
> trivial. The nodes that the OS discovers at boot time while parsing the
> SRAT remain fixed. So we use the Generic Initiator Affinity Structures,
> which allow association between nodes and devices. Multiple GI structures
> per BDF are possible, allowing creation of multiple nodes by exposing a
> unique PXM in each of these structures.
> 
> Introduce a new acpi-generic-initiator object to allow the host admin to
> provide the device and the corresponding NUMA nodes. QEMU maintains this
> association and uses this object to build the requisite GI Affinity
> Structures. On a multi-device system, each device supporting the feature
> needs a unique acpi-generic-initiator object with its own set of NUMA
> nodes associated with it.
> 
> An admin can provide the range of nodes through the uint16 array host-nodes
> and link it to a device by providing the device id. Currently, only PCI
> devices are supported. The following sample creates 8 nodes per PCI device
> for a VM with 2 PCI devices and links them to the respective PCI device
> using acpi-generic-initiator objects:
> 
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
> 
> -numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
> -numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
> -numa node,nodeid=16 -numa node,nodeid=17 \
> -device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
> -object acpi-generic-initiator,id=gi1,pci-dev=dev1,host-nodes=10-17 \

Hi Ankit,

Whilst I'm still not particularly keen on this use of GI nodes, the
infrastructure is now generic enough that it covers more normal use cases
so I'm almost fine with it going into QEMU. If you want to use it for unusual
things that's up to you ;)  Note that the following is about QEMU allowing
you to potentially shoot yourself in the foot rather than necessarily saying
the interface shouldn't allow a PCI dev to map to multiple GI nodes.

As per reply to the cover letter I definitely want to see SRAT table dumps
in here though so we can easily see what this is actually building.

I worry that some OS might make the assumption that it's one GI node
per PCI device though. The language in the ACPI specification is:

"The Generic Initiator Affinity Structure provides the association between _a_
generic initiator and _the_ proximity domain to which the initiator belongs".

The use of _a_ and _the_ in there makes it pretty explicitly an N:1 relationship
(multiple devices can be in the same proximity domain, but a device may only be in one).
To avoid that confusion you will need an ACPI spec change.  I'd be happy to
support 

The reason you can get away with this in Linux today is that I only implemented
very minimal support for GIs, with the mappings being provided the other way
around (_PXM in a PCIe node in DSDT).  If we finish that support off, I'd assume
the multiple mappings here will result in a firmware bug warning in at least
some cases.  Note the reason support for the mapping the other way isn't yet
in Linux is that we never resolved the mess that a PCI re-enumeration will
cause (it requires a pre-enumeration pass over what is configured by firmware
and caching of the path to all the PCIe devices, so that we can reconstruct
the mapping post enumeration).
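
To illustrate the kind of check an OS could apply under the N:1 reading of
the spec, the sketch below scans a list of GI entries and reports whether
any PCI device handle appears with more than one proximity domain. This is
purely hypothetical: it is not Linux kernel code, and GiEntry and
gi_has_multi_domain_device are names invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded view of one GI Affinity Structure (PCI handle). */
typedef struct {
    uint16_t segment;   /* PCI segment */
    uint16_t bdf;       /* bus/device/function */
    uint32_t pxm;       /* proximity domain */
} GiEntry;

/*
 * Return true if some device (segment + BDF) is listed with two different
 * proximity domains, i.e. the table is not an N:1 device-to-domain mapping.
 * Quadratic scan; fine for the handful of entries an SRAT would carry.
 */
static bool gi_has_multi_domain_device(const GiEntry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (entries[i].segment == entries[j].segment &&
                entries[i].bdf == entries[j].bdf &&
                entries[i].pxm != entries[j].pxm) {
                return true;
            }
        }
    }
    return false;
}
```

Two different devices sharing one domain pass this check, while the
configuration in this patch (one BDF listed under several PXMs) would be
flagged.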

Also, this effectively creates a bunch of separate generic initiator nodes
and lumping that under one object seems to imply they are in general connected
to each other.

I'd be happier with a separate instance per GI node

  -object acpi-generic-initiator,id=gi1,pci-dev=dev1,nodeid=10
  -object acpi-generic-initiator,id=gi2,pci-dev=dev1,nodeid=11
etc with the proviso that anyone using this on a system that assumes a one
to one mapping for PCI

However, I'll leave it up to those more familiar with the QEMU numa
control interface design to comment on whether this approach is preferable
to making the gi part of the numa node entry or doing it like hmat.

-numa srat-gi,node-id=10,gi-pci-dev=dev1

etc

> 
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 70 ++++++++++++++++++++++++
>  hw/acpi/meson.build                      |  1 +
>  include/hw/acpi/acpi-generic-initiator.h | 27 +++++++++
>  qapi/qom.json                            | 17 ++++++
>  4 files changed, 115 insertions(+)
>  create mode 100644 hw/acpi/acpi-generic-initiator.c
>  create mode 100644 include/hw/acpi/acpi-generic-initiator.h
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> new file mode 100644
> index 0000000000..e05e28e962
> --- /dev/null
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -0,0 +1,70 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/acpi/acpi-generic-initiator.h"
> +#include "hw/pci/pci_device.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-builtin-visit.h"
> +#include "qapi/visitor.h"
> +#include "qemu/error-report.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
> +                   ACPI_GENERIC_INITIATOR, OBJECT,
> +                   { TYPE_USER_CREATABLE },
> +                   { NULL })
> +
> +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
> +
> +static void acpi_generic_initiator_init(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    bitmap_zero(gi->host_nodes, MAX_NODES);
> +    gi->pci_dev = NULL;
> +}
> +
> +static void acpi_generic_initiator_finalize(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    g_free(gi->pci_dev);
> +}
> +
> +static void acpi_generic_initiator_set_pci_device(Object *obj, const char *val,
> +                                                  Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    gi->pci_dev = g_strdup(val);
> +}
> +
> +static void
> +acpi_generic_initiator_set_host_nodes(Object *obj, Visitor *v, const char *name,
> +                                      void *opaque, Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint16List *l = NULL, *host_nodes = NULL;
> +
> +    visit_type_uint16List(v, name, &host_nodes, errp);
> +
> +    for (l = host_nodes; l; l = l->next) {
> +        if (l->value >= MAX_NODES) {
> +            error_setg(errp, "Invalid host-nodes value: %d", l->value);
> +            break;
> +        } else {
> +            bitmap_set(gi->host_nodes, l->value, 1);
> +        }
> +    }
> +
> +    qapi_free_uint16List(host_nodes);
> +}
> +
> +static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add_str(oc, "pci-dev", NULL,
> +        acpi_generic_initiator_set_pci_device);
> +    object_class_property_add(oc, "host-nodes", "int", NULL,
> +        acpi_generic_initiator_set_host_nodes, NULL, NULL);
> +}
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index fc1b952379..2268589519 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -1,5 +1,6 @@
>  acpi_ss = ss.source_set()
>  acpi_ss.add(files(
> +  'acpi-generic-initiator.c',
>    'acpi_interface.c',
>    'aml-build.c',
>    'bios-linker-loader.c',
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> new file mode 100644
> index 0000000000..9643b81951
> --- /dev/null
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -0,0 +1,27 @@
> +#ifndef ACPI_GENERIC_INITIATOR_H
> +#define ACPI_GENERIC_INITIATOR_H
> +
> +#include "hw/mem/pc-dimm.h"
> +#include "hw/acpi/bios-linker-loader.h"
> +#include "hw/acpi/aml-build.h"
> +#include "sysemu/numa.h"
> +#include "qemu/uuid.h"
> +#include "qom/object.h"
> +#include "qom/object_interfaces.h"
> +
> +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +
> +typedef struct AcpiGenericInitiator {
> +    /* private */
> +    Object parent;
> +
> +    /* public */
> +    char *pci_dev;
> +    DECLARE_BITMAP(host_nodes, MAX_NODES);
> +} AcpiGenericInitiator;
> +
> +typedef struct AcpiGenericInitiatorClass {
> +        ObjectClass parent_class;
> +} AcpiGenericInitiatorClass;
> +
> +#endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index c53ef978ff..7b33d4a53c 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -794,6 +794,21 @@
>  { 'struct': 'VfioUserServerProperties',
>    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>  
> +##
> +# @AcpiGenericInitiatorProperties:
> +#
> +# Properties for acpi-generic-initiator objects.
> +#
> +# @pci-dev: PCI device ID to be associated with the node
> +#
> +# @host-nodes: numa node list associated with the PCI device.
> +#
> +# Since: 9.0
> +##
> +{ 'struct': 'AcpiGenericInitiatorProperties',
> +  'data': { 'pci-dev': 'str',
> +            'host-nodes': ['uint16'] } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -911,6 +926,7 @@
>  ##
>  { 'enum': 'ObjectType',
>    'data': [
> +    'acpi-generic-initiator',
>      'authz-list',
>      'authz-listfile',
>      'authz-pam',
> @@ -981,6 +997,7 @@
>              'id': 'str' },
>    'discriminator': 'qom-type',
>    'data': {
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
>        'authz-list':                 'AuthZListProperties',
>        'authz-listfile':             'AuthZListFileProperties',
>        'authz-pam':                  'AuthZPAMProperties',




* Re: [PATCH v6 0/2] acpi: report numa nodes for device memory using GI
  2024-01-02 12:31 ` [PATCH v6 0/2] acpi: report numa nodes for device memory using GI Jonathan Cameron via
@ 2024-01-04  3:05   ` Ankit Agrawal
  2024-02-12 16:05     ` Michael S. Tsirkin
  0 siblings, 1 reply; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-04  3:05 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org


>>
>> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
>> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
>> -numa node,nodeid=8 -numa node,nodeid=9 \
>> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
>>
>
> I'd find it helpful to see the resulting chunk of SRAT for these examples
> (disassembled) in this cover letter and the patches (where there are more examples).

Ack. I'll document the resulting SRAT table as well.
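
For anyone who wants to produce such a dump themselves, one possible approach
from inside a Linux guest is sketched below. It assumes the guest exposes its
ACPI tables under /sys/firmware/acpi/tables and that the ACPICA iasl tool is
installed; the helper name save_srat and the output directory are illustrative
only and are not part of this series.

```shell
# Hypothetical helper: copy the raw SRAT out of sysfs so it can be disassembled.
# usage: save_srat [tables dir] [output dir]
save_srat() {
    tables=${1:-/sys/firmware/acpi/tables}
    outdir=${2:-.}
    if [ ! -r "$tables/SRAT" ]; then
        echo "no readable SRAT in $tables (root is usually required)" >&2
        return 1
    fi
    cat "$tables/SRAT" > "$outdir/srat.dat"
}

# Typical use inside the guest (needs root and iasl, so left commented out):
#   save_srat && iasl -d srat.dat    # iasl writes the disassembly to srat.dsl
```

The Generic Initiator Affinity Structures, one per PXM, should then be visible
in the disassembled output.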


* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-02 12:58   ` Jonathan Cameron via
@ 2024-01-04  3:36     ` Ankit Agrawal
  2024-01-04 12:33       ` Ankit Agrawal
                         ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-04  3:36 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

Thanks Jonathan for the review.

> As per reply to the cover letter I definitely want to see SRAT table dumps
> in here though so we can easily see what this is actually building.

Ack.

> I worry that some OS might make the assumption that it's one GI node
> per PCI device though. The language in the ACPI specification is:
> 
> "The Generic Initiator Affinity Structure provides the association between _a_
> generic initiator and _the_ proximity domain to which the initiator belongs".
> 
> The use of _a_ and _the_ in there makes it pretty explicitly a N:1 relationship
> (multiple devices can be in same proximity domain, but a device may only be in one).
> To avoid that confusion you will need an ACPI spec change.  I'd be happy to
> support

Yeah, that's a good point. It won't hurt to change the spec so that it
explicitly allows associating a device with multiple domains.

> The reason you can get away with this in Linux today is that I only implemented
> a very minimal support for GIs with the mappings being provided the other way
> around (_PXM in a PCIe node in DSDT).  If we finish that support off I'd assume

Not sure if I understand this. Can you provide a reference to this DSDT related
change?

> Also, this effectively creates a bunch of separate generic initiator nodes
> and lumping that under one object seems to imply they are in general connected
> to each other.
> 
> I'd be happier with a separate instance per GI node
> 
>  -object acpi-generic-initiator,id=gi1,pci-dev=dev1,nodeid=10
>  -object acpi-generic-initiator,id=gi2,pci-dev=dev1,nodeid=11
> etc with the proviso that anyone using this on a system that assumes a one
> to one mapping for PCI
>
> However, I'll leave it up to those more familiar with the QEMU numa
> control interface design to comment on whether this approach is preferable
> to making the gi part of the numa node entry or doing it like hmat.

> -numa srat-gi,node-id=10,gi-pci-dev=dev1

The current way of acpi-generic-initiator object usage came out of the discussion
on v1 to essentially link all the device NUMA nodes to the device.
(https://lore.kernel.org/all/20230926131427.1e441670.alex.williamson@redhat.com/)

Can Alex or David comment on which is preferable (the current mechanism vs 1:1
mapping per object as suggested by Jonathan)?



* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04  3:36     ` Ankit Agrawal
@ 2024-01-04 12:33       ` Ankit Agrawal
  2024-01-04 16:40       ` Ankit Agrawal
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-04 12:33 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

>> However, I'll leave it up to those more familiar with the QEMU numa
>> control interface design to comment on whether this approach is preferable
>> to making the gi part of the numa node entry or doing it like hmat.
>> -numa srat-gi,node-id=10,gi-pci-dev=dev1
>
> The current way of acpi-generic-initiator object usage came out of the discussion
> on v1 to essentially link all the device NUMA nodes to the device.
> (https://lore.kernel.org/all/20230926131427.1e441670.alex.williamson@redhat.com/)

> Can Alex or David comment on which is preferable (the current mechanism vs 1:1
> mapping per object as suggested by Jonathan)?

Just to add, IMO a single QEMU object that ties the nodes to the device is
better, as the nodes are a kind of pool. Having several objects may be overkill?

Plus, this is a QEMU object; eventually we populate one SRAT GI structure to
expose the PXM-to-device link.


* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04  3:36     ` Ankit Agrawal
  2024-01-04 12:33       ` Ankit Agrawal
@ 2024-01-04 16:40       ` Ankit Agrawal
  2024-01-04 17:39         ` Alex Williamson
  2024-01-04 17:23       ` Alex Williamson
  2024-01-09 16:38       ` Jonathan Cameron via
  3 siblings, 1 reply; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-04 16:40 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

Had a discussion with RH folks, summary follows:

1. To align with the current spec description pointed out by Jonathan, we first
   use a separate object instance per GI node, as he suggested, i.e. an
   acpi-generic-initiator object would link only one node to the device. To
   associate a set of nodes, that many object instances should be created.
2. In parallel, we work to get the spec updated. After the update, we switch
   back to the current implementation, which links a PCI device with a set of
   NUMA nodes.

Alex/Jonathan, does this sound fine?
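
As a rough sketch of what step 1 means on the command line, a small shell
helper could expand a node range into one object per node. The nodeid=
property name follows Jonathan's suggestion earlier in the thread and is an
assumption, as are the helper name gi_objects and the gi<N> id scheme; none of
this is a settled interface.

```shell
# Hypothetical helper: print one acpi-generic-initiator object per NUMA node.
# usage: gi_objects <pci-dev id> <first node> <last node>
# The "nodeid=" property name is an assumption taken from the review
# suggestion; it is not a confirmed QEMU option.
gi_objects() {
    dev=$1
    node=$2
    last=$3
    while [ "$node" -le "$last" ]; do
        printf '%s\n' "-object acpi-generic-initiator,id=gi${node},pci-dev=${dev},nodeid=${node}"
        node=$((node + 1))
    done
}

# Example: nodes 2-9 for dev0, one object each (8 lines of output).
gi_objects dev0 2 9
```

Deriving the object id from the node number keeps the ids unique across
devices as long as no node is linked to more than one device.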


* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04  3:36     ` Ankit Agrawal
  2024-01-04 12:33       ` Ankit Agrawal
  2024-01-04 16:40       ` Ankit Agrawal
@ 2024-01-04 17:23       ` Alex Williamson
  2024-01-09  4:21         ` Ankit Agrawal
  2024-01-09 16:38       ` Jonathan Cameron via
  3 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2024-01-04 17:23 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jonathan Cameron, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Thu, 4 Jan 2024 03:36:06 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> Thanks Jonathan for the review.
> 
> > As per reply to the cover letter I definitely want to see SRAT table dumps
> > in here though so we can easily see what this is actually building.  
> 
> Ack.
> 
> > I worry that some OS might make the assumption that it's one GI node
> > per PCI device though. The language in the ACPI specification is:
> > 
> > "The Generic Initiator Affinity Structure provides the association between _a_
> > generic initiator and _the_ proximity domain to which the initiator belongs".
> > 
> > The use of _a_ and _the_ in there makes it pretty explicitly a N:1 relationship
> > (multiple devices can be in same proximity domain, but a device may only be in one).
> > To avoid that confusion you will need an ACPI spec change.  I'd be happy to
> > support  
> 
> Yeah, that's a good point. It won't hurt to make the spec change to make the
> possibility of the association between a device with multiple domains.
> 
> > The reason you can get away with this in Linux today is that I only implemented
> > a very minimal support for GIs with the mappings being provided the other way
> > around (_PXM in a PCIe node in DSDT).  If we finish that support off I'd assume  
> 
> Not sure if I understand this. Can you provide a reference to this DSDT related
> change?
> 
> > Also, this effectively creates a bunch of separate generic initiator nodes
> > and lumping that under one object seems to imply they are in general connected
> > to each other.
> > 
> > I'd be happier with a separate instance per GI node
> > 
> >  -object acpi-generic-initiator,id=gi1,pci-dev=dev1,nodeid=10
> >  -object acpi-generic-initiator,id=gi2,pci-dev=dev1,nodeid=11
> > etc with the proviso that anyone using this on a system that assumes a one
> > to one mapping for PCI
> >
> > However, I'll leave it up to those more familiar with the QEMU numa
> > control interface design to comment on whether this approach is preferable
> > to making the gi part of the numa node entry or doing it like hmat.  
> 
> > -numa srat-gi,node-id=10,gi-pci-dev=dev1  
> 
> The current way of acpi-generic-initiator object usage came out of the discussion
> on v1 to essentially link all the device NUMA nodes to the device.
> (https://lore.kernel.org/all/20230926131427.1e441670.alex.williamson@redhat.com/)
> 
> Can Alex or David comment on which is preferable (the current mechanism vs 1:1
> mapping per object as suggested by Jonathan)?

I imagine there are ways that either could work, but specifying a
gi-pci-dev in the numa node declaration appears to get a bit messy if we
have multiple gi-pci-dev devices to associate to the node whereas
creating an acpi-generic-initiator object per individual device:node
relationship feels a bit easier to iterate.

Also if we do extend the ACPI spec to more explicitly allow a device to
associate to multiple nodes, we could re-instate the list behavior of
the acpi-generic-initiator whereas I don't see a representation of the
association at the numa object that makes sense.  Thanks,

Alex




* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04 16:40       ` Ankit Agrawal
@ 2024-01-04 17:39         ` Alex Williamson
  2024-01-09 16:52           ` Jonathan Cameron via
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2024-01-04 17:39 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jonathan Cameron, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Thu, 4 Jan 2024 16:40:39 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> Had a discussion with RH folks, summary follows:
> 
> 1. To align with the current spec description pointed by Jonathan, we first do
>      a separate object instance per GI node as suggested by Jonathan. i.e.
>      a acpi-generic-initiator would only link one node to the device. To 
>      associate a set of nodes, those number of object instances should be
>      created.
> 2. In parallel, we work to get the spec updated. After the update, we switch
>     to the current implementation to link a PCI device with a set of NUMA
>     nodes.
> 
> Alex/Jonathan, does this sound fine?
> 

Yes, as I understand Jonathan's comments, the acpi-generic-initiator
object should currently define a single device:node relationship to
match the ACPI definition.  Separately a clarification of the spec
could be pursued that could allow us to reinstate a node list option
for the acpi-generic-initiator object.  In the interim, a user can
define multiple 1:1 objects to create the 1:N relationship that's
ultimately required here.  Thanks,

Alex




* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
  2024-01-02 12:58   ` Jonathan Cameron via
@ 2024-01-08 12:09   ` Markus Armbruster
  2024-01-09  4:11     ` Ankit Agrawal
  1 sibling, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2024-01-08 12:09 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, david, gshan,
	Jonathan.Cameron, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, dnigam, udhoke, qemu-arm, qemu-devel

<ankita@nvidia.com> writes:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
> partitioning of the GPU device resources (including device memory) into
> several (upto 8) isolated instances. Each of the partitioned memory needs
> a dedicated NUMA node to operate. The partitions are not fixed and they
> can be created/deleted at runtime.
>
> Unfortunately Linux OS does not provide a means to dynamically create/destroy
> NUMA nodes and such feature implementation is not expected to be trivial. The
> nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
> we utilize the Generic Initiator Affinity structures that allows association
> between nodes and devices. Multiple GI structures per BDF is possible,
> allowing creation of multiple nodes by exposing unique PXM in each of these
> structures.
>
> Introduce a new acpi-generic-initiator object to allow host admin provide the
> device and the corresponding NUMA nodes. Qemu maintain this association and
> use this object to build the requisite GI Affinity Structure. On a multi
> device system, each device supporting the features needs a unique
> acpi-generic-initiator object with its own set of NUMA nodes associated to it.
>
> An admin can provide the range of nodes through a uint16 array host-nodes
> and link it to a device by providing its id. Currently, only PCI device is
> supported. The following sample creates 8 nodes per PCI device for a VM
> with 2 PCI devices and link them to the respecitve PCI device using
> acpi-generic-initiator objects:
>
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> -numa node,nodeid=8 -numa node,nodeid=9 \
> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
>
> -numa node,nodeid=10 -numa node,nodeid=11 -numa node,nodeid=12 \
> -numa node,nodeid=13 -numa node,nodeid=14 -numa node,nodeid=15 \
> -numa node,nodeid=16 -numa node,nodeid=17 \
> -device vfio-pci-nohotplug,host=0009:01:01.0,bus=pcie.0,addr=05.0,rombar=0,id=dev1 \
> -object acpi-generic-initiator,id=gi1,pci-dev=dev1,host-nodes=10-17 \
>
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Appreciate the improved commit message.

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index c53ef978ff..7b33d4a53c 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -794,6 +794,21 @@
>  { 'struct': 'VfioUserServerProperties',
>    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>  
> +##
> +# @AcpiGenericInitiatorProperties:
> +#
> +# Properties for acpi-generic-initiator objects.
> +#
> +# @pci-dev: PCI device ID to be associated with the node
> +#
> +# @host-nodes: numa node list associated with the PCI device.

NUMA

Suggest "list of NUMA nodes associated with ..."

> +#
> +# Since: 9.0
> +##
> +{ 'struct': 'AcpiGenericInitiatorProperties',
> +  'data': { 'pci-dev': 'str',
> +            'host-nodes': ['uint16'] } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -911,6 +926,7 @@
>  ##
>  { 'enum': 'ObjectType',
>    'data': [
> +    'acpi-generic-initiator',
>      'authz-list',
>      'authz-listfile',
>      'authz-pam',
> @@ -981,6 +997,7 @@
>              'id': 'str' },
>    'discriminator': 'qom-type',
>    'data': {
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
>        'authz-list':                 'AuthZListProperties',
>        'authz-listfile':             'AuthZListFileProperties',
>        'authz-pam':                  'AuthZPAMProperties',

I'm holding my Acked-by until the interface design issues raised by
Jason have been resolved.




* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-08 12:09   ` Markus Armbruster
@ 2024-01-09  4:11     ` Ankit Agrawal
  2024-01-09  7:02       ` Markus Armbruster
  0 siblings, 1 reply; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-09  4:11 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	david@redhat.com, gshan@redhat.com, Jonathan.Cameron@huawei.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org


>> +##
>> +# @AcpiGenericInitiatorProperties:
>> +#
>> +# Properties for acpi-generic-initiator objects.
>> +#
>> +# @pci-dev: PCI device ID to be associated with the node
>> +#
>> +# @host-nodes: numa node list associated with the PCI device.
>
> NUMA
>
> Suggest "list of NUMA nodes associated with ..."

Ack, will make the change.

>> @@ -981,6 +997,7 @@
>>              'id': 'str' },
>>    'discriminator': 'qom-type',
>>    'data': {
>> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
>>        'authz-list':                 'AuthZListProperties',
>>        'authz-listfile':             'AuthZListFileProperties',
>>        'authz-pam':                  'AuthZPAMProperties',
>
> I'm holding my Acked-by until the interface design issues raised by
> Jason have been resolved.

I suppose you meant Jonathan here?


* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04 17:23       ` Alex Williamson
@ 2024-01-09  4:21         ` Ankit Agrawal
  0 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-09  4:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jonathan Cameron, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org


>> > However, I'll leave it up to those more familiar with the QEMU numa
>> > control interface design to comment on whether this approach is preferable
>> > to making the gi part of the numa node entry or doing it like hmat.
>>
>> > -numa srat-gi,node-id=10,gi-pci-dev=dev1
>>
>> The current way of acpi-generic-initiator object usage came out of the discussion
>> on v1 to essentially link all the device NUMA nodes to the device.
>> (https://lore.kernel.org/all/20230926131427.1e441670.alex.williamson@redhat.com/)
>>
>> Can Alex or David comment on which is preferable (the current mechanism vs 1:1
>> mapping per object as suggested by Jonathan)?
>
> I imagine there are ways that either could work, but specifying a
> gi-pci-dev in the numa node declaration appears to get a bit messy if we
> have multiple gi-pci-dev devices to associate to the node whereas
> creating an acpi-generic-initiator object per individual device:node
> relationship feels a bit easier to iterate.
>
> Also if we do extend the ACPI spec to more explicitly allow a device to
> associate to multiple nodes, we could re-instate the list behavior of
> the acpi-generic-initiator whereas I don't see a representation of the
> association at the numa object that makes sense.  Thanks,

Ack, making the change to create an individual acpi-generic-initiator object
per device:node.




* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09  4:11     ` Ankit Agrawal
@ 2024-01-09  7:02       ` Markus Armbruster
  0 siblings, 0 replies; 26+ messages in thread
From: Markus Armbruster @ 2024-01-09  7:02 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	david@redhat.com, gshan@redhat.com, Jonathan.Cameron@huawei.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

Ankit Agrawal <ankita@nvidia.com> writes:

>>> +##
>>> +# @AcpiGenericInitiatorProperties:
>>> +#
>>> +# Properties for acpi-generic-initiator objects.
>>> +#
>>> +# @pci-dev: PCI device ID to be associated with the node
>>> +#
>>> +# @host-nodes: numa node list associated with the PCI device.
>>
>> NUMA
>>
>> Suggest "list of NUMA nodes associated with ..."
>
> Ack, will make the change.
>
>>> @@ -981,6 +997,7 @@
>>>              'id': 'str' },
>>>    'discriminator': 'qom-type',
>>>    'data': {
>>> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
>>>        'authz-list':                 'AuthZListProperties',
>>>        'authz-listfile':             'AuthZListFileProperties',
>>>        'authz-pam':                  'AuthZPAMProperties',
>>
>> I'm holding my Acked-by until the interface design issues raised by
>> Jason have been resolved.
>
> I suppose you meant Jonathan here?

Yes.  Going too fast.  My apologies!


* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04  3:36     ` Ankit Agrawal
                         ` (2 preceding siblings ...)
  2024-01-04 17:23       ` Alex Williamson
@ 2024-01-09 16:38       ` Jonathan Cameron via
  3 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron via @ 2024-01-09 16:38 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson@redhat.com, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Thu, 4 Jan 2024 03:36:06 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> Thanks Jonathan for the review.
> 
> > As per reply to the cover letter I definitely want to see SRAT table dumps
> > in here though so we can easily see what this is actually building.  
> 
> Ack.
> 
> > I worry that some OS might make the assumption that it's one GI node
> > per PCI device though. The language in the ACPI specification is:
> > 
> > "The Generic Initiator Affinity Structure provides the association between _a_
> > generic initiator and _the_ proximity domain to which the initiator belongs".
> > 
> > The use of _a_ and _the_ in there makes it pretty explicitly a N:1 relationship
> > (multiple devices can be in same proximity domain, but a device may only be in one).
> > To avoid that confusion you will need an ACPI spec change.  I'd be happy to
> > support  
> 
> Yeah, that's a good point. It won't hurt to make the spec change to make the
> possibility of the association between a device with multiple domains.
> 
> > The reason you can get away with this in Linux today is that I only implemented
> > a very minimal support for GIs with the mappings being provided the other way
> > around (_PXM in a PCIe node in DSDT).  If we finish that support off I'd assume  
> 
> Not sure if I understand this. Can you provide a reference to this DSDT related
> change?

You need to add the PCI tree down to the device, which is a bit fiddly if there
are switches etc. I'm also not sure I ever followed up on getting the PCI
fix in after we finally dealt with the issue it triggered on old AMD boxes
(they had devices that claimed to be in nonexistent proximity domains :(
Later, at least one path to hit that was closed down, but I'm not sure all of
them were).

Anyhow, the fix for PCI includes an example where the EP has a different PXM
from the root bridge.  In this example 0x02 is the GI node.

https://lore.kernel.org/all/20180912152140.3676-2-Jonathan.Cameron@huawei.com/

>   Device (PCI2)
>   {
>     Name (_HID, "PNP0A08") // PCI Express Root Bridge
>     Name (_CID, "PNP0A03") // Compatible PCI Root Bridge
>     Name(_SEG, 2) // Segment of this Root complex
>     Name(_BBN, 0xF8) // Base Bus Number
>     Name(_CCA, 1)
>     Method (_PXM, 0, NotSerialized) {
>       Return(0x00)
>     }
> 
> ...
>     Device (BRI0) {
>       Name (_HID, "19E51610")
>       Name (_ADR, 0)
>       Name (_BBN, 0xF9)
>       Device (CAR0) {
>         Name (_HID, "97109912")
>         Name (_ADR, 0)
>         Method (_PXM, 0, NotSerialized) {
>           Return(0x02)
>         }
>       }
>     }
>   }

Without that PCI fix, you'll only see correct GI mappings in Linux
for platform devices.

Sorry for the slow reply - I missed the rest of this thread until I was
brandishing it as an argument in another discussion on GIs and noticed
it had carried on without me.

Jonathan




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-04 17:39         ` Alex Williamson
@ 2024-01-09 16:52           ` Jonathan Cameron via
  2024-01-09 17:02             ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron via @ 2024-01-09 16:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ankit Agrawal, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Thu, 4 Jan 2024 10:39:41 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 4 Jan 2024 16:40:39 +0000
> Ankit Agrawal <ankita@nvidia.com> wrote:
> 
> > Had a discussion with RH folks, summary follows:
> > 
> > 1. To align with the current spec description pointed by Jonathan, we first do
> >      a separate object instance per GI node as suggested by Jonathan. i.e.
> >      a acpi-generic-initiator would only link one node to the device. To 
> >      associate a set of nodes, those number of object instances should be
> >      created.
> > 2. In parallel, we work to get the spec updated. After the update, we switch
> >     to the current implementation to link a PCI device with a set of NUMA
> >     nodes.
> > 
> > Alex/Jonathan, does this sound fine?
> >   
> 
> Yes, as I understand Jonathan's comments, the acpi-generic-initiator
> object should currently define a single device:node relationship to
> match the ACPI definition.

Doesn't matter for this, but it's a many_device:single_node
relationship as currently defined. We should be able to support that
in any new interfaces for QEMU.
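
To make the N:1 point concrete, here is a minimal sketch (not QEMU code) that
checks a list of (device, proximity domain) pairs against that reading of the
spec. The function name and the device labels are hypothetical and only serve
the illustration.

```python
from collections import defaultdict


def devices_in_multiple_domains(gi_entries):
    """Return {device: sorted PXMs} for devices listed in more than one domain.

    gi_entries is an iterable of (device, pxm) pairs, one per GI structure.
    Under the N:1 reading, many devices may share a domain, but a device
    that shows up with two different domains falls outside the wording.
    """
    domains = defaultdict(set)
    for device, pxm in gi_entries:
        domains[device].add(pxm)
    return {dev: sorted(pxms) for dev, pxms in domains.items() if len(pxms) > 1}


# Many devices in one domain: fine under the current wording.
shared_domain = [("dev0", 2), ("dev1", 2)]
# One device in two domains: the case that would need a spec clarification.
split_device = [("dev1", 10), ("dev1", 11)]
```

An empty result means every device maps to exactly one domain.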

>  Separately a clarification of the spec
> could be pursued that could allow us to reinstate a node list option
> for the acpi-generic-initiator object.  In the interim, a user can
> define multiple 1:1 objects to create the 1:N relationship that's
> ultimately required here.  Thanks,

Yes, a spec clarification would work, probably needs some text
to say a GI might not be an initiator as well - my worry is
theoretical backwards compatibility with a (probably
nonexistent) OS that assumes the N:1 mapping. So you may be in 
new SRAT entry territory.

Given that, an alternative proposal that I think would work
for you would be to add a 'placeholder' memory node definition
in SRAT (so allow 0 size explicitly - might need a new SRAT
entry to avoid backwards compat issues). Then put the GPU
initiator part in a GI node and use the HMAT Memory Proximity
Domain Attributes magic linkage entry "Proximity Domain for
the Attached Initiator" to associate the placeholder memory
nodes with the GI / GPU.

I'd go to ASWG with a big diagram and ask 'how do I do this!'

If you do it code first I'm happy to help out with refining
the proposal. I just don't like the time of ASWG calls so tend
to not make them in person.

Or just emulate UEFI's CDAT (from CXL, but not CXL specific)
from your GPU and make it a driver problem ;)

Jonathan


> 
> Alex
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09 16:52           ` Jonathan Cameron via
@ 2024-01-09 17:02             ` David Hildenbrand
  2024-01-09 17:10               ` Jason Gunthorpe
  2024-01-10 23:19               ` Dan Williams
  0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2024-01-09 17:02 UTC (permalink / raw)
  To: Jonathan Cameron, Alex Williamson
  Cc: Ankit Agrawal, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, gshan@redhat.com, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
	Dheeraj Nigam, Uday Dhoke, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org

On 09.01.24 17:52, Jonathan Cameron wrote:
> On Thu, 4 Jan 2024 10:39:41 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Thu, 4 Jan 2024 16:40:39 +0000
>> Ankit Agrawal <ankita@nvidia.com> wrote:
>>
>>> Had a discussion with RH folks, summary follows:
>>>
>>> 1. To align with the current spec description pointed by Jonathan, we first do
>>>       a separate object instance per GI node as suggested by Jonathan. i.e.
>>>       a acpi-generic-initiator would only link one node to the device. To
>>>       associate a set of nodes, those number of object instances should be
>>>       created.
>>> 2. In parallel, we work to get the spec updated. After the update, we switch
>>>      to the current implementation to link a PCI device with a set of NUMA
>>>      nodes.
>>>
>>> Alex/Jonathan, does this sound fine?
>>>    
>>
>> Yes, as I understand Jonathan's comments, the acpi-generic-initiator
>> object should currently define a single device:node relationship to
>> match the ACPI definition.
> 
> Doesn't matter for this, but it's a many_device:single_node
> relationship as currently defined. We should be able to support that
> in any new interfaces for QEMU.
> 
>>   Separately a clarification of the spec
>> could be pursued that could allow us to reinstate a node list option
>> for the acpi-generic-initiator object.  In the interim, a user can
>> define multiple 1:1 objects to create the 1:N relationship that's
>> ultimately required here.  Thanks,
> 
> Yes, a spec clarification would work, probably needs some text
> to say a GI might not be an initiator as well - my worry is
> theoretical backwards compatibility with a (probably
> nonexistent) OS that assumes the N:1 mapping. So you may be in
> new SRAT entry territory.
> 
> Given that, an alternative proposal that I think would work
> for you would be to add a 'placeholder' memory node definition
> in SRAT (so allow 0 size explicitly - might need a new SRAT
> entry to avoid backwards compat issues).

Putting all the PCI/GI/... complexity aside, I'll just raise again that 
for virtio-mem something simple like that might be helpful as well, IIUC.

	-numa node,nodeid=2 \
	...
	-device virtio-mem-pci,node=2,... \

All we need is the OS to prepare for an empty node that will get 
populated with memory later.

So if that's what a "placeholder" node definition in srat could achieve 
as well, even without all of the other acpi-generic-initiator stuff, 
that would be great.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09 17:02             ` David Hildenbrand
@ 2024-01-09 17:10               ` Jason Gunthorpe
  2024-01-09 19:36                 ` Dan Williams
  2024-01-10 23:19               ` Dan Williams
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2024-01-09 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jonathan Cameron, Alex Williamson, Ankit Agrawal, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, gshan@redhat.com, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
	Dheeraj Nigam, Uday Dhoke, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org

On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > Given that, an alternative proposal that I think would work
> > for you would be to add a 'placeholder' memory node definition
> > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > entry to avoid backwards compat issues).
> 
> Putting all the PCI/GI/... complexity aside, I'll just raise again that for
> virtio-mem something simple like that might be helpful as well, IIUC.
> 
> 	-numa node,nodeid=2 \
> 	...
> 	-device virtio-mem-pci,node=2,... \
> 
> All we need is the OS to prepare for an empty node that will get populated
> with memory later.

That is all this is doing too, the NUMA relationship of the actual
memory is described already by the PCI device since it is a BAR on the
device.

The only purpose is to get the empty nodes into Linux :(

> So if that's what a "placeholder" node definition in srat could achieve as
> well, even without all of the other acpi-generic-initiator stuff, that would
> be great.

Seems like there are two quite similar use cases.. virtio-mem is going
to be calling the same family of kernel API I suspect :)

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09 17:10               ` Jason Gunthorpe
@ 2024-01-09 19:36                 ` Dan Williams
  2024-01-09 19:38                   ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Williams @ 2024-01-09 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe, David Hildenbrand
  Cc: Jonathan Cameron, Alex Williamson, Ankit Agrawal, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, gshan@redhat.com, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
	Dheeraj Nigam, Uday Dhoke, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org

Jason Gunthorpe wrote:
> On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > > Given that, an alternative proposal that I think would work
> > > for you would be to add a 'placeholder' memory node definition
> > > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > > entry to avoid backwards compat issues).
> > 
> > Putting all the PCI/GI/... complexity aside, I'll just raise again that for
> > virtio-mem something simple like that might be helpful as well, IIUC.
> > 
> > 	-numa node,nodeid=2 \
> > 	...
> > 	-device virtio-mem-pci,node=2,... \
> > 
> > All we need is the OS to prepare for an empty node that will get populated
> > with memory later.
> 
> That is all this is doing too, the NUMA relationship of the actual
> memory is described already by the PCI device since it is a BAR on the
> device.
> 
> The only purpose is to get the empty nodes into Linux :(
> 
> > So if that's what a "placeholder" node definition in srat could achieve as
> > well, even without all of the other acpi-generic-initiator stuff, that would
> > be great.
> 
> Seems like there are two quite similar use cases.. virtio-mem is going
> to be calling the same family of kernel API I suspect :)

It seems sad that we, as an industry, went through all of this trouble
to define a dynamically enumerable CXL device model only to turn around
and require static ACPI tables to tell us how to enumerate it.

A similar problem exists on the memory target side and the approach
taken there was to have Linux statically reserve at least enough numa
node numbers for all the platform CXL memory ranges (defined in the
ACPI.CEDT.CFMWS), but with the promise to come back and broach the
dynamic node creation problem "if the need arises".

This initiator-node enumeration case seems like that occasion where the
need has arisen to get Linux out of the mode of needing to declare all
possible numa nodes early in boot. Allow for nodes to be discoverable
post NUMA-init.

One strawman scheme that comes to mind is instead of "add nodes early" in
boot, "delete unused nodes late" in boot after the device topology has
been enumerated. Otherwise, requiring static ACPI tables to further
enumerate an industry-standard dynamically enumerated bus seems to be
going in the wrong direction.
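
As a toy model of that strawman (not kernel code; every name here is
hypothetical), the idea is to reserve a pool of node ids up front and prune
whatever the enumerated devices did not claim:

```python
def prune_unused_nodes(reserved, enumerated_devices):
    """Split reserved node ids into (kept, deleted) after device enumeration.

    reserved: iterable of node ids set aside early in boot.
    enumerated_devices: mapping of device name -> node id it claimed.
    Ids claimed by a device but never reserved are ignored in this sketch.
    """
    reserved = set(reserved)
    used = set(enumerated_devices.values())
    kept = sorted(reserved & used)
    deleted = sorted(reserved - used)
    return kept, deleted


# Eight ids reserved, but only two partitions showed up during enumeration.
kept, deleted = prune_unused_nodes(range(2, 10), {"part0": 2, "part1": 3})
```

This says nothing about how hard the deletion step would be in a real
kernel; it only shows the bookkeeping the scheme implies.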


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09 19:36                 ` Dan Williams
@ 2024-01-09 19:38                   ` Jason Gunthorpe
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2024-01-09 19:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Hildenbrand, Jonathan Cameron, Alex Williamson,
	Ankit Agrawal, clg@redhat.com, shannon.zhaosl@gmail.com,
	peter.maydell@linaro.org, ani@anisinha.ca, berrange@redhat.com,
	eduardo@habkost.net, imammedo@redhat.com, mst@redhat.com,
	eblake@redhat.com, armbru@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Tue, Jan 09, 2024 at 11:36:03AM -0800, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Jan 09, 2024 at 06:02:03PM +0100, David Hildenbrand wrote:
> > > > Given that, an alternative proposal that I think would work
> > > > for you would be to add a 'placeholder' memory node definition
> > > > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > > > entry to avoid backwards compat issues).
> > > 
> > > Putting all the PCI/GI/... complexity aside, I'll just raise again that for
> > > virtio-mem something simple like that might be helpful as well, IIUC.
> > > 
> > > 	-numa node,nodeid=2 \
> > > 	...
> > > 	-device virtio-mem-pci,node=2,... \
> > > 
> > > All we need is the OS to prepare for an empty node that will get populated
> > > with memory later.
> > 
> > That is all this is doing too, the NUMA relationship of the actual
> > memory is described already by the PCI device since it is a BAR on the
> > device.
> > 
> > The only purpose is to get the empty nodes into Linux :(
> > 
> > > So if that's what a "placeholder" node definition in srat could achieve as
> > > well, even without all of the other acpi-generic-initiator stuff, that would
> > > be great.
> > 
> > Seems like there are two quite similar use cases.. virtio-mem is going
> > to be calling the same family of kernel API I suspect :)
> 
> It seems sad that we, as an industry, went through all of this trouble
> to define a dynamically enumerable CXL device model only to turn around
> and require static ACPI tables to tell us how to enumerate it.
> 
> A similar problem exists on the memory target side and the approach
> taken there was to have Linux statically reserve at least enough numa
> node numbers for all the platform CXL memory ranges (defined in the
> ACPI.CEDT.CFMWS), but with the promise to come back and broach the
> dynamic node creation problem "if the need arises".
> 
> This initiator-node enumeration case seems like that occasion where the
> need has arisen to get Linux out of the mode of needing to declare all
> possible numa nodes early in boot. Allow for nodes to be discoverable
> post NUMA-init.
> 
> One strawman scheme that comes to mind is instead of "add nodes early" in
> boot, "delete unused nodes late" in boot after the device topology has
> been enumerated. Otherwise, requiring static ACPI tables to further
> enumerate an industry-standard dynamically enumerated bus seems to be
> going in the wrong direction.

Fully agree, and I think this will get increasingly painful as we go
down the CXL road.

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-09 17:02             ` David Hildenbrand
  2024-01-09 17:10               ` Jason Gunthorpe
@ 2024-01-10 23:19               ` Dan Williams
  2024-01-11  7:01                 ` Michael S. Tsirkin
  2024-01-16 14:02                 ` Ankit Agrawal
  1 sibling, 2 replies; 26+ messages in thread
From: Dan Williams @ 2024-01-10 23:19 UTC (permalink / raw)
  To: David Hildenbrand, Jonathan Cameron, Alex Williamson
  Cc: Ankit Agrawal, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, mst@redhat.com, eblake@redhat.com,
	armbru@redhat.com, gshan@redhat.com, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
	Dheeraj Nigam, Uday Dhoke, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org

David Hildenbrand wrote:
> On 09.01.24 17:52, Jonathan Cameron wrote:
> > On Thu, 4 Jan 2024 10:39:41 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> >> On Thu, 4 Jan 2024 16:40:39 +0000
> >> Ankit Agrawal <ankita@nvidia.com> wrote:
> >>
> >>> Had a discussion with RH folks, summary follows:
> >>>
> >>> 1. To align with the current spec description pointed by Jonathan, we first do
> >>>       a separate object instance per GI node as suggested by Jonathan. i.e.
> >>>       a acpi-generic-initiator would only link one node to the device. To
> >>>       associate a set of nodes, those number of object instances should be
> >>>       created.
> >>> 2. In parallel, we work to get the spec updated. After the update, we switch
> >>>      to the current implementation to link a PCI device with a set of NUMA
> >>>      nodes.
> >>>
> >>> Alex/Jonathan, does this sound fine?
> >>>    
> >>
> >> Yes, as I understand Jonathan's comments, the acpi-generic-initiator
> >> object should currently define a single device:node relationship to
> >> match the ACPI definition.
> > 
> > Doesn't matter for this, but it's a many_device:single_node
> > relationship as currently defined. We should be able to support that
> > in any new interfaces for QEMU.
> > 
> >>   Separately a clarification of the spec
> >> could be pursued that could allow us to reinstate a node list option
> >> for the acpi-generic-initiator object.  In the interim, a user can
> >> define multiple 1:1 objects to create the 1:N relationship that's
> >> ultimately required here.  Thanks,
> > 
> > Yes, a spec clarification would work, probably needs some text
> > to say a GI might not be an initiator as well - my worry is
> > theoretical backwards compatibility with a (probably
> > nonexistent) OS that assumes the N:1 mapping. So you may be in
> > new SRAT entry territory.
> > 
> > Given that, an alternative proposal that I think would work
> > for you would be to add a 'placeholder' memory node definition
> > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > entry to avoid backwards compat issues).
> 
> Putting all the PCI/GI/... complexity aside, I'll just raise again that 
> for virtio-mem something simple like that might be helpful as well, IIUC.
> 
> 	-numa node,nodeid=2 \
> 	...
> 	-device virtio-mem-pci,node=2,... \
> 
> All we need is the OS to prepare for an empty node that will get 
> populated with memory later.
> 
> So if that's what a "placeholder" node definition in srat could achieve 
> as well, even without all of the other acpi-generic-initiator stuff, 
> that would be great.

Please no "placeholder" definitions in SRAT. One of the main thrusts of
CXL is to move away from static ACPI tables describing vendor-specific
memory topology, towards an industry standard device enumeration.

Platform firmware enumerates the platform CXL "windows" (ACPI CEDT
CFMWS) and the relative performance of CPU access to a CXL port (ACPI
HMAT Generic Port), everything else is CXL standard enumeration.

It is strictly OS policy about how many NUMA nodes it imagines it wants
to define within that playground. The current OS policy is one node per
"window". If a solution believes Linux should be creating more than that
I submit that's a discussion with OS policy developers, not a trip to
the BIOS team to please sprinkle in more placeholders. Linux can fully
own the policy here. The painful bit is just that it never had to
before.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-10 23:19               ` Dan Williams
@ 2024-01-11  7:01                 ` Michael S. Tsirkin
  2024-01-16 14:02                 ` Ankit Agrawal
  1 sibling, 0 replies; 26+ messages in thread
From: Michael S. Tsirkin @ 2024-01-11  7:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: David Hildenbrand, Jonathan Cameron, Alex Williamson,
	Ankit Agrawal, Jason Gunthorpe, clg@redhat.com,
	shannon.zhaosl@gmail.com, peter.maydell@linaro.org,
	ani@anisinha.ca, berrange@redhat.com, eduardo@habkost.net,
	imammedo@redhat.com, eblake@redhat.com, armbru@redhat.com,
	gshan@redhat.com, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Dheeraj Nigam,
	Uday Dhoke, qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Wed, Jan 10, 2024 at 03:19:05PM -0800, Dan Williams wrote:
> David Hildenbrand wrote:
> > On 09.01.24 17:52, Jonathan Cameron wrote:
> > > On Thu, 4 Jan 2024 10:39:41 -0700
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > >> On Thu, 4 Jan 2024 16:40:39 +0000
> > >> Ankit Agrawal <ankita@nvidia.com> wrote:
> > >>
> > >>> Had a discussion with RH folks, summary follows:
> > >>>
> > >>> 1. To align with the current spec description pointed by Jonathan, we first do
> > >>>       a separate object instance per GI node as suggested by Jonathan. i.e.
> > >>>       a acpi-generic-initiator would only link one node to the device. To
> > >>>       associate a set of nodes, those number of object instances should be
> > >>>       created.
> > >>> 2. In parallel, we work to get the spec updated. After the update, we switch
> > >>>      to the current implementation to link a PCI device with a set of NUMA
> > >>>      nodes.
> > >>>
> > >>> Alex/Jonathan, does this sound fine?
> > >>>    
> > >>
> > >> Yes, as I understand Jonathan's comments, the acpi-generic-initiator
> > >> object should currently define a single device:node relationship to
> > >> match the ACPI definition.
> > > 
> > > Doesn't matter for this, but it's a many_device:single_node
> > > relationship as currently defined. We should be able to support that
> > > in any new interfaces for QEMU.
> > > 
> > >>   Separately a clarification of the spec
> > >> could be pursued that could allow us to reinstate a node list option
> > >> for the acpi-generic-initiator object.  In the interim, a user can
> > >> define multiple 1:1 objects to create the 1:N relationship that's
> > >> ultimately required here.  Thanks,
> > > 
> > > Yes, a spec clarification would work, probably needs some text
> > > to say a GI might not be an initiator as well - my worry is
> > > theoretical backwards compatibility with a (probably
> > > nonexistent) OS that assumes the N:1 mapping. So you may be in
> > > new SRAT entry territory.
> > > 
> > > Given that, an alternative proposal that I think would work
> > > for you would be to add a 'placeholder' memory node definition
> > > in SRAT (so allow 0 size explicitly - might need a new SRAT
> > > entry to avoid backwards compat issues).
> > 
> > Putting all the PCI/GI/... complexity aside, I'll just raise again that 
> > for virtio-mem something simple like that might be helpful as well, IIUC.
> > 
> > 	-numa node,nodeid=2 \
> > 	...
> > 	-device virtio-mem-pci,node=2,... \
> > 
> > All we need is the OS to prepare for an empty node that will get 
> > populated with memory later.
> > 
> > So if that's what a "placeholder" node definition in srat could achieve 
> > as well, even without all of the other acpi-generic-initiator stuff, 
> > that would be great.
> 
> Please no "placeholder" definitions in SRAT. One of the main thrusts of
> CXL is to move away from static ACPI tables describing vendor-specific
> memory topology, towards an industry standard device enumeration.
> 
> Platform firmware enumerates the platform CXL "windows" (ACPI CEDT
> CFMWS) and the relative performance of CPU access to a CXL port (ACPI
> HMAT Generic Port), everything else is CXL standard enumeration.

I assume memory topology and so on apply, right?  E.g PMTT etc.
Just making sure.


> It is strictly OS policy about how many NUMA nodes it imagines it wants
> to define within that playground. The current OS policy is one node per
> "window". If a solution believes Linux should be creating more than that
> I submit that's a discussion with OS policy developers, not a trip to
> the BIOS team to please sprinkle in more placeholders. Linux can fully
> own the policy here. The painful bit is just that it never had to
> before.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 1/2] qom: new object to associate device to numa node
  2024-01-10 23:19               ` Dan Williams
  2024-01-11  7:01                 ` Michael S. Tsirkin
@ 2024-01-16 14:02                 ` Ankit Agrawal
  1 sibling, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2024-01-16 14:02 UTC (permalink / raw)
  To: Dan Williams, David Hildenbrand, Jonathan Cameron,
	Alex Williamson
  Cc: Jason Gunthorpe, clg@redhat.com, shannon.zhaosl@gmail.com,
	peter.maydell@linaro.org, ani@anisinha.ca, berrange@redhat.com,
	eduardo@habkost.net, imammedo@redhat.com, mst@redhat.com,
	eblake@redhat.com, armbru@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

>> >
>> > Given that, an alternative proposal that I think would work
>> > for you would be to add a 'placeholder' memory node definition
>> > in SRAT (so allow 0 size explicitly - might need a new SRAT
>> > entry to avoid backwards compat issues).
>>
>> Putting all the PCI/GI/... complexity aside, I'll just raise again that
>> for virtio-mem something simple like that might be helpful as well, IIUC.
>>
>>       -numa node,nodeid=2 \
>>       ...
>>       -device virtio-mem-pci,node=2,... \
>>
>> All we need is the OS to prepare for an empty node that will get
>> populated with memory later.
>>
>> So if that's what a "placeholder" node definition in srat could achieve
>> as well, even without all of the other acpi-generic-initiator stuff,
>> that would be great.
>
> Please no "placeholder" definitions in SRAT. One of the main thrusts of
> CXL is to move away from static ACPI tables describing vendor-specific
> memory topology, towards an industry standard device enumeration.

So I suppose we go with the original suggestion that aligns with the
current spec description pointed out by Jonathan, which is the following:

A separate acpi-generic-initiator object that links only one node to the
device. For each such association, a new object would be created.

A previously mentioned example from Jonathan:
  -object acpi-generic-initiator,id=gi1,pci-dev=dev1,nodeid=10
  -object acpi-generic-initiator,id=gi2,pci-dev=dev1,nodeid=11
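
For a device that needs many nodes, the per-node objects could be generated
rather than typed by hand. A minimal sketch, assuming the `nodeid=` property
spelling from the example above (the final property name may differ) and a
hypothetical helper name:

```python
def gi_object_args(pci_dev, node_ids, id_prefix="gi"):
    """Build one '-object acpi-generic-initiator' argument pair per node.

    pci_dev: id of the PCI device as given to -device (e.g. "dev1").
    node_ids: NUMA node ids to associate with that device, one object each.
    The 'nodeid' key mirrors the example in this mail and is an assumption.
    """
    args = []
    for index, node in enumerate(node_ids, start=1):
        args += [
            "-object",
            f"acpi-generic-initiator,id={id_prefix}{index},"
            f"pci-dev={pci_dev},nodeid={node}",
        ]
    return args


# Reproduces the two objects from the example above.
example = gi_object_args("dev1", [10, 11])
```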

> It is strictly OS policy about how many NUMA nodes it imagines it wants
> to define within that playground. The current OS policy is one node per
> "window". If a solution believes Linux should be creating more than that
> I submit that's a discussion with OS policy developers, not a trip to
> the BIOS team to please sprinkle in more placeholders. Linux can fully
> own the policy here. The painful bit is just that it never had to
> before.

Whilst I agree that a Linux kernel solution would be nice in the long
term, such a change could be quite involved and intrusive.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] acpi: report numa nodes for device memory using GI
  2024-01-04  3:05   ` Ankit Agrawal
@ 2024-02-12 16:05     ` Michael S. Tsirkin
  2024-02-13  3:32       ` Ankit Agrawal
  0 siblings, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2024-02-12 16:05 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jonathan Cameron, Jason Gunthorpe, alex.williamson@redhat.com,
	clg@redhat.com, shannon.zhaosl@gmail.com,
	peter.maydell@linaro.org, ani@anisinha.ca, berrange@redhat.com,
	eduardo@habkost.net, imammedo@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

On Thu, Jan 04, 2024 at 03:05:27AM +0000, Ankit Agrawal wrote:
> 
> >>
> >> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
> >> -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 \
> >> -numa node,nodeid=8 -numa node,nodeid=9 \
> >> -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,host-nodes=2-9 \
> >>
> >
> > I'd find it helpful to see the resulting chunk of SRAT for these examples
> > (disassembled) in this cover letter and the patches (where there are more examples).
> 
> Ack. I'll document the resulting SRAT table as well.

Still didn't happen so this is dropped for now.
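
For readers who want a rough picture of what one such SRAT entry contains,
here is a sketch that packs a single Generic Initiator Affinity Structure
with a PCI device handle. The field layout (type 5, length 32, handle type 1
for PCI) follows my reading of the ACPI 6.3 SRAT definition; the byte order
of the BDF field in particular is an assumption worth checking against the
spec, and the helper names are hypothetical. This is not the output of the
patches.

```python
import struct

GI_TYPE = 5        # SRAT subtable type for Generic Initiator Affinity
GI_LENGTH = 32     # fixed structure length in bytes
HANDLE_PCI = 1     # device handle type: 0 = ACPI, 1 = PCI
FLAG_ENABLED = 1   # flags bit 0


def pci_device_handle(segment, bus, device, function):
    """16-byte PCI handle: segment (2), bus (1), dev/fn (1), 12 reserved.

    Assumed BDF layout: bus number in the first byte, then device in
    bits 7:3 and function in bits 2:0 of the second byte.
    """
    devfn = (device << 3) | function
    return struct.pack("<HBB12x", segment, bus, devfn)


def gi_affinity(pxm, handle):
    """Pack type, length, reserved, handle type, PXM, handle, flags, reserved."""
    return struct.pack("<BBBBI16sII", GI_TYPE, GI_LENGTH, 0, HANDLE_PCI,
                       pxm, handle, FLAG_ENABLED, 0)


# One entry tying segment 0, bus 0, device 4, function 0 to PXM 2.
entry = gi_affinity(2, pci_device_handle(0, 0, 4, 0))
```

With the one-object-per-node approach, a device linked to nodes 2-9 would
produce eight such entries that differ only in the proximity domain field.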



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v6 0/2] acpi: report numa nodes for device memory using GI
  2024-02-12 16:05     ` Michael S. Tsirkin
@ 2024-02-13  3:32       ` Ankit Agrawal
  0 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2024-02-13  3:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jonathan Cameron, Jason Gunthorpe, alex.williamson@redhat.com,
	clg@redhat.com, shannon.zhaosl@gmail.com,
	peter.maydell@linaro.org, ani@anisinha.ca, berrange@redhat.com,
	eduardo@habkost.net, imammedo@redhat.com, eblake@redhat.com,
	armbru@redhat.com, david@redhat.com, gshan@redhat.com,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke,
	qemu-arm@nongnu.org, qemu-devel@nongnu.org

>> >
>> > I'd find it helpful to see the resulting chunk of SRAT for these examples
>> > (disassembled) in this cover letter and the patches (where there are more examples).
>>
>> Ack. I'll document the resulting SRAT table as well.
>
> Still didn't happen so this is dropped for now.

Hi Michael, does this mean it is dropped from QEMU v9.0?
FWIW, I'll post the next version incorporating the feedback by next week.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-02-13  3:33 UTC | newest]

Thread overview: 26+ messages
2023-12-25  4:56 [PATCH v6 0/2] acpi: report numa nodes for device memory using GI ankita
2023-12-25  4:56 ` [PATCH v6 1/2] qom: new object to associate device to numa node ankita
2024-01-02 12:58   ` Jonathan Cameron via
2024-01-04  3:36     ` Ankit Agrawal
2024-01-04 12:33       ` Ankit Agrawal
2024-01-04 16:40       ` Ankit Agrawal
2024-01-04 17:39         ` Alex Williamson
2024-01-09 16:52           ` Jonathan Cameron via
2024-01-09 17:02             ` David Hildenbrand
2024-01-09 17:10               ` Jason Gunthorpe
2024-01-09 19:36                 ` Dan Williams
2024-01-09 19:38                   ` Jason Gunthorpe
2024-01-10 23:19               ` Dan Williams
2024-01-11  7:01                 ` Michael S. Tsirkin
2024-01-16 14:02                 ` Ankit Agrawal
2024-01-04 17:23       ` Alex Williamson
2024-01-09  4:21         ` Ankit Agrawal
2024-01-09 16:38       ` Jonathan Cameron via
2024-01-08 12:09   ` Markus Armbruster
2024-01-09  4:11     ` Ankit Agrawal
2024-01-09  7:02       ` Markus Armbruster
2023-12-25  4:56 ` [PATCH v6 2/2] hw/acpi: Implement the SRAT GI affinity structure ankita
2024-01-02 12:31 ` [PATCH v6 0/2] acpi: report numa nodes for device memory using GI Jonathan Cameron via
2024-01-04  3:05   ` Ankit Agrawal
2024-02-12 16:05     ` Michael S. Tsirkin
2024-02-13  3:32       ` Ankit Agrawal
