* [PATCH RFC v2 01/15] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
@ 2026-02-23 15:55 ` ankita
2026-02-23 15:55 ` [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
` (14 subsequent siblings)
15 siblings, 0 replies; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Allow custom changes to the nvgrace-gpu module init and exit functions
by expanding the definition of module_pci_driver().
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index d3e5fee29180..7c4d51f5c701 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -1287,7 +1287,17 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
.driver_managed_dma = true,
};
-module_pci_driver(nvgrace_gpu_vfio_pci_driver);
+static int __init nvgrace_gpu_vfio_pci_init(void)
+{
+ return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
+}
+module_init(nvgrace_gpu_vfio_pci_init);
+
+static void __exit nvgrace_gpu_vfio_pci_cleanup(void)
+{
+ pci_unregister_driver(&nvgrace_gpu_vfio_pci_driver);
+}
+module_exit(nvgrace_gpu_vfio_pci_cleanup);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
--
2.34.1
^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
2026-02-23 15:55 ` [PATCH RFC v2 01/15] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
@ 2026-02-23 15:55 ` ankita
2026-02-26 14:28 ` Shameer Kolothum Thodi
2026-03-04 0:13 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
` (13 subsequent siblings)
15 siblings, 2 replies; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The Extended GPU Memory (EGM) feature enables GPU access to system
memory across sockets and physical systems on Grace Hopper and Grace
Blackwell systems. When the feature is enabled through SBIOS, part of
the system memory is made available to the GPU through the EGM path.
The EGM functionality is separate and largely independent from the
core GPU device functionality. However, the EGM region information
of base SPA and size is associated with the GPU on the ACPI tables.
An architecture wih EGM represented as an auxiliary device suits well
in this context.
The parent GPU device creates an EGM auxiliary device to be managed
independently by an auxiliary EGM driver. The EGM region information
is kept as part of the shared struct nvgrace_egm_dev along with the
auxiliary device handle.
Each socket has a separate EGM region and hence a multi-socket system
has multiple EGM regions. Each EGM region has a separate nvgrace_egm_dev
and the nvgrace-gpu driver keeps the EGM regions as part of a list.
Note that EGM is an optional feature enabled through SBIOS. The EGM
properties are only populated in ACPI tables if the feature is enabled;
they are absent otherwise. The absence of the properties is thus not
considered fatal. An improper set of values, however, is considered
fatal.
It is also noteworthy that there may be multiple GPUs per socket, each
carrying duplicate EGM region information. Make sure the duplicate data
does not get added.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 5 +-
drivers/vfio/pci/nvgrace-gpu/Makefile | 2 +-
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 +++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 ++++++
drivers/vfio/pci/nvgrace-gpu/main.c | 76 +++++++++++++++++++++++++-
include/linux/nvgrace-egm.h | 23 ++++++++
6 files changed, 181 insertions(+), 3 deletions(-)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
create mode 100644 include/linux/nvgrace-egm.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 765ad2daa218..5b3d86de9ec0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27379,7 +27379,10 @@ VFIO NVIDIA GRACE GPU DRIVER
M: Ankit Agrawal <ankita@nvidia.com>
L: kvm@vger.kernel.org
S: Supported
-F: drivers/vfio/pci/nvgrace-gpu/
+F: drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+F: drivers/vfio/pci/nvgrace-gpu/main.c
+F: include/linux/nvgrace-egm.h
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
index 3ca8c187897a..e72cc6739ef8 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Makefile
+++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
-nvgrace-gpu-vfio-pci-y := main.o
+nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
new file mode 100644
index 000000000000..faf658723f7a
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/vfio_pci_core.h>
+#include "egm_dev.h"
+
+/*
+ * Determine if the EGM feature is enabled. If disabled, there
+ * will be no EGM properties populated in the ACPI tables and this
+ * fetch would fail.
+ */
+int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
+{
+ return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
+ pegmpxm);
+}
+
+static void nvgrace_gpu_release_aux_device(struct device *device)
+{
+ struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
+ struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+
+ kvfree(egm_dev);
+}
+
+struct nvgrace_egm_dev *
+nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
+ u64 egmpxm)
+{
+ struct nvgrace_egm_dev *egm_dev;
+ int ret;
+
+ egm_dev = kzalloc(sizeof(*egm_dev), GFP_KERNEL);
+ if (!egm_dev)
+ goto create_err;
+
+ egm_dev->egmpxm = egmpxm;
+ egm_dev->aux_dev.id = egmpxm;
+ egm_dev->aux_dev.name = name;
+ egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
+ egm_dev->aux_dev.dev.parent = &pdev->dev;
+
+ ret = auxiliary_device_init(&egm_dev->aux_dev);
+ if (ret)
+ goto free_dev;
+
+ ret = auxiliary_device_add(&egm_dev->aux_dev);
+ if (ret) {
+ auxiliary_device_uninit(&egm_dev->aux_dev);
+ goto free_dev;
+ }
+
+ return egm_dev;
+
+free_dev:
+ kvfree(egm_dev);
+create_err:
+ return NULL;
+}
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
new file mode 100644
index 000000000000..c00f5288f4e7
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef EGM_DEV_H
+#define EGM_DEV_H
+
+#include <linux/nvgrace-egm.h>
+
+int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
+
+struct nvgrace_egm_dev *
+nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
+ u64 egmphys);
+
+#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 7c4d51f5c701..23028e6e7192 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -10,6 +10,8 @@
#include <linux/pci-p2pdma.h>
#include <linux/pm_runtime.h>
#include <linux/memory-failure.h>
+#include <linux/nvgrace-egm.h>
+#include "egm_dev.h"
/*
* The device memory usable to the workloads running in the VM is cached
@@ -66,6 +68,68 @@ struct nvgrace_gpu_pci_core_device {
bool reset_done;
};
+/*
+ * Track egm device lists. Note that there is one device per socket.
+ * All the GPUs belonging to the same sockets are associated with
+ * the EGM device for that socket.
+ */
+static struct list_head egm_dev_list;
+
+static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
+{
+ struct nvgrace_egm_dev_entry *egm_entry;
+ u64 egmpxm;
+ int ret = 0;
+
+ /*
+ * EGM is an optional feature enabled in SBIOS. If disabled, there
+ * will be no EGM properties populated in the ACPI tables and this
+ * fetch would fail. Treat this failure as non-fatal and return
+ * early.
+ */
+ if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
+ goto exit;
+
+ egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
+ if (!egm_entry)
+ return -ENOMEM;
+
+ egm_entry->egm_dev =
+ nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
+ egmpxm);
+ if (!egm_entry->egm_dev) {
+ kvfree(egm_entry);
+ ret = -EINVAL;
+ goto exit;
+ }
+
+ list_add_tail(&egm_entry->list, &egm_dev_list);
+
+exit:
+ return ret;
+}
+
+static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
+{
+ struct nvgrace_egm_dev_entry *egm_entry, *temp_egm_entry;
+ u64 egmpxm;
+
+ if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
+ return;
+
+ list_for_each_entry_safe(egm_entry, temp_egm_entry, &egm_dev_list, list) {
+ /*
+ * Free the EGM region corresponding to the input GPU
+ * device.
+ */
+ if (egm_entry->egm_dev->egmpxm == egmpxm) {
+ auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
+ list_del(&egm_entry->list);
+ kvfree(egm_entry);
+ }
+ }
+}
+
static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
{
struct nvgrace_gpu_pci_core_device *nvdev =
@@ -1212,6 +1276,11 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
memphys, memlength);
if (ret)
goto out_put_vdev;
+
+ ret = nvgrace_gpu_create_egm_aux_device(pdev);
+ if (ret)
+ goto out_put_vdev;
+
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
} else {
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
@@ -1219,10 +1288,12 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
ret = vfio_pci_core_register_device(&nvdev->core_device);
if (ret)
- goto out_put_vdev;
+ goto out_reg;
return ret;
+out_reg:
+ nvgrace_gpu_destroy_egm_aux_device(pdev);
out_put_vdev:
vfio_put_device(&nvdev->core_device.vdev);
return ret;
@@ -1232,6 +1303,7 @@ static void nvgrace_gpu_remove(struct pci_dev *pdev)
{
struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+ nvgrace_gpu_destroy_egm_aux_device(pdev);
vfio_pci_core_unregister_device(core_device);
vfio_put_device(&core_device->vdev);
}
@@ -1289,6 +1361,8 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
static int __init nvgrace_gpu_vfio_pci_init(void)
{
+ INIT_LIST_HEAD(&egm_dev_list);
+
return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
}
module_init(nvgrace_gpu_vfio_pci_init);
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
new file mode 100644
index 000000000000..9575d4ad4338
--- /dev/null
+++ b/include/linux/nvgrace-egm.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef NVGRACE_EGM_H
+#define NVGRACE_EGM_H
+
+#include <linux/auxiliary_bus.h>
+
+#define NVGRACE_EGM_DEV_NAME "egm"
+
+struct nvgrace_egm_dev {
+ struct auxiliary_device aux_dev;
+ u64 egmpxm;
+};
+
+struct nvgrace_egm_dev_entry {
+ struct list_head list;
+ struct nvgrace_egm_dev *egm_dev;
+};
+
+#endif /* NVGRACE_EGM_H */
--
2.34.1
^ permalink raw reply related	[flat|nested] 42+ messages in thread

* RE: [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM
2026-02-23 15:55 ` [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
@ 2026-02-26 14:28 ` Shameer Kolothum Thodi
2026-03-04 0:13 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 14:28 UTC (permalink / raw)
To: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, alex@shazbot.org
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 23 February 2026 15:55
> To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alex@shazbot.org
> Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for
> EGM
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The Extended GPU Memory (EGM) feature enables the GPU access to
> the system memory across sockets and physical systems on the
> Grace Hopper and Grace Blackwell systems. When the feature is
> enabled through SBIOS, part of the system memory is made available
> to the GPU for access through EGM path.
>
> The EGM functionality is separate and largely independent from the
> core GPU device functionality. However, the EGM region information
> of base SPA and size is associated with the GPU on the ACPI tables.
> An architecture wih EGM represented as an auxiliary device suits well
> in this context.
>
> The parent GPU device creates an EGM auxiliary device to be managed
> independently by an auxiliary EGM driver. The EGM region information
> is kept as part of the shared struct nvgrace_egm_dev along with the
> auxiliary device handle.
>
> Each socket has a separate EGM region and hence a multi-socket system
> have multiple EGM regions. Each EGM region has a separate nvgrace_egm_dev
> and the nvgrace-gpu keeps the EGM regions as part of a list.
>
> Note that EGM is an optional feature enabled through SBIOS. The EGM
> properties are only populated in ACPI tables if the feature is enabled;
> they are absent otherwise. The absence of the properties is thus not
> considered fatal. The presence of improper set of values however are
> considered fatal.
>
> It is also noteworthy that there may also be multiple GPUs present per
> socket and have duplicate EGM region information with them. Make sure
> the duplicate data does not get added.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 5 +-
> drivers/vfio/pci/nvgrace-gpu/Makefile | 2 +-
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 +++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 ++++++
> drivers/vfio/pci/nvgrace-gpu/main.c | 76 +++++++++++++++++++++++++-
> include/linux/nvgrace-egm.h | 23 ++++++++
> 6 files changed, 181 insertions(+), 3 deletions(-)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> create mode 100644 include/linux/nvgrace-egm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 765ad2daa218..5b3d86de9ec0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27379,7 +27379,10 @@ VFIO NVIDIA GRACE GPU DRIVER
> M: Ankit Agrawal <ankita@nvidia.com>
> L: kvm@vger.kernel.org
> S: Supported
> -F: drivers/vfio/pci/nvgrace-gpu/
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +F: drivers/vfio/pci/nvgrace-gpu/main.c
> +F: include/linux/nvgrace-egm.h
>
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-
> gpu/Makefile
> index 3ca8c187897a..e72cc6739ef8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> -nvgrace-gpu-vfio-pci-y := main.o
> +nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> new file mode 100644
> index 000000000000..faf658723f7a
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved
> + */
> +
> +#include <linux/vfio_pci_core.h>
> +#include "egm_dev.h"
> +
> +/*
> + * Determine if the EGM feature is enabled. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail.
> + */
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> +{
> + return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
> + pegmpxm);
> +}
> +
> +static void nvgrace_gpu_release_aux_device(struct device *device)
> +{
> + struct auxiliary_device *aux_dev = container_of(device, struct
> auxiliary_device, dev);
> + struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct
> nvgrace_egm_dev, aux_dev);
> +
> + kvfree(egm_dev);
> +}
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmpxm)
> +{
> + struct nvgrace_egm_dev *egm_dev;
> + int ret;
> +
> + egm_dev = kzalloc(sizeof(*egm_dev), GFP_KERNEL);
> + if (!egm_dev)
> + goto create_err;
> +
> + egm_dev->egmpxm = egmpxm;
> + egm_dev->aux_dev.id = egmpxm;
> + egm_dev->aux_dev.name = name;
> + egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> + egm_dev->aux_dev.dev.parent = &pdev->dev;
> +
> + ret = auxiliary_device_init(&egm_dev->aux_dev);
> + if (ret)
> + goto free_dev;
> +
> + ret = auxiliary_device_add(&egm_dev->aux_dev);
> + if (ret) {
> + auxiliary_device_uninit(&egm_dev->aux_dev);
> + goto free_dev;
> + }
> +
> + return egm_dev;
> +
> +free_dev:
> + kvfree(egm_dev);
> +create_err:
> + return NULL;
> +}
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> new file mode 100644
> index 000000000000..c00f5288f4e7
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved
> + */
> +
> +#ifndef EGM_DEV_H
> +#define EGM_DEV_H
> +
> +#include <linux/nvgrace-egm.h>
> +
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmphys);
> +
> +#endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 7c4d51f5c701..23028e6e7192 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -10,6 +10,8 @@
> #include <linux/pci-p2pdma.h>
> #include <linux/pm_runtime.h>
> #include <linux/memory-failure.h>
> +#include <linux/nvgrace-egm.h>
> +#include "egm_dev.h"
>
> /*
> * The device memory usable to the workloads running in the VM is cached
> @@ -66,6 +68,68 @@ struct nvgrace_gpu_pci_core_device {
> bool reset_done;
> };
>
> +/*
> + * Track egm device lists. Note that there is one device per socket.
> + * All the GPUs belonging to the same sockets are associated with
> + * the EGM device for that socket.
> + */
> +static struct list_head egm_dev_list;
Probably I asked this before...Does this need any locking?
> +
> +static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> +{
> + struct nvgrace_egm_dev_entry *egm_entry;
> + u64 egmpxm;
> + int ret = 0;
> +
> + /*
> + * EGM is an optional feature enabled in SBIOS. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail. Treat this failure as non-fatal and return
> + * early.
> + */
> + if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> + goto exit;
> +
> + egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
> + if (!egm_entry)
> + return -ENOMEM;
> +
> + egm_entry->egm_dev =
> + nvgrace_gpu_create_aux_device(pdev,
> NVGRACE_EGM_DEV_NAME,
> + egmpxm);
> + if (!egm_entry->egm_dev) {
> + kvfree(egm_entry);
> + ret = -EINVAL;
> + goto exit;
> + }
> +
> + list_add_tail(&egm_entry->list, &egm_dev_list);
Commit log mentions "Make sure the duplicate data does not get added",
but this doesn't have any check in case multiple GPUs point to the same
egm_dev, right? Or did the commit mean something else?
Thanks,
Shameer
^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM
2026-02-23 15:55 ` [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
2026-02-26 14:28 ` Shameer Kolothum Thodi
@ 2026-03-04 0:13 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 0:13 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:01 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The Extended GPU Memory (EGM) feature enables the GPU access to
> the system memory across sockets and physical systems on the
> Grace Hopper and Grace Blackwell systems. When the feature is
> enabled through SBIOS, part of the system memory is made available
> to the GPU for access through EGM path.
>
> The EGM functionality is separate and largely independent from the
"largely independent", what happens to access to the remote memory
through the GPU during reset?
In your KVM Forum presentation you show a remote CPU accessing EGM
memory through a local GPU, through the NVLink, though a remote GPU, to
the remote CPU memory. Does this only work if all the GPUs in the path
are bound to nvgrace-gpu?
The ownership of these egm devices vs the vfio device seems dubious.
> core GPU device functionality. However, the EGM region information
> of base SPA and size is associated with the GPU on the ACPI tables.
> An architecture wih EGM represented as an auxiliary device suits well
s/wih/with/
> in this context.
>
> The parent GPU device creates an EGM auxiliary device to be managed
> independently by an auxiliary EGM driver. The EGM region information
> is kept as part of the shared struct nvgrace_egm_dev along with the
> auxiliary device handle.
>
> Each socket has a separate EGM region and hence a multi-socket system
> have multiple EGM regions. Each EGM region has a separate nvgrace_egm_dev
> and the nvgrace-gpu keeps the EGM regions as part of a list.
>
> Note that EGM is an optional feature enabled through SBIOS. The EGM
> properties are only populated in ACPI tables if the feature is enabled;
> they are absent otherwise. The absence of the properties is thus not
> considered fatal. The presence of improper set of values however are
> considered fatal.
>
> It is also noteworthy that there may also be multiple GPUs present per
> socket and have duplicate EGM region information with them. Make sure
> the duplicate data does not get added.
De-duplication isn't done until the next patch.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 5 +-
> drivers/vfio/pci/nvgrace-gpu/Makefile | 2 +-
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 +++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 ++++++
> drivers/vfio/pci/nvgrace-gpu/main.c | 76 +++++++++++++++++++++++++-
> include/linux/nvgrace-egm.h | 23 ++++++++
> 6 files changed, 181 insertions(+), 3 deletions(-)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> create mode 100644 include/linux/nvgrace-egm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 765ad2daa218..5b3d86de9ec0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27379,7 +27379,10 @@ VFIO NVIDIA GRACE GPU DRIVER
> M: Ankit Agrawal <ankita@nvidia.com>
> L: kvm@vger.kernel.org
> S: Supported
> -F: drivers/vfio/pci/nvgrace-gpu/
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +F: drivers/vfio/pci/nvgrace-gpu/main.c
This was better before, you own the sub-directory, we don't need to
list each file and it adds maintenance.
> +F: include/linux/nvgrace-egm.h
>
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
> index 3ca8c187897a..e72cc6739ef8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> -nvgrace-gpu-vfio-pci-y := main.o
> +nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> new file mode 100644
> index 000000000000..faf658723f7a
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
2026
> + */
> +
> +#include <linux/vfio_pci_core.h>
> +#include "egm_dev.h"
> +
> +/*
> + * Determine if the EGM feature is enabled. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail.
> + */
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> +{
> + return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
> + pegmpxm);
> +}
> +
> +static void nvgrace_gpu_release_aux_device(struct device *device)
> +{
> + struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
> + struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> +
> + kvfree(egm_dev);
This was allocated with kzalloc() it should use kfree() not kvfree().
> +}
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmpxm)
> +{
> + struct nvgrace_egm_dev *egm_dev;
> + int ret;
> +
> + egm_dev = kzalloc(sizeof(*egm_dev), GFP_KERNEL);
> + if (!egm_dev)
> + goto create_err;
> +
> + egm_dev->egmpxm = egmpxm;
> + egm_dev->aux_dev.id = egmpxm;
> + egm_dev->aux_dev.name = name;
> + egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> + egm_dev->aux_dev.dev.parent = &pdev->dev;
> +
> + ret = auxiliary_device_init(&egm_dev->aux_dev);
> + if (ret)
> + goto free_dev;
> +
> + ret = auxiliary_device_add(&egm_dev->aux_dev);
> + if (ret) {
> + auxiliary_device_uninit(&egm_dev->aux_dev);
> + goto free_dev;
There's a double free here, from auxiliary_device_init():
* It returns 0 on success. On success, the device_initialize has been
* performed. After this point any error unwinding will need to include a call
* to auxiliary_device_uninit(). In this post-initialize error scenario, a call
* to the device's .release callback will be triggered, and all memory clean-up
* is expected to be handled there.
> + }
> +
> + return egm_dev;
> +
> +free_dev:
> + kvfree(egm_dev);
> +create_err:
> + return NULL;
> +}
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> new file mode 100644
> index 000000000000..c00f5288f4e7
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
2026
> + */
> +
> +#ifndef EGM_DEV_H
> +#define EGM_DEV_H
> +
> +#include <linux/nvgrace-egm.h>
> +
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmphys);
egmpxm
> +
> +#endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 7c4d51f5c701..23028e6e7192 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -10,6 +10,8 @@
> #include <linux/pci-p2pdma.h>
> #include <linux/pm_runtime.h>
> #include <linux/memory-failure.h>
> +#include <linux/nvgrace-egm.h>
> +#include "egm_dev.h"
>
> /*
> * The device memory usable to the workloads running in the VM is cached
> @@ -66,6 +68,68 @@ struct nvgrace_gpu_pci_core_device {
> bool reset_done;
> };
>
> +/*
> + * Track egm device lists. Note that there is one device per socket.
> + * All the GPUs belonging to the same sockets are associated with
> + * the EGM device for that socket.
> + */
> +static struct list_head egm_dev_list;
As Shameer notes, this list needs locking to avoid concurrent
operation corruption. I'd also question why we're tracking this list
in the main code of the nvgrace-gpu driver rather than in the egm_dev
aux driver portion of the code. It would be trivial to do
de-duplication in the create function if the list were over there.
> +
> +static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> +{
> + struct nvgrace_egm_dev_entry *egm_entry;
> + u64 egmpxm;
> + int ret = 0;
> +
> + /*
> + * EGM is an optional feature enabled in SBIOS. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail. Treat this failure as non-fatal and return
> + * early.
> + */
> + if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> + goto exit;
return 0;
> +
> + egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
> + if (!egm_entry)
> + return -ENOMEM;
> +
> + egm_entry->egm_dev =
> + nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
> + egmpxm);
> + if (!egm_entry->egm_dev) {
> + kvfree(egm_entry);
kzalloc() -> kfree()
> + ret = -EINVAL;
Why doesn't the previous function return ERR_PTR() to propagate the
errno through rather than clobber it? We don't really need the goto
here for now either.
	struct nvgrace_egm_dev *egm_dev;

	egm_dev = nvgrace_gpu_create...
	if (IS_ERR(egm_dev)) {
		kfree(egm_entry);
		return PTR_ERR(egm_dev);
	}
	egm_entry->egm_dev = egm_dev;
> + goto exit;
> + }
> +
> + list_add_tail(&egm_entry->list, &egm_dev_list);
> +
> +exit:
s/exit://
> + return ret;
return 0;
> +}
> +
> +static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
> +{
> + struct nvgrace_egm_dev_entry *egm_entry, *temp_egm_entry;
> + u64 egmpxm;
> +
> + if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> + return;
> +
> + list_for_each_entry_safe(egm_entry, temp_egm_entry, &egm_dev_list, list) {
> + /*
> + * Free the EGM region corresponding to the input GPU
> + * device.
> + */
> + if (egm_entry->egm_dev->egmpxm == egmpxm) {
> + auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
> + list_del(&egm_entry->list);
> + kvfree(egm_entry);
kfree()
Why do we continue walking the list after we've found it? Is this
because we don't yet do the de-duplication?
> + }
> + }
> +}
> +
> static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
> {
> struct nvgrace_gpu_pci_core_device *nvdev =
> @@ -1212,6 +1276,11 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
> memphys, memlength);
> if (ret)
> goto out_put_vdev;
> +
> + ret = nvgrace_gpu_create_egm_aux_device(pdev);
> + if (ret)
> + goto out_put_vdev;
> +
> nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
> } else {
> nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
> @@ -1219,10 +1288,12 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>
> ret = vfio_pci_core_register_device(&nvdev->core_device);
> if (ret)
> - goto out_put_vdev;
> + goto out_reg;
>
> return ret;
>
> +out_reg:
> + nvgrace_gpu_destroy_egm_aux_device(pdev);
> out_put_vdev:
> vfio_put_device(&nvdev->core_device.vdev);
> return ret;
> @@ -1232,6 +1303,7 @@ static void nvgrace_gpu_remove(struct pci_dev *pdev)
> {
> struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>
> + nvgrace_gpu_destroy_egm_aux_device(pdev);
I'm curious how this will handle the lifecycle issues if the device is
unbound from the nvgrace-gpu driver while the aux egm device is still
in use...
> vfio_pci_core_unregister_device(core_device);
> vfio_put_device(&core_device->vdev);
> }
> @@ -1289,6 +1361,8 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
>
> static int __init nvgrace_gpu_vfio_pci_init(void)
> {
> + INIT_LIST_HEAD(&egm_dev_list);
> +
> return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
> }
> module_init(nvgrace_gpu_vfio_pci_init);
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> new file mode 100644
> index 000000000000..9575d4ad4338
> --- /dev/null
> +++ b/include/linux/nvgrace-egm.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
2026
> + */
> +
> +#ifndef NVGRACE_EGM_H
> +#define NVGRACE_EGM_H
> +
> +#include <linux/auxiliary_bus.h>
> +
> +#define NVGRACE_EGM_DEV_NAME "egm"
> +
> +struct nvgrace_egm_dev {
> + struct auxiliary_device aux_dev;
> + u64 egmpxm;
> +};
> +
> +struct nvgrace_egm_dev_entry {
> + struct list_head list;
> + struct nvgrace_egm_dev *egm_dev;
> +};
Looks like only nvgrace_egm_dev eventually requires a public header.
The list entry certainly doesn't need to be here. Thanks,
Alex
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
2026-02-23 15:55 ` [PATCH RFC v2 01/15] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
2026-02-23 15:55 ` [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
@ 2026-02-23 15:55 ` ankita
2026-02-26 14:55 ` Shameer Kolothum Thodi
2026-02-23 15:55 ` [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
` (12 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Grace Blackwell systems can have multiple GPUs on a socket, each
associated with the EGM region for that socket. Track the GPUs
as a list.
On the device probe, the device pci_dev struct is added to a
linked list of the appropriate EGM region.
Similarly on device remove, the pci_dev struct for the GPU
is removed from the EGM region.
Since the GPUs on a socket share the same EGM region, they have
the same set of EGM region information. Skip the EGM region
information fetch if it was already done through a different
GPU on the same socket.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 29 ++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++
drivers/vfio/pci/nvgrace-gpu/main.c | 37 +++++++++++++++++++++++---
include/linux/nvgrace-egm.h | 6 +++++
4 files changed, 72 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index faf658723f7a..0bf95688a486 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -17,6 +17,33 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
pegmpxm);
}
+int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
+{
+ struct gpu_node *node;
+
+ node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ node->pdev = pdev;
+
+ list_add_tail(&node->list, &egm_dev->gpus);
+
+ return 0;
+}
+
+void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
+{
+ struct gpu_node *node, *tmp;
+
+ list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
+ if (node->pdev == pdev) {
+ list_del(&node->list);
+ kfree(node);
+ }
+ }
+}
+
static void nvgrace_gpu_release_aux_device(struct device *device)
{
struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
@@ -37,6 +64,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
goto create_err;
egm_dev->egmpxm = egmpxm;
+ INIT_LIST_HEAD(&egm_dev->gpus);
+
egm_dev->aux_dev.id = egmpxm;
egm_dev->aux_dev.name = name;
egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index c00f5288f4e7..1635753c9e50 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -10,6 +10,10 @@
int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
+int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
+
+void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
+
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
u64 egmphys);
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 23028e6e7192..3dd0c57e5789 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -77,9 +77,10 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
- struct nvgrace_egm_dev_entry *egm_entry;
+ struct nvgrace_egm_dev_entry *egm_entry = NULL;
u64 egmpxm;
int ret = 0;
+ bool is_new_region = false;
/*
* EGM is an optional feature enabled in SBIOS. If disabled, there
@@ -90,6 +91,19 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
+ list_for_each_entry(egm_entry, &egm_dev_list, list) {
+ /*
+ * A system could have multiple GPUs associated with an
+ * EGM region and will have the same set of EGM region
+ * information. Skip the EGM region information fetch if
+ * already done through a differnt GPU on the same socket.
+ */
+ if (egm_entry->egm_dev->egmpxm == egmpxm)
+ goto add_gpu;
+ }
+
+ is_new_region = true;
+
egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
if (!egm_entry)
return -ENOMEM;
@@ -98,13 +112,24 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
egmpxm);
if (!egm_entry->egm_dev) {
- kvfree(egm_entry);
ret = -EINVAL;
- goto exit;
+ goto free_egm_entry;
}
- list_add_tail(&egm_entry->list, &egm_dev_list);
+add_gpu:
+ ret = add_gpu(egm_entry->egm_dev, pdev);
+ if (ret)
+ goto free_dev;
+ if (is_new_region)
+ list_add_tail(&egm_entry->list, &egm_dev_list);
+ return 0;
+
+free_dev:
+ if (is_new_region)
+ auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
+free_egm_entry:
+ kvfree(egm_entry);
exit:
return ret;
}
@@ -123,6 +148,10 @@ static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
* device.
*/
if (egm_entry->egm_dev->egmpxm == egmpxm) {
+ remove_gpu(egm_entry->egm_dev, pdev);
+ if (!list_empty(&egm_entry->egm_dev->gpus))
+ break;
+
auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
list_del(&egm_entry->list);
kvfree(egm_entry);
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index 9575d4ad4338..e42494a2b1a6 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -10,9 +10,15 @@
#define NVGRACE_EGM_DEV_NAME "egm"
+struct gpu_node {
+ struct list_head list;
+ struct pci_dev *pdev;
+};
+
struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
u64 egmpxm;
+ struct list_head gpus;
};
struct nvgrace_egm_dev_entry {
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread

* RE: [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions
2026-02-23 15:55 ` [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
@ 2026-02-26 14:55 ` Shameer Kolothum Thodi
2026-03-04 17:14 ` Alex Williamson
0 siblings, 1 reply; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 14:55 UTC (permalink / raw)
To: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, alex@shazbot.org
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 23 February 2026 15:55
> To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alex@shazbot.org
> Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with
> the EGM regions
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Grace Blackwell systems could have multiple GPUs on a socket and
> thus are associated with the corresponding EGM region for that
> socket. Track the GPUs as a list.
>
> On the device probe, the device pci_dev struct is added to a
> linked list of the appropriate EGM region.
>
> Similarly on device remove, the pci_dev struct for the GPU
> is removed from the EGM region.
>
> Since the GPUs on a socket have the same EGM region, they have
> the have the same set of EGM region information. Skip the EGM
> region information fetch if already done through a differnt
> GPU on the same socket.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 29 ++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++
> drivers/vfio/pci/nvgrace-gpu/main.c | 37 +++++++++++++++++++++++---
> include/linux/nvgrace-egm.h | 6 +++++
> 4 files changed, 72 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index faf658723f7a..0bf95688a486 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -17,6 +17,33 @@ int nvgrace_gpu_has_egm_property(struct pci_dev
> *pdev, u64 *pegmpxm)
> pegmpxm);
> }
>
> +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> +{
> + struct gpu_node *node;
> +
> + node = kzalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> +
> + node->pdev = pdev;
> +
> + list_add_tail(&node->list, &egm_dev->gpus);
> +
> + return 0;
> +}
> +
> +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> +{
> + struct gpu_node *node, *tmp;
> +
> + list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
Looks like this gpu list also will require a lock.
Can we get rid of this gpu list by having a refcount_t in struct nvgrace_egm_dev?
> + if (node->pdev == pdev) {
> + list_del(&node->list);
> + kfree(node);
> + }
> + }
> +}
> +
> static void nvgrace_gpu_release_aux_device(struct device *device)
> {
> struct auxiliary_device *aux_dev = container_of(device, struct
> auxiliary_device, dev);
> @@ -37,6 +64,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev,
> const char *name,
> goto create_err;
>
> egm_dev->egmpxm = egmpxm;
> + INIT_LIST_HEAD(&egm_dev->gpus);
> +
> egm_dev->aux_dev.id = egmpxm;
> egm_dev->aux_dev.name = name;
> egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index c00f5288f4e7..1635753c9e50 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -10,6 +10,10 @@
>
> int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
>
> +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> +
> +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> +
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> u64 egmphys);
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 23028e6e7192..3dd0c57e5789 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -77,9 +77,10 @@ static struct list_head egm_dev_list;
>
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> - struct nvgrace_egm_dev_entry *egm_entry;
> + struct nvgrace_egm_dev_entry *egm_entry = NULL;
> u64 egmpxm;
> int ret = 0;
> + bool is_new_region = false;
>
> /*
> * EGM is an optional feature enabled in SBIOS. If disabled, there
> @@ -90,6 +91,19 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> + list_for_each_entry(egm_entry, &egm_dev_list, list) {
> + /*
> + * A system could have multiple GPUs associated with an
> + * EGM region and will have the same set of EGM region
> + * information. Skip the EGM region information fetch if
> + * already done through a differnt GPU on the same socket.
> + */
> + if (egm_entry->egm_dev->egmpxm == egmpxm)
> + goto add_gpu;
> + }
> +
> + is_new_region = true;
> +
> egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
> if (!egm_entry)
> return -ENOMEM;
> @@ -98,13 +112,24 @@ static int
> nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> nvgrace_gpu_create_aux_device(pdev,
> NVGRACE_EGM_DEV_NAME,
> egmpxm);
> if (!egm_entry->egm_dev) {
> - kvfree(egm_entry);
> ret = -EINVAL;
> - goto exit;
> + goto free_egm_entry;
> }
>
> - list_add_tail(&egm_entry->list, &egm_dev_list);
> +add_gpu:
> + ret = add_gpu(egm_entry->egm_dev, pdev);
> + if (ret)
> + goto free_dev;
>
> + if (is_new_region)
> + list_add_tail(&egm_entry->list, &egm_dev_list);
So this is where you address the previous patch comment I suppose...
If so, need to change the commit description there.
> + return 0;
> +
> +free_dev:
> + if (is_new_region)
> + auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
> +free_egm_entry:
> + kvfree(egm_entry);
Suppose the add_gpu() above fails, then you will end up here with an existing
egm_entry which might be in use.
Thanks,
Shameer
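Shameer's point about the error path can be illustrated with a small userspace sketch: on failure, the unwind must only release what this call created, so an entry that pre-existed (and is in use by another GPU) stays on the list. The array-backed list and `attach_gpu` name are invented simplifications of the patch's logic:

```c
#include <assert.h>
#include <stdlib.h>

struct entry { unsigned long id; int gpus; };

static struct entry *entries[8];
static int nentries;

static struct entry *find_entry(unsigned long id)
{
	for (int i = 0; i < nentries; i++)
		if (entries[i]->id == id)
			return entries[i];
	return NULL;
}

/*
 * On failure, unwind only what this call created: a pre-existing
 * entry stays on the list untouched.
 */
static int attach_gpu(unsigned long id, int fail_attach)
{
	struct entry *e = find_entry(id);
	int is_new = !e;

	if (is_new) {
		e = calloc(1, sizeof(*e));
		if (!e)
			return -1;
		e->id = id;
		entries[nentries++] = e;
	}

	if (fail_attach) {
		if (is_new) {
			nentries--;	/* drop the entry we just appended */
			free(e);
		}
		return -1;
	}

	e->gpus++;
	return 0;
}
```

The patch as posted reaches its free label unconditionally, which is exactly the case the `is_new` guard above avoids.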
^ permalink raw reply [flat|nested] 42+ messages in thread

* Re: [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions
2026-02-26 14:55 ` Shameer Kolothum Thodi
@ 2026-03-04 17:14 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 17:14 UTC (permalink / raw)
To: Shameer Kolothum Thodi
Cc: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, alex
On Thu, 26 Feb 2026 14:55:37 +0000
Shameer Kolothum Thodi <skolothumtho@nvidia.com> wrote:
> > -----Original Message-----
> > From: Ankit Agrawal <ankita@nvidia.com>
> > Sent: 23 February 2026 15:55
> > To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> > Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> > jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> > alex@shazbot.org
> > Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> > Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> > kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with
> > the EGM regions
> >
> > From: Ankit Agrawal <ankita@nvidia.com>
> >
> > Grace Blackwell systems could have multiple GPUs on a socket and
> > thus are associated with the corresponding EGM region for that
> > socket. Track the GPUs as a list.
> >
> > On the device probe, the device pci_dev struct is added to a
> > linked list of the appropriate EGM region.
> >
> > Similarly on device remove, the pci_dev struct for the GPU
> > is removed from the EGM region.
> >
> > Since the GPUs on a socket have the same EGM region, they have
> > the have the same set of EGM region information. Skip the EGM
> > region information fetch if already done through a differnt
> > GPU on the same socket.
> >
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> > drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 29 ++++++++++++++++++++
> > drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++
> > drivers/vfio/pci/nvgrace-gpu/main.c | 37 +++++++++++++++++++++++---
> > include/linux/nvgrace-egm.h | 6 +++++
> > 4 files changed, 72 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> > b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> > index faf658723f7a..0bf95688a486 100644
> > --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> > +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> > @@ -17,6 +17,33 @@ int nvgrace_gpu_has_egm_property(struct pci_dev
> > *pdev, u64 *pegmpxm)
> > pegmpxm);
> > }
> >
> > +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> > +{
> > + struct gpu_node *node;
> > +
> > + node = kzalloc(sizeof(*node), GFP_KERNEL);
> > + if (!node)
> > + return -ENOMEM;
> > +
> > + node->pdev = pdev;
> > +
> > + list_add_tail(&node->list, &egm_dev->gpus);
> > +
> > + return 0;
> > +}
> > +
> > +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> > +{
> > + struct gpu_node *node, *tmp;
> > +
> > + list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
>
> Looks like this gpu list also will require a lock.
+1
> Can we get rid of this gpu list by having a refcount_t in struct nvgrace_egm_dev?
+1
In this implementation, a reference count seems sufficient and the
egm_dev list could be moved to egm_dev.c, where a get_or_create
function could handle the de-dupe and refcount and a put function could
deference and free.
We'd only need reference to the GPU pci_dev if we needed to invalidate
mappings across a GPU reset, or perhaps if we were exposing multiple
EGM devices per socket, one for each GPU route.
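The get_or_create/put scheme Alex describes can be sketched in userspace C. The singly linked list, `egm_dev_get_or_create`, and `egm_dev_put` names are invented for illustration; the shape is a per-socket object de-duplicated by id and kept alive by a reference count instead of a GPU list:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the suggested egm_dev lifecycle: one object per
 * socket, refcounted rather than tracking individual GPUs.
 */
struct egm_dev {
	struct egm_dev *next;
	unsigned long id;	/* stands in for egmpxm */
	int refcount;
};

static struct egm_dev *egm_list;

/* Look up an existing device for this socket, or create one. */
static struct egm_dev *egm_dev_get_or_create(unsigned long id)
{
	struct egm_dev *dev;

	for (dev = egm_list; dev; dev = dev->next) {
		if (dev->id == id) {
			dev->refcount++;	/* de-dupe: share one device */
			return dev;
		}
	}

	dev = calloc(1, sizeof(*dev));
	if (!dev)
		return NULL;
	dev->id = id;
	dev->refcount = 1;
	dev->next = egm_list;
	egm_list = dev;
	return dev;
}

/* Drop a reference; unlink and free on the last put. */
static void egm_dev_put(struct egm_dev *dev)
{
	struct egm_dev **pp;

	if (--dev->refcount > 0)
		return;

	for (pp = &egm_list; *pp; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			break;
		}
	}
	free(dev);
}

static int egm_count(void)
{
	int n = 0;
	for (struct egm_dev *d = egm_list; d; d = d->next)
		n++;
	return n;
}
```

In the real driver both functions would also take the lock protecting the list, which this sketch omits.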
> > + if (node->pdev == pdev) {
> > + list_del(&node->list);
> > + kfree(node);
> > + }
Also why do we continue searching the list after a match is found?
Thanks,
Alex
> > + }
> > +}
> > +
> > static void nvgrace_gpu_release_aux_device(struct device *device)
> > {
> > struct auxiliary_device *aux_dev = container_of(device, struct
> > auxiliary_device, dev);
> > @@ -37,6 +64,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev,
> > const char *name,
> > goto create_err;
> >
> > egm_dev->egmpxm = egmpxm;
> > + INIT_LIST_HEAD(&egm_dev->gpus);
> > +
> > egm_dev->aux_dev.id = egmpxm;
> > egm_dev->aux_dev.name = name;
> > egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> > diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> > b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> > index c00f5288f4e7..1635753c9e50 100644
> > --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> > +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> > @@ -10,6 +10,10 @@
> >
> > int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
> >
> > +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> > +
> > +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> > +
> > struct nvgrace_egm_dev *
> > nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> > u64 egmphys);
> > diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> > gpu/main.c
> > index 23028e6e7192..3dd0c57e5789 100644
> > --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> > +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> > @@ -77,9 +77,10 @@ static struct list_head egm_dev_list;
> >
> > static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> > {
> > - struct nvgrace_egm_dev_entry *egm_entry;
> > + struct nvgrace_egm_dev_entry *egm_entry = NULL;
> > u64 egmpxm;
> > int ret = 0;
> > + bool is_new_region = false;
> >
> > /*
> > * EGM is an optional feature enabled in SBIOS. If disabled, there
> > @@ -90,6 +91,19 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> > pci_dev *pdev)
> > if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> > goto exit;
> >
> > + list_for_each_entry(egm_entry, &egm_dev_list, list) {
> > + /*
> > + * A system could have multiple GPUs associated with an
> > + * EGM region and will have the same set of EGM region
> > + * information. Skip the EGM region information fetch if
> > + * already done through a differnt GPU on the same socket.
> > + */
> > + if (egm_entry->egm_dev->egmpxm == egmpxm)
> > + goto add_gpu;
> > + }
> > +
> > + is_new_region = true;
> > +
> > egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
> > if (!egm_entry)
> > return -ENOMEM;
> > @@ -98,13 +112,24 @@ static int
> > nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> > nvgrace_gpu_create_aux_device(pdev,
> > NVGRACE_EGM_DEV_NAME,
> > egmpxm);
> > if (!egm_entry->egm_dev) {
> > - kvfree(egm_entry);
> > ret = -EINVAL;
> > - goto exit;
> > + goto free_egm_entry;
> > }
> >
> > - list_add_tail(&egm_entry->list, &egm_dev_list);
> > +add_gpu:
> > + ret = add_gpu(egm_entry->egm_dev, pdev);
> > + if (ret)
> > + goto free_dev;
> >
> > + if (is_new_region)
> > + list_add_tail(&egm_entry->list, &egm_dev_list);
>
> So this is where you address the previous patch comment I suppose...
> If so, need to change the commit description there.
>
> > + return 0;
> > +
> > +free_dev:
> > + if (is_new_region)
> > + auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
> > +free_egm_entry:
> > + kvfree(egm_entry);
>
> Suppose the add_gpu() above fails, then you will end up here with an existing
> egm_entry which might be in use.
>
> Thanks,
> Shameer
>
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (2 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
@ 2026-02-23 15:55 ` ankita
2026-02-26 15:12 ` Shameer Kolothum Thodi
2026-03-04 17:37 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM ankita
` (11 subsequent siblings)
15 siblings, 2 replies; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The nvgrace-gpu module tracks the various EGM regions on the system.
The EGM region information - base SPA and size - is part of the ACPI
tables and can be fetched from the DSD table using the GPU handle.
When the GPUs are bound to the nvgrace-gpu module, it fetches the EGM
region information from the ACPI table using the GPU's pci_dev. The
EGM regions are tracked in a list and the information per region is
maintained in the nvgrace_egm_dev.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 24 +++++++++++++++++++++++-
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++-
drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++++++--
include/linux/nvgrace-egm.h | 2 ++
4 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index 0bf95688a486..20291504aca8 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -17,6 +17,26 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
pegmpxm);
}
+int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
+ u64 *pegmlength)
+{
+ int ret;
+
+ /*
+ * The memory information is present in the system ACPI tables as DSD
+ * properties nvidia,egm-base-pa and nvidia,egm-size.
+ */
+ ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
+ pegmlength);
+ if (ret)
+ return ret;
+
+ ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
+ pegmphys);
+
+ return ret;
+}
+
int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
{
struct gpu_node *node;
@@ -54,7 +74,7 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmpxm)
+ u64 egmphys, u64 egmlength, u64 egmpxm)
{
struct nvgrace_egm_dev *egm_dev;
int ret;
@@ -64,6 +84,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
goto create_err;
egm_dev->egmpxm = egmpxm;
+ egm_dev->egmphys = egmphys;
+ egm_dev->egmlength = egmlength;
INIT_LIST_HEAD(&egm_dev->gpus);
egm_dev->aux_dev.id = egmpxm;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index 1635753c9e50..2e1612445898 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -16,6 +16,8 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys);
+ u64 egmphys, u64 egmlength, u64 egmpxm);
+int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
+ u64 *pegmlength);
#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 3dd0c57e5789..b356e941340a 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -78,7 +78,7 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
struct nvgrace_egm_dev_entry *egm_entry = NULL;
- u64 egmpxm;
+ u64 egmphys, egmlength, egmpxm;
int ret = 0;
bool is_new_region = false;
@@ -91,6 +91,10 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
+ ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
+ if (ret)
+ goto exit;
+
list_for_each_entry(egm_entry, &egm_dev_list, list) {
/*
* A system could have multiple GPUs associated with an
@@ -110,7 +114,7 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
egm_entry->egm_dev =
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
- egmpxm);
+ egmphys, egmlength, egmpxm);
if (!egm_entry->egm_dev) {
ret = -EINVAL;
goto free_egm_entry;
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index e42494a2b1a6..a66906753267 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -17,6 +17,8 @@ struct gpu_node {
struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
+ phys_addr_t egmphys;
+ size_t egmlength;
u64 egmpxm;
struct list_head gpus;
};
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread

* RE: [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
2026-02-23 15:55 ` [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
@ 2026-02-26 15:12 ` Shameer Kolothum Thodi
2026-03-04 17:37 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 15:12 UTC (permalink / raw)
To: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, alex@shazbot.org
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 23 February 2026 15:55
> To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alex@shazbot.org
> Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch
> and save EGM info
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The nvgrace-gpu module tracks the various EGM regions on the system.
> The EGM region information - Base SPA and size - are part of the ACPI
> tables. This can be fetched from the DSD table using the GPU handle.
>
> When the GPUs are bound to the nvgrace-gpu module, it fetches the EGM
> region information from the ACPI table using the GPU's pci_dev. The
> EGM regions are tracked in a list and the information per region is
> maintained in the nvgrace_egm_dev.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 24 +++++++++++++++++++++++-
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++-
> drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++++++--
> include/linux/nvgrace-egm.h | 2 ++
> 4 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index 0bf95688a486..20291504aca8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -17,6 +17,26 @@ int nvgrace_gpu_has_egm_property(struct pci_dev
> *pdev, u64 *pegmpxm)
> pegmpxm);
> }
>
> +int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> + u64 *pegmlength)
> +{
> + int ret;
> +
> + /*
> + * The memory information is present in the system ACPI tables as DSD
> + * properties nvidia,egm-base-pa and nvidia,egm-size.
> + */
> + ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
> + pegmlength);
> + if (ret)
> + return ret;
> +
> + ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
> + pegmphys);
> +
> + return ret;
> +}
> +
> int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> {
> struct gpu_node *node;
> @@ -54,7 +74,7 @@ static void nvgrace_gpu_release_aux_device(struct
> device *device)
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmpxm)
> + u64 egmphys, u64 egmlength, u64 egmpxm)
> {
> struct nvgrace_egm_dev *egm_dev;
> int ret;
> @@ -64,6 +84,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev,
> const char *name,
> goto create_err;
>
> egm_dev->egmpxm = egmpxm;
> + egm_dev->egmphys = egmphys;
> + egm_dev->egmlength = egmlength;
> INIT_LIST_HEAD(&egm_dev->gpus);
>
> egm_dev->aux_dev.id = egmpxm;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index 1635753c9e50..2e1612445898 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -16,6 +16,8 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev,
> struct pci_dev *pdev);
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys);
> + u64 egmphys, u64 egmlength, u64 egmpxm);
>
> +int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> + u64 *pegmlength);
> #endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 3dd0c57e5789..b356e941340a 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -78,7 +78,7 @@ static struct list_head egm_dev_list;
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> struct nvgrace_egm_dev_entry *egm_entry = NULL;
> - u64 egmpxm;
> + u64 egmphys, egmlength, egmpxm;
> int ret = 0;
> bool is_new_region = false;
>
> @@ -91,6 +91,10 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> + ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys,
> &egmlength);
> + if (ret)
> + goto exit;
> +
This should only be done if this is not the add_gpu case below.
Also, patch #3 has a comment:
" Skip the EGM region information fetch if
* already done through a different GPU on the same socket."
That probably belongs here instead.
Thanks,
Shameer
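One way to apply this suggestion is to search the existing region list first and perform the ACPI property fetch only when a new region has to be created. The sketch below is a hypothetical userspace reduction of that flow: `fetch_egm_property()`, `create_egm_aux_device()`, and the `known_pxm` table are stubs standing in for the kernel helpers and the `egm_dev_list` walk, so only the ordering is illustrated.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

#define NR_REGIONS 2
static u64 known_pxm[NR_REGIONS] = { 4, 5 }; /* EGM regions already tracked */
static int fetch_calls;                      /* counts ACPI property fetches */

/* Stub for nvgrace_gpu_fetch_egm_property(); values are illustrative. */
static int fetch_egm_property(u64 *phys, u64 *len)
{
	fetch_calls++;
	*phys = 0x480000000000ULL;
	*len = 0x100000000ULL;
	return 0;
}

static int create_egm_aux_device(u64 egmpxm)
{
	u64 egmphys, egmlength;
	int i;

	/* add_gpu case: region already tracked, skip the ACPI fetch */
	for (i = 0; i < NR_REGIONS; i++) {
		if (known_pxm[i] == egmpxm)
			return 0;
	}

	/* new region: fetch the properties, then create the aux device */
	return fetch_egm_property(&egmphys, &egmlength);
}
```

With this shape, GPUs joining an existing per-socket region never touch the DSD properties, which also addresses the duplicated-fetch comment from patch #3.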
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
2026-02-23 15:55 ` [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
2026-02-26 15:12 ` Shameer Kolothum Thodi
@ 2026-03-04 17:37 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 17:37 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:03 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The nvgrace-gpu module tracks the various EGM regions on the system.
> The EGM region information - Base SPA and size - are part of the ACPI
> tables. This can be fetched from the DSD table using the GPU handle.
>
> When the GPUs are bound to the nvgrace-gpu module, it fetches the EGM
> region information from the ACPI table using the GPU's pci_dev. The
> EGM regions are tracked in a list and the information per region is
> maintained in the nvgrace_egm_dev.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 24 +++++++++++++++++++++++-
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++-
> drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++++++--
> include/linux/nvgrace-egm.h | 2 ++
> 4 files changed, 34 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index 0bf95688a486..20291504aca8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -17,6 +17,26 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> pegmpxm);
> }
>
> +int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> + u64 *pegmlength)
> +{
> + int ret;
> +
> + /*
> + * The memory information is present in the system ACPI tables as DSD
> + * properties nvidia,egm-base-pa and nvidia,egm-size.
> + */
> + ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
> + pegmlength);
> + if (ret)
> + return ret;
> +
> + ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
> + pegmphys);
> +
> + return ret;
> +}
What guarantees that all GPUs in the same PXM have the same properties?
AIUI we only consume the resulting properties for the first GPU
associated to the egm_dev. Why do we even bother to retrieve the
properties for subsequent GPUs?
Nit, it's a bit inconsistent to partially write caller data on error
versus read to local variables and set the caller data only on success.
Thanks,
Alex
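The consistent pattern the nit asks for reads into locals and writes the caller's variables only once both property reads have succeeded. A hypothetical userspace sketch, with a stub standing in for `device_property_read_u64()` (the stub deliberately fails for "nvidia,egm-size" so the partial-failure path can be exercised):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t u64;

/* Stub for device_property_read_u64(): succeeds only for
 * "nvidia,egm-base-pa", failing the other read with -EINVAL. */
static int stub_property_read_u64(const char *prop, u64 *val)
{
	if (strcmp(prop, "nvidia,egm-base-pa") == 0) {
		*val = 0x480000000000ULL;
		return 0;
	}
	return -22; /* -EINVAL */
}

/* Read into locals first; write the caller's variables only after
 * every read succeeded, so no partial result escapes on error. */
static int fetch_egm_property(u64 *pegmphys, u64 *pegmlength)
{
	u64 phys, length;
	int ret;

	ret = stub_property_read_u64("nvidia,egm-size", &length);
	if (ret)
		return ret;

	ret = stub_property_read_u64("nvidia,egm-base-pa", &phys);
	if (ret)
		return ret;

	*pegmphys = phys;
	*pegmlength = length;
	return 0;
}
```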
> +
> int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> {
> struct gpu_node *node;
> @@ -54,7 +74,7 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmpxm)
> + u64 egmphys, u64 egmlength, u64 egmpxm)
> {
> struct nvgrace_egm_dev *egm_dev;
> int ret;
> @@ -64,6 +84,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> goto create_err;
>
> egm_dev->egmpxm = egmpxm;
> + egm_dev->egmphys = egmphys;
> + egm_dev->egmlength = egmlength;
> INIT_LIST_HEAD(&egm_dev->gpus);
>
> egm_dev->aux_dev.id = egmpxm;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index 1635753c9e50..2e1612445898 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -16,6 +16,8 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys);
> + u64 egmphys, u64 egmlength, u64 egmpxm);
>
> +int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> + u64 *pegmlength);
> #endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 3dd0c57e5789..b356e941340a 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -78,7 +78,7 @@ static struct list_head egm_dev_list;
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> struct nvgrace_egm_dev_entry *egm_entry = NULL;
> - u64 egmpxm;
> + u64 egmphys, egmlength, egmpxm;
> int ret = 0;
> bool is_new_region = false;
>
> @@ -91,6 +91,10 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> + ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
> + if (ret)
> + goto exit;
> +
> list_for_each_entry(egm_entry, &egm_dev_list, list) {
> /*
> * A system could have multiple GPUs associated with an
> @@ -110,7 +114,7 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
>
> egm_entry->egm_dev =
> nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
> - egmpxm);
> + egmphys, egmlength, egmpxm);
> if (!egm_entry->egm_dev) {
> ret = -EINVAL;
> goto free_egm_entry;
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> index e42494a2b1a6..a66906753267 100644
> --- a/include/linux/nvgrace-egm.h
> +++ b/include/linux/nvgrace-egm.h
> @@ -17,6 +17,8 @@ struct gpu_node {
>
> struct nvgrace_egm_dev {
> struct auxiliary_device aux_dev;
> + phys_addr_t egmphys;
> + size_t egmlength;
> u64 egmpxm;
> struct list_head gpus;
> };
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (3 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 18:09 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
` (10 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The Extended GPU Memory (EGM) feature enables the GPU to access
system memory allocations within and across nodes through a high
bandwidth path on Grace based systems. The GPU can utilize system
memory located on the same socket, on a different socket, or even
on a different node in a multi-node system [1].
When EGM mode is enabled through the SBIOS, the host system memory is
partitioned into two parts: one partition for Host OS usage, called
the Hypervisor region, and a second Hypervisor-Invisible (HI) region
for the VM. Only the Hypervisor region is part of the host EFI map
and is thus visible to the host OS on bootup. Since the entire VM
sysmem is eligible for EGM allocations within the VM, the HI partition
is interchangeably called the EGM region in this series. The base SPA
and size of the HI/EGM region range are exposed through ACPI DSDT
properties.
Whilst the EGM region is accessible on the host, it is not added to
the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
to the SPA using remap_pfn_range().
The following figure shows the memory map in the virtualization
environment.
|---- Sysmem ----| |--- GPU mem ---| VM Memory
| | | |
|IPA <-> SPA map | |IPA <-> SPA map|
| | | |
|--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
Introduce a new nvgrace-egm auxiliary driver module to manage and
map the HI/EGM region on Grace Blackwell systems. It binds to the
auxiliary device created by the parent nvgrace-gpu (in-tree module
for device assignment) or nvidia-vgpu-vfio (out-of-tree open source
module for SRIOV vGPU) to manage the EGM region for the VM. Note
that there is a unique EGM region per socket, and an auxiliary
device is created for every region. The parent module fetches the
EGM region information from the ACPI tables and populates the data
structures shared with the auxiliary nvgrace-egm module.
The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
from the EGM region structure shared by the parent device.
2. Create a char device that can be used as memory-backend-file by QEMU
for the VM and implement file operations. The char device is /dev/egmX,
where X is the PXM node ID of the EGM region fetched in 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range().
5. Clean up state and destroy the chardev on device unbind.
6. Handle presence of retired ECC pages on the EGM region.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 6 ++++++
drivers/vfio/pci/nvgrace-gpu/Kconfig | 12 ++++++++++++
drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +++
drivers/vfio/pci/nvgrace-gpu/egm.c | 22 ++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/main.c | 1 +
5 files changed, 44 insertions(+)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 5b3d86de9ec0..1fc551d7d667 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27384,6 +27384,12 @@ F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
F: drivers/vfio/pci/nvgrace-gpu/main.c
F: include/linux/nvgrace-egm.h
+VFIO NVIDIA GRACE EGM DRIVER
+M: Ankit Agrawal <ankita@nvidia.com>
+L: kvm@vger.kernel.org
+S: Supported
+F: drivers/vfio/pci/nvgrace-gpu/egm.c
+
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
R: Yishai Hadas <yishaih@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
index a7f624b37e41..7989d8d1c377 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Kconfig
+++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
@@ -1,8 +1,20 @@
# SPDX-License-Identifier: GPL-2.0-only
+config NVGRACE_EGM
+ tristate "EGM driver for NVIDIA Grace Hopper and Blackwell Superchip"
+ depends on ARM64 || (COMPILE_TEST && 64BIT)
+ depends on NVGRACE_GPU_VFIO_PCI
+ help
+ Extended GPU Memory (EGM) support for the GPU in the NVIDIA Grace
+ based chips required to avail the CPU memory as additional
+ cross-node/cross-socket memory for GPU using KVM/qemu.
+
+ If you don't know what to do here, say N.
+
config NVGRACE_GPU_VFIO_PCI
tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
depends on ARM64 || (COMPILE_TEST && 64BIT)
select VFIO_PCI_CORE
+ select NVGRACE_EGM
help
VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
required to assign the GPU device to userspace using KVM/qemu/etc.
diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
index e72cc6739ef8..d0d191be56b9 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Makefile
+++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
@@ -1,3 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
+
+obj-$(CONFIG_NVGRACE_EGM) += nvgrace-egm.o
+nvgrace-egm-y := egm.o
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
new file mode 100644
index 000000000000..999808807019
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/vfio_pci_core.h>
+
+static int __init nvgrace_egm_init(void)
+{
+ return 0;
+}
+
+static void __exit nvgrace_egm_cleanup(void)
+{
+}
+
+module_init(nvgrace_egm_init);
+module_exit(nvgrace_egm_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
+MODULE_DESCRIPTION("NVGRACE EGM - Module to support Extended GPU Memory on NVIDIA Grace Based systems");
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index b356e941340a..0bb427cca31f 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -1410,3 +1410,4 @@ MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
MODULE_DESCRIPTION("VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
+MODULE_SOFTDEP("pre: nvgrace-egm");
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM
2026-02-23 15:55 ` [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM ankita
@ 2026-03-04 18:09 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 18:09 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:04 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The Extended GPU Memory (EGM) feature that enables the GPU to access
> the system memory allocations within and across nodes through high
> bandwidth path on Grace Based systems. The GPU can utilize the
> system memory located on the same socket or from a different socket
> or even on a different node in a multi-node system [1].
>
> When the EGM mode is enabled through SBIOS, the host system memory is
> partitioned into 2 parts: One partition for the Host OS usage
> called Hypervisor region, and a second Hypervisor-Invisible (HI) region
> for the VM. Only the hypervisor region is part of the host EFI map
> and is thus visible to the host OS on bootup. Since the entire VM
> sysmem is eligible for EGM allocations within the VM, the HI partition
> is interchangeably called as EGM region in the series. This HI/EGM region
> range base SPA and size is exposed through the ACPI DSDT properties.
>
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
>
> The following figure shows the memory map in the virtualization
> environment.
>
> |---- Sysmem ----| |--- GPU mem ---| VM Memory
> | | | |
> |IPA <-> SPA map | |IPA <-> SPA map|
> | | | |
> |--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
>
> Introduce a new nvgrace-egm auxiliary driver module to manage and
> map the HI/EGM region in the Grace Blackwell systems. This binds to
> the auxiliary device created by the parent nvgrace-gpu (in-tree
> module for device assignment) / nvidia-vgpu-vfio (out-of-tree open
> source module for SRIOV vGPU) to manage the EGM region for the VM.
> Note that there is a unique EGM region per socket and the auxiliary
> device gets created for every region. The parent module fetches the
> EGM region information from the ACPI tables and populate to the data
> structures shared with the auxiliary nvgrace-egm module.
>
> nvgrace-egm module handles the following:
Or it will eventually, not in this commit.
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the parent device shared EGM region structure.
> 2. Create a char device that can be used as memory-backend-file by Qemu
> for the VM and implement file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM being mapped fetched in 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
> 5. Cleaning up state and destroying the chardev on device unbind.
> 6. Handle presence of retired ECC pages on the EGM region.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 6 ++++++
> drivers/vfio/pci/nvgrace-gpu/Kconfig | 12 ++++++++++++
> drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +++
> drivers/vfio/pci/nvgrace-gpu/egm.c | 22 ++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/main.c | 1 +
> 5 files changed, 44 insertions(+)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5b3d86de9ec0..1fc551d7d667 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27384,6 +27384,12 @@ F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> F: drivers/vfio/pci/nvgrace-gpu/main.c
> F: include/linux/nvgrace-egm.h
>
> +VFIO NVIDIA GRACE EGM DRIVER
> +M: Ankit Agrawal <ankita@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Supported
> +F: drivers/vfio/pci/nvgrace-gpu/egm.c
I'm not sure a separate MAINTAINERS entry is warranted here, these are
intertwined, even if constructed to allow this EGM driver to be used by
an out-of-tree driver. It's also an unclean split, with Makefile and
Kconfig dependencies under the nvgrace-gpu heading. It should probably
be self contained in a separate sub-dir to justify a new MAINTAINERS
entry.
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> index a7f624b37e41..7989d8d1c377 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Kconfig
> +++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> @@ -1,8 +1,20 @@
> # SPDX-License-Identifier: GPL-2.0-only
> +config NVGRACE_EGM
> + tristate "EGM driver for NVIDIA Grace Hopper and Blackwell Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + depends on NVGRACE_GPU_VFIO_PCI
> + help
> + Extended GPU Memory (EGM) support for the GPU in the NVIDIA Grace
> + based chips required to avail the CPU memory as additional
> + cross-node/cross-socket memory for GPU using KVM/qemu.
> +
> + If you don't know what to do here, say N.
> +
> config NVGRACE_GPU_VFIO_PCI
> tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> depends on ARM64 || (COMPILE_TEST && 64BIT)
> select VFIO_PCI_CORE
> + select NVGRACE_EGM
This should be dropped, it creates a circular dependency where we
cannot actually unselect NVGRACE_EGM with NVGRACE_GPU_VFIO_PCI
selected.
> help
> VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> required to assign the GPU device to userspace using KVM/qemu/etc.
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
> index e72cc6739ef8..d0d191be56b9 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> +
> +obj-$(CONFIG_NVGRACE_EGM) += nvgrace-egm.o
> +nvgrace-egm-y := egm.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> new file mode 100644
> index 000000000000..999808807019
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
2026
> + */
> +
> +#include <linux/vfio_pci_core.h>
Premature?
> +
> +static int __init nvgrace_egm_init(void)
> +{
> + return 0;
> +}
> +
> +static void __exit nvgrace_egm_cleanup(void)
> +{
> +}
> +
> +module_init(nvgrace_egm_init);
> +module_exit(nvgrace_egm_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> +MODULE_DESCRIPTION("NVGRACE EGM - Module to support Extended GPU Memory on NVIDIA Grace Based systems");
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index b356e941340a..0bb427cca31f 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -1410,3 +1410,4 @@ MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> MODULE_DESCRIPTION("VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
> +MODULE_SOFTDEP("pre: nvgrace-egm");
Premature and wrong if necessary. AIUI the aux device created should
generate uevents and modules loaded automatically via device tables.
Thanks,
Alex
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (4 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 18:56 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops ankita
` (9 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM regions are exposed to userspace as char devices. A unique
char device with a distinct minor number is assigned to the EGM region
of each Grace socket.
Add a new egm class and register a range of char device numbers for
the same.
Set MAX_EGM_NODES to 4, as 4-socket is the largest configuration
on Grace based systems.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 36 ++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 999808807019..6bab4d94cb99 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -4,14 +4,50 @@
*/
#include <linux/vfio_pci_core.h>
+#include <linux/nvgrace-egm.h>
+
+#define MAX_EGM_NODES 4
+
+static dev_t dev;
+static struct class *class;
+
+static char *egm_devnode(const struct device *device, umode_t *mode)
+{
+ if (mode)
+ *mode = 0600;
+
+ return NULL;
+}
static int __init nvgrace_egm_init(void)
{
+ int ret;
+
+ /*
+ * Each EGM region on a system is represented with a unique
+ * char device with a different minor number. Allow a range
+ * of char device creation.
+ */
+ ret = alloc_chrdev_region(&dev, 0, MAX_EGM_NODES,
+ NVGRACE_EGM_DEV_NAME);
+ if (ret < 0)
+ return ret;
+
+ class = class_create(NVGRACE_EGM_DEV_NAME);
+ if (IS_ERR(class)) {
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
+ return PTR_ERR(class);
+ }
+
+ class->devnode = egm_devnode;
+
return 0;
}
static void __exit nvgrace_egm_cleanup(void)
{
+ class_destroy(class);
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
}
module_init(nvgrace_egm_init);
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers
2026-02-23 15:55 ` [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
@ 2026-03-04 18:56 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 18:56 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:05 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM regions are exposed to the userspace as char devices. A unique
> char device with a different minor number is assigned to EGM region
> belonging to a different Grace socket.
>
> Add a new egm class and register a range of char device numbers for
> the same.
>
> Setting MAX_EGM_NODES as 4 as the 4-socket is the largest configuration
> on Grace based systems.
Should this be a Kconfig option or have a driver module parameter or is
this a long term limit?
The use of "nodes" here is a bit confusing too since the KVM Forum
slides show each GB "node" is composed of 2-sockets. Should this be
something like MAX_NUM_EGM?
> Suggested-by: Aniket Agashe <aniketa@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 36 ++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 999808807019..6bab4d94cb99 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -4,14 +4,50 @@
> */
>
> #include <linux/vfio_pci_core.h>
> +#include <linux/nvgrace-egm.h>
> +
> +#define MAX_EGM_NODES 4
> +
> +static dev_t dev;
> +static struct class *class;
> +
> +static char *egm_devnode(const struct device *device, umode_t *mode)
> +{
> + if (mode)
> + *mode = 0600;
> +
> + return NULL;
> +}
>
> static int __init nvgrace_egm_init(void)
> {
> + int ret;
> +
> + /*
> + * Each EGM region on a system is represented with a unique
> + * char device with a different minor number. Allow a range
> + * of char device creation.
> + */
> + ret = alloc_chrdev_region(&dev, 0, MAX_EGM_NODES,
> + NVGRACE_EGM_DEV_NAME);
This reserves a range of 4 minor numbers, 0-3, but then in 8/
we use the PXM number as the minor value, which according to 13/ seems
to result in egm4 and egm5 chardevs. So we're stomping on minor values
outside what we've reserved. Thanks,
Alex
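One way to keep the minors inside the reserved range regardless of how PXM ids are numbered is to allocate minors from an allocator and use the PXM only in the device name. In the driver this would be `ida_alloc_range()`/`ida_free()`; the sketch below is a hypothetical userspace reduction using a bitmap in place of the kernel's `struct ida`.

```c
#include <assert.h>

#define MAX_EGM_NODES 4

/* Bitmap standing in for the kernel's struct ida; the driver would use
 * ida_alloc_range(&egm_minors, 0, MAX_EGM_NODES - 1, GFP_KERNEL). */
static unsigned long minor_bitmap;

/* Hand out the lowest free minor in [0, MAX_EGM_NODES); the PXM id
 * then only names the chardev (e.g. /dev/egm4), never picks the minor. */
static int egm_minor_alloc(void)
{
	int i;

	for (i = 0; i < MAX_EGM_NODES; i++) {
		if (!(minor_bitmap & (1UL << i))) {
			minor_bitmap |= 1UL << i;
			return i;
		}
	}
	return -28; /* -ENOSPC: more regions than reserved minors */
}

static void egm_minor_free(int minor)
{
	minor_bitmap &= ~(1UL << minor);
}
```

With this scheme a system exposing PXM 4 and 5 still consumes minors 0 and 1, staying inside the `alloc_chrdev_region()` reservation.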
> + if (ret < 0)
> + return ret;
> +
> + class = class_create(NVGRACE_EGM_DEV_NAME);
> + if (IS_ERR(class)) {
> + unregister_chrdev_region(dev, MAX_EGM_NODES);
> + return PTR_ERR(class);
> + }
> +
> + class->devnode = egm_devnode;
> +
> return 0;
> }
>
> static void __exit nvgrace_egm_cleanup(void)
> {
> + class_destroy(class);
> + unregister_chrdev_region(dev, MAX_EGM_NODES);
> }
>
> module_init(nvgrace_egm_init);
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (5 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 19:06 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device ankita
` (8 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Set up dummy auxiliary driver ops so that the devices can be probed
by the nvgrace-egm auxiliary driver.
Both nvgrace-gpu and the out-of-tree nvidia-vgpu-vfio will make use
of EGM, for the device assignment and SRIOV vGPU virtualization
solutions respectively. Hence allow auxiliary device probing for both.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 38 +++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 6bab4d94cb99..6fd6302a004a 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -11,6 +11,29 @@
static dev_t dev;
static struct class *class;
+static int egm_driver_probe(struct auxiliary_device *aux_dev,
+ const struct auxiliary_device_id *id)
+{
+ return 0;
+}
+
+static void egm_driver_remove(struct auxiliary_device *aux_dev)
+{
+}
+
+static const struct auxiliary_device_id egm_id_table[] = {
+ { .name = "nvgrace_gpu_vfio_pci.egm" },
+ { },
+};
+MODULE_DEVICE_TABLE(auxiliary, egm_id_table);
+
+static struct auxiliary_driver egm_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = egm_id_table,
+ .probe = egm_driver_probe,
+ .remove = egm_driver_remove,
+};
+
static char *egm_devnode(const struct device *device, umode_t *mode)
{
if (mode)
@@ -35,17 +58,26 @@ static int __init nvgrace_egm_init(void)
class = class_create(NVGRACE_EGM_DEV_NAME);
if (IS_ERR(class)) {
- unregister_chrdev_region(dev, MAX_EGM_NODES);
- return PTR_ERR(class);
+ ret = PTR_ERR(class);
+ goto unregister_chrdev;
}
class->devnode = egm_devnode;
- return 0;
+ ret = auxiliary_driver_register(&egm_driver);
+ if (!ret)
+ goto fn_exit;
+
+ class_destroy(class);
+unregister_chrdev:
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
+fn_exit:
+ return ret;
}
static void __exit nvgrace_egm_cleanup(void)
{
+ auxiliary_driver_unregister(&egm_driver);
class_destroy(class);
unregister_chrdev_region(dev, MAX_EGM_NODES);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops
2026-02-23 15:55 ` [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops ankita
@ 2026-03-04 19:06 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 19:06 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:06 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Setup dummy auxiliary device ops to be able to get probed by
> the nvgrace-egm auxiliary driver.
>
> Both nvgrace-gpu and the out-of-tree nvidia-vgpu-vfio will make
> use of the EGM for device assignment and the SRIOV vGPU virtualization
> solutions respectively. Hence allow auxiliary device probing for both.
But only one is added?
Can you point to any other in-tree drivers that include out-of-tree
device entries in their ID table?
Isn't this ID table what should make the module soft-dep unnecessary?
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 38 +++++++++++++++++++++++++++---
> 1 file changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 6bab4d94cb99..6fd6302a004a 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -11,6 +11,29 @@
> static dev_t dev;
> static struct class *class;
>
> +static int egm_driver_probe(struct auxiliary_device *aux_dev,
> + const struct auxiliary_device_id *id)
> +{
> + return 0;
> +}
> +
> +static void egm_driver_remove(struct auxiliary_device *aux_dev)
> +{
> +}
> +
> +static const struct auxiliary_device_id egm_id_table[] = {
> + { .name = "nvgrace_gpu_vfio_pci.egm" },
> + { },
> +};
> +MODULE_DEVICE_TABLE(auxiliary, egm_id_table);
> +
> +static struct auxiliary_driver egm_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = egm_id_table,
> + .probe = egm_driver_probe,
> + .remove = egm_driver_remove,
> +};
> +
> static char *egm_devnode(const struct device *device, umode_t *mode)
> {
> if (mode)
> @@ -35,17 +58,26 @@ static int __init nvgrace_egm_init(void)
>
> class = class_create(NVGRACE_EGM_DEV_NAME);
> if (IS_ERR(class)) {
> - unregister_chrdev_region(dev, MAX_EGM_NODES);
> - return PTR_ERR(class);
> + ret = PTR_ERR(class);
> + goto unregister_chrdev;
> }
>
> class->devnode = egm_devnode;
>
> - return 0;
> + ret = auxiliary_driver_register(&egm_driver);
> + if (!ret)
> + goto fn_exit;
This is not a good success oriented flow. The error condition should
goto the unwind, the success condition can just fall through to return.
Thanks,
Alex
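The success-oriented shape being asked for jumps to unwind labels on error and falls through to `return 0` on success. A hypothetical userspace sketch of the restructured init, with stubs (simplified signatures and a call log, so the unwind order can be checked) standing in for the kernel APIs:

```c
#include <assert.h>
#include <string.h>

/* Stubs for the kernel APIs; signatures are simplified (the real
 * class_create() returns a pointer checked with IS_ERR()). */
static char log_buf[128];
static int fail_aux_register; /* simulate auxiliary_driver_register() failing */

static int alloc_chrdev_region(void) { strcat(log_buf, "alloc;"); return 0; }
static void unregister_chrdev_region(void) { strcat(log_buf, "unreg;"); }
static int class_create(void) { strcat(log_buf, "class;"); return 0; }
static void class_destroy(void) { strcat(log_buf, "destroy;"); }
static int auxiliary_driver_register(void)
{
	strcat(log_buf, "aux;");
	return fail_aux_register ? -19 : 0; /* -ENODEV */
}

/* Errors goto the unwind labels; success falls through to return 0. */
static int nvgrace_egm_init(void)
{
	int ret;

	ret = alloc_chrdev_region();
	if (ret < 0)
		return ret;

	ret = class_create();
	if (ret)
		goto unregister_chrdev;

	ret = auxiliary_driver_register();
	if (ret)
		goto destroy_class;

	return 0;

destroy_class:
	class_destroy();
unregister_chrdev:
	unregister_chrdev_region();
	return ret;
}
```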
> +
> + class_destroy(class);
> +unregister_chrdev:
> + unregister_chrdev_region(dev, MAX_EGM_NODES);
> +fn_exit:
> + return ret;
> }
>
> static void __exit nvgrace_egm_cleanup(void)
> {
> + auxiliary_driver_unregister(&egm_driver);
> class_destroy(class);
> unregister_chrdev_region(dev, MAX_EGM_NODES);
> }
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (6 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops ankita
@ 2026-02-23 15:55 ` ankita
2026-02-26 17:08 ` Shameer Kolothum Thodi
2026-03-04 20:16 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
` (7 subsequent siblings)
15 siblings, 2 replies; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM module exposes the various EGM regions as char devices. A
usermode app such as QEMU may mmap a region and use it as VM system
memory. Each EGM region is represented by a unique char device
/dev/egmX bearing a distinct minor number.
The EGM module implements the mmap file_ops to manage the usermode
app's VMA mapping to the EGM region. The appropriate region is
determined from the minor number.
Note that the EGM memory region is invisible to the host kernel as it
is not present in the host EFI map. The host Linux MM thus cannot
manage the memory, even though it is accessible through the host SPA.
The EGM module therefore uses remap_pfn_range() to map the VMA to the
EGM region.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 99 ++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 6fd6302a004a..d7e4f61a241c 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -10,15 +10,114 @@
static dev_t dev;
static struct class *class;
+static DEFINE_XARRAY(egm_chardevs);
+
+struct chardev {
+ struct device device;
+ struct cdev cdev;
+};
+
+static int nvgrace_egm_open(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int nvgrace_egm_release(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return 0;
+}
+
+static const struct file_operations file_ops = {
+ .owner = THIS_MODULE,
+ .open = nvgrace_egm_open,
+ .release = nvgrace_egm_release,
+ .mmap = nvgrace_egm_mmap,
+};
+
+static void egm_chardev_release(struct device *dev)
+{
+ struct chardev *egm_chardev = container_of(dev, struct chardev, device);
+
+ kfree(egm_chardev);
+}
+
+static struct chardev *
+setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
+{
+ struct chardev *egm_chardev;
+ int ret;
+
+ egm_chardev = kzalloc(sizeof(*egm_chardev), GFP_KERNEL);
+ if (!egm_chardev)
+ goto create_err;
+
+ device_initialize(&egm_chardev->device);
+
+ /*
+ * Use the proximity domain number as the device minor
+ * number. So the EGM corresponding to node X would be
+ * /dev/egmX.
+ */
+ egm_chardev->device.devt = MKDEV(MAJOR(dev), egm_dev->egmpxm);
+ egm_chardev->device.class = class;
+ egm_chardev->device.release = egm_chardev_release;
+ egm_chardev->device.parent = &egm_dev->aux_dev.dev;
+ cdev_init(&egm_chardev->cdev, &file_ops);
+ egm_chardev->cdev.owner = THIS_MODULE;
+
+ ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
+ if (ret)
+ goto error_exit;
+
+ ret = cdev_device_add(&egm_chardev->cdev, &egm_chardev->device);
+ if (ret)
+ goto error_exit;
+
+ return egm_chardev;
+
+error_exit:
+ put_device(&egm_chardev->device);
+create_err:
+ return NULL;
+}
+
+static void del_egm_chardev(struct chardev *egm_chardev)
+{
+ cdev_device_del(&egm_chardev->cdev, &egm_chardev->device);
+ put_device(&egm_chardev->device);
+}
static int egm_driver_probe(struct auxiliary_device *aux_dev,
const struct auxiliary_device_id *id)
{
+ struct nvgrace_egm_dev *egm_dev =
+ container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+ struct chardev *egm_chardev;
+
+ egm_chardev = setup_egm_chardev(egm_dev);
+ if (!egm_chardev)
+ return -EINVAL;
+
+ xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
+
return 0;
}
static void egm_driver_remove(struct auxiliary_device *aux_dev)
{
+ struct nvgrace_egm_dev *egm_dev =
+ container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+ struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
+
+ if (!egm_chardev)
+ return;
+
+ del_egm_chardev(egm_chardev);
}
static const struct auxiliary_device_id egm_id_table[] = {
--
2.34.1
^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device
2026-02-23 15:55 ` [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device ankita
@ 2026-02-26 17:08 ` Shameer Kolothum Thodi
2026-03-04 20:16 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 17:08 UTC (permalink / raw)
To: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, alex@shazbot.org
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 23 February 2026 15:55
> To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alex@shazbot.org
> Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char
> device
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM module expose the various EGM regions as a char device. A
> usermode app such as Qemu may mmap to the region and use as VM
> sysmem.
> Each EGM region is represented with a unique char device /dev/egmX
> bearing a distinct minor number.
>
> EGM module implements the mmap file_ops to manage the usermode app's
> VMA mapping to the EGM region. The appropriate region is determined
> from the minor number.
>
> Note that the EGM memory region is invisible to the host kernel as it
> is not present in the host EFI map. The host Linux MM thus cannot manage
> the memory, even though it is accessible on the host SPA. The EGM module
> thus use remap_pfn_range() to perform the VMA mapping to the EGM region.
>
> Suggested-by: Aniket Agashe <aniketa@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 99 ++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 6fd6302a004a..d7e4f61a241c 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -10,15 +10,114 @@
>
> static dev_t dev;
> static struct class *class;
> +static DEFINE_XARRAY(egm_chardevs);
> +
> +struct chardev {
> + struct device device;
> + struct cdev cdev;
> +};
> +
> +static int nvgrace_egm_open(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +static int nvgrace_egm_release(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + return 0;
> +}
> +
> +static const struct file_operations file_ops = {
> + .owner = THIS_MODULE,
> + .open = nvgrace_egm_open,
> + .release = nvgrace_egm_release,
> + .mmap = nvgrace_egm_mmap,
> +};
> +
> +static void egm_chardev_release(struct device *dev)
> +{
> + struct chardev *egm_chardev = container_of(dev, struct chardev, device);
> +
> + kfree(egm_chardev);
> +}
> +
> +static struct chardev *
> +setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
> +{
> + struct chardev *egm_chardev;
> + int ret;
> +
> + egm_chardev = kzalloc(sizeof(*egm_chardev), GFP_KERNEL);
> + if (!egm_chardev)
> + goto create_err;
Just return NULL; here instead and get rid of the create_err label.
> +
> + device_initialize(&egm_chardev->device);
> +
> + /*
> + * Use the proximity domain number as the device minor
> + * number. So the EGM corresponding to node X would be
> + * /dev/egmX.
> + */
> + egm_chardev->device.devt = MKDEV(MAJOR(dev), egm_dev->egmpxm);
> + egm_chardev->device.class = class;
> + egm_chardev->device.release = egm_chardev_release;
> + egm_chardev->device.parent = &egm_dev->aux_dev.dev;
> + cdev_init(&egm_chardev->cdev, &file_ops);
> + egm_chardev->cdev.owner = THIS_MODULE;
> +
> + ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
> + if (ret)
> + goto error_exit;
> +
> + ret = cdev_device_add(&egm_chardev->cdev, &egm_chardev->device);
> + if (ret)
> + goto error_exit;
> +
> + return egm_chardev;
> +
> +error_exit:
> + put_device(&egm_chardev->device);
> +create_err:
> + return NULL;
> +}
> +
> +static void del_egm_chardev(struct chardev *egm_chardev)
> +{
> + cdev_device_del(&egm_chardev->cdev, &egm_chardev->device);
> + put_device(&egm_chardev->device);
> +}
>
> static int egm_driver_probe(struct auxiliary_device *aux_dev,
> const struct auxiliary_device_id *id)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev;
> +
> + egm_chardev = setup_egm_chardev(egm_dev);
> + if (!egm_chardev)
> + return -EINVAL;
> +
> + xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
> +
> return 0;
> }
>
> static void egm_driver_remove(struct auxiliary_device *aux_dev)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
> +
> + if (!egm_chardev)
> + return;
> +
> + del_egm_chardev(egm_chardev);
Is this safe if there is still a file in use, e.g. if QEMU has /dev/egm0 open?
Thanks,
Shameer
> }
>
> static const struct auxiliary_device_id egm_id_table[] = {
> --
> 2.34.1
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device
2026-02-23 15:55 ` [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device ankita
2026-02-26 17:08 ` Shameer Kolothum Thodi
@ 2026-03-04 20:16 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 20:16 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:07 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM module expose the various EGM regions as a char device. A
> usermode app such as Qemu may mmap to the region and use as VM sysmem.
> Each EGM region is represented with a unique char device /dev/egmX
> bearing a distinct minor number.
>
> EGM module implements the mmap file_ops to manage the usermode app's
> VMA mapping to the EGM region. The appropriate region is determined
> from the minor number.
>
> Note that the EGM memory region is invisible to the host kernel as it
> is not present in the host EFI map. The host Linux MM thus cannot manage
> the memory, even though it is accessible on the host SPA. The EGM module
> thus use remap_pfn_range() to perform the VMA mapping to the EGM region.
>
> Suggested-by: Aniket Agashe <aniketa@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 99 ++++++++++++++++++++++++++++++
> 1 file changed, 99 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 6fd6302a004a..d7e4f61a241c 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -10,15 +10,114 @@
>
> static dev_t dev;
> static struct class *class;
> +static DEFINE_XARRAY(egm_chardevs);
> +
> +struct chardev {
> + struct device device;
> + struct cdev cdev;
> +};
> +
> +static int nvgrace_egm_open(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +static int nvgrace_egm_release(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + return 0;
At this point it seems none of these stubs should return success.
> +}
> +
> +static const struct file_operations file_ops = {
> + .owner = THIS_MODULE,
> + .open = nvgrace_egm_open,
> + .release = nvgrace_egm_release,
> + .mmap = nvgrace_egm_mmap,
> +};
> +
> +static void egm_chardev_release(struct device *dev)
> +{
> + struct chardev *egm_chardev = container_of(dev, struct chardev, device);
> +
> + kfree(egm_chardev);
> +}
> +
> +static struct chardev *
> +setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
> +{
> + struct chardev *egm_chardev;
> + int ret;
> +
> + egm_chardev = kzalloc(sizeof(*egm_chardev), GFP_KERNEL);
> + if (!egm_chardev)
> + goto create_err;
Return ERR_PTR(-ENOMEM); here instead. Same for the remaining error returns.
> +
> + device_initialize(&egm_chardev->device);
> +
> + /*
> + * Use the proximity domain number as the device minor
> + * number. So the EGM corresponding to node X would be
> + * /dev/egmX.
> + */
> + egm_chardev->device.devt = MKDEV(MAJOR(dev), egm_dev->egmpxm);
As in the previous comment, we have no guarantee that the PXM value is in
the range 0-3 of the reserved minor numbers.
> + egm_chardev->device.class = class;
> + egm_chardev->device.release = egm_chardev_release;
> + egm_chardev->device.parent = &egm_dev->aux_dev.dev;
> + cdev_init(&egm_chardev->cdev, &file_ops);
> + egm_chardev->cdev.owner = THIS_MODULE;
> +
> + ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
> + if (ret)
> + goto error_exit;
> +
> + ret = cdev_device_add(&egm_chardev->cdev, &egm_chardev->device);
> + if (ret)
> + goto error_exit;
> +
> + return egm_chardev;
> +
> +error_exit:
> + put_device(&egm_chardev->device);
> +create_err:
> + return NULL;
> +}
> +
> +static void del_egm_chardev(struct chardev *egm_chardev)
> +{
> + cdev_device_del(&egm_chardev->cdev, &egm_chardev->device);
> + put_device(&egm_chardev->device);
> +}
>
> static int egm_driver_probe(struct auxiliary_device *aux_dev,
> const struct auxiliary_device_id *id)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev;
> +
> + egm_chardev = setup_egm_chardev(egm_dev);
> + if (!egm_chardev)
> + return -EINVAL;
Use IS_ERR() and don't clobber the return value.
> +
> + xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
Return value unchecked. Isn't this xarray just replacing stuffing this
in drvdata? Why?
> +
> return 0;
> }
>
> static void egm_driver_remove(struct auxiliary_device *aux_dev)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
> +
> + if (!egm_chardev)
> + return;
> +
> + del_egm_chardev(egm_chardev);
No evidence yet of lifecycle management if there's an outstanding
opened chardev. Thanks,
Alex
> }
>
> static const struct auxiliary_device_id egm_id_table[] = {
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (7 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 22:04 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
` (6 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM module implements the mmap file_ops to manage the usermode
app's VMA mapping to the EGM region. The appropriate region is
determined from the minor number.
Note that the EGM memory region is invisible to the host kernel as it
is not present in the host EFI map. The host Linux MM thus cannot
manage the memory, even though it is accessible through the host SPA.
The EGM module therefore uses remap_pfn_range() to map the VMA to the
EGM region.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 41 +++++++++++++++++++++++++++++-
include/linux/nvgrace-egm.h | 1 +
2 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index d7e4f61a241c..5786ebe374a5 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -17,19 +17,58 @@ struct chardev {
struct cdev cdev;
};
+static struct nvgrace_egm_dev *
+egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
+{
+ struct auxiliary_device *aux_dev =
+ container_of(egm_chardev->device.parent, struct auxiliary_device, dev);
+
+ return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+}
+
static int nvgrace_egm_open(struct inode *inode, struct file *file)
{
+ struct chardev *egm_chardev =
+ container_of(inode->i_cdev, struct chardev, cdev);
+
+ file->private_data = egm_chardev;
+
return 0;
}
static int nvgrace_egm_release(struct inode *inode, struct file *file)
{
+ file->private_data = NULL;
+
return 0;
}
static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
{
- return 0;
+ struct chardev *egm_chardev = file->private_data;
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+ u64 req_len, pgoff, end;
+ unsigned long start_pfn;
+
+ pgoff = vma->vm_pgoff &
+ ((1U << (EGM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+ if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
+ check_add_overflow(PHYS_PFN(egm_dev->egmphys), pgoff, &start_pfn) ||
+ check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
+ return -EOVERFLOW;
+
+ if (end > egm_dev->egmlength)
+ return -EINVAL;
+
+ /*
+ * EGM memory is invisible to the host kernel and is not managed
+ * by it. Map the usermode VMA to the EGM region.
+ */
+ return remap_pfn_range(vma, vma->vm_start,
+ start_pfn, req_len,
+ vma->vm_page_prot);
}
static const struct file_operations file_ops = {
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index a66906753267..b9956e7e5a0e 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -9,6 +9,7 @@
#include <linux/auxiliary_bus.h>
#define NVGRACE_EGM_DEV_NAME "egm"
+#define EGM_OFFSET_SHIFT 40
struct gpu_node {
struct list_head list;
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management
2026-02-23 15:55 ` [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
@ 2026-03-04 22:04 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 22:04 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:08 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> EGM module implements the mmap file_ops to manage the usermode app's
> VMA mapping to the EGM region. The appropriate region is determined
> from the minor number.
>
> Note that the EGM memory region is invisible to the host kernel as it
> is not present in the host EFI map. The host Linux MM thus cannot manage
> the memory, even though it is accessible on the host SPA. The EGM module
> thus use remap_pfn_range() to perform the VMA mapping to the EGM region.
>
> Suggested-by: Aniket Agashe <aniketa@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 41 +++++++++++++++++++++++++++++-
> include/linux/nvgrace-egm.h | 1 +
> 2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index d7e4f61a241c..5786ebe374a5 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -17,19 +17,58 @@ struct chardev {
> struct cdev cdev;
> };
>
> +static struct nvgrace_egm_dev *
> +egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
> +{
> + struct auxiliary_device *aux_dev =
> + container_of(egm_chardev->device.parent, struct auxiliary_device, dev);
> +
> + return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> +}
> +
> static int nvgrace_egm_open(struct inode *inode, struct file *file)
> {
> + struct chardev *egm_chardev =
> + container_of(inode->i_cdev, struct chardev, cdev);
> +
> + file->private_data = egm_chardev;
> +
No reference is taken on the EGM device; nothing blocks it from being removed.
> return 0;
> }
>
> static int nvgrace_egm_release(struct inode *inode, struct file *file)
> {
> + file->private_data = NULL;
Unnecessary.
> +
> return 0;
> }
>
> static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
> {
> - return 0;
> + struct chardev *egm_chardev = file->private_data;
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + u64 req_len, pgoff, end;
> + unsigned long start_pfn;
> +
> + pgoff = vma->vm_pgoff &
> + ((1U << (EGM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
I don't know what you're doing here with EGM_OFFSET_SHIFT other than
ignoring the high bits and creating aliases across the device file
address space for no(?) reason. Looks like pointlessly copying vfio's
region segmentation.
> +
> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> + check_add_overflow(PHYS_PFN(egm_dev->egmphys), pgoff, &start_pfn) ||
> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> + return -EOVERFLOW;
> +
> + if (end > egm_dev->egmlength)
> + return -EINVAL;
> +
> + /*
> + * EGM memory is invisible to the host kernel and is not managed
> + * by it. Map the usermode VMA to the EGM region.
> + */
> + return remap_pfn_range(vma, vma->vm_start,
> + start_pfn, req_len,
> + vma->vm_page_prot);
Obviously there are concerns about how this relates not only to the
state of the device in routing access, but also to the lifetime of this
mapping, as there's no reference tracking whatsoever. Thanks,
Alex
> }
>
> static const struct file_operations file_ops = {
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> index a66906753267..b9956e7e5a0e 100644
> --- a/include/linux/nvgrace-egm.h
> +++ b/include/linux/nvgrace-egm.h
> @@ -9,6 +9,7 @@
> #include <linux/auxiliary_bus.h>
>
> #define NVGRACE_EGM_DEV_NAME "egm"
> +#define EGM_OFFSET_SHIFT 40
>
> struct gpu_node {
> struct list_head list;
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (8 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
@ 2026-02-23 15:55 ` ankita
2026-02-26 18:15 ` Shameer Kolothum Thodi
2026-03-04 22:14 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
` (5 subsequent siblings)
15 siblings, 2 replies; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM region is invisible to the host Linux kernel, which therefore
does not manage it. The EGM module manages the EGM memory and is thus
responsible for clearing the region before handing it out to the VM.
Clear the EGM region on EGM chardev open. To avoid soft-lockup
warnings, zero the region in 1G chunks.
Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 43 ++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 5786ebe374a5..de7771a4145d 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -15,6 +15,7 @@ static DEFINE_XARRAY(egm_chardevs);
struct chardev {
struct device device;
struct cdev cdev;
+ atomic_t open_count;
};
static struct nvgrace_egm_dev *
@@ -30,6 +31,42 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
{
struct chardev *egm_chardev =
container_of(inode->i_cdev, struct chardev, cdev);
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+ void *memaddr;
+
+ if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
+ return -EBUSY;
+
+ /*
+ * nvgrace-egm module is responsible to manage the EGM memory as
+ * the host kernel has no knowledge of it. Clear the region before
+ * handing over to userspace.
+ */
+ memaddr = memremap(egm_dev->egmphys, egm_dev->egmlength, MEMREMAP_WB);
+ if (!memaddr) {
+ atomic_dec(&egm_chardev->open_count);
+ return -ENOMEM;
+ }
+
+ /*
+ * Clear in chunks of 1G to avoid CPU lockup logs.
+ */
+ {
+ size_t remaining = egm_dev->egmlength;
+ u8 *chunk_addr = (u8 *)memaddr;
+ size_t chunk_size;
+
+ while (remaining > 0) {
+ chunk_size = min(remaining, SZ_1G);
+ memset(chunk_addr, 0, chunk_size);
+ cond_resched();
+ chunk_addr += chunk_size;
+ remaining -= chunk_size;
+ }
+ }
+
+ memunmap(memaddr);
file->private_data = egm_chardev;
@@ -38,8 +75,13 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
static int nvgrace_egm_release(struct inode *inode, struct file *file)
{
+ struct chardev *egm_chardev =
+ container_of(inode->i_cdev, struct chardev, cdev);
+
file->private_data = NULL;
+ atomic_dec(&egm_chardev->open_count);
+
return 0;
}
@@ -108,6 +150,7 @@ setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
egm_chardev->device.parent = &egm_dev->aux_dev.dev;
cdev_init(&egm_chardev->cdev, &file_ops);
egm_chardev->cdev.owner = THIS_MODULE;
+ atomic_set(&egm_chardev->open_count, 0);
ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
if (ret)
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* RE: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM
2026-02-23 15:55 ` [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
@ 2026-02-26 18:15 ` Shameer Kolothum Thodi
2026-02-26 18:56 ` Jason Gunthorpe
2026-03-04 22:14 ` Alex Williamson
1 sibling, 1 reply; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 18:15 UTC (permalink / raw)
To: Ankit Agrawal, Vikram Sethi, Jason Gunthorpe, Matt Ochs,
jgg@ziepe.ca, alex@shazbot.org
Cc: Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 23 February 2026 15:55
> To: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Jason Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> jgg@ziepe.ca; Shameer Kolothum Thodi <skolothumtho@nvidia.com>;
> alex@shazbot.org
> Cc: Neo Jia <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant
> Jaju <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before
> handing out to VM
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM region is invisible to the host Linux kernel and it does not
> manage the region. The EGM module manages the EGM memory and thus is
> responsible to clear out the region before handing out to the VM.
>
> Clear EGM region on EGM chardev open. To avoid CPU lockup logs,
> zap the region in 1G chunks.
>
> Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 43 ++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 5786ebe374a5..de7771a4145d 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -15,6 +15,7 @@ static DEFINE_XARRAY(egm_chardevs);
> struct chardev {
> struct device device;
> struct cdev cdev;
> + atomic_t open_count;
> };
>
> static struct nvgrace_egm_dev *
> @@ -30,6 +31,42 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
> {
> struct chardev *egm_chardev =
> container_of(inode->i_cdev, struct chardev, cdev);
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + void *memaddr;
> +
> + if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
> + return -EBUSY;
> +
> + /*
> + * nvgrace-egm module is responsible to manage the EGM memory as
> + * the host kernel has no knowledge of it. Clear the region before
> + * handing over to userspace.
> + */
> + memaddr = memremap(egm_dev->egmphys, egm_dev->egmlength, MEMREMAP_WB);
> + if (!memaddr) {
> + atomic_dec(&egm_chardev->open_count);
> + return -ENOMEM;
> + }
> +
> + /*
> + * Clear in chunks of 1G to avoid CPU lockup logs.
> + */
> + {
> + size_t remaining = egm_dev->egmlength;
> + u8 *chunk_addr = (u8 *)memaddr;
> + size_t chunk_size;
> +
> + while (remaining > 0) {
> + chunk_size = min(remaining, SZ_1G);
> + memset(chunk_addr, 0, chunk_size);
> + cond_resched();
> + chunk_addr += chunk_size;
> + remaining -= chunk_size;
> + }
> + }
> +
> + memunmap(memaddr);
I am not sure this is safe. If userspace does:
open(fd)
mmap()
close(fd)
The mmap mapping stays alive and accessible in userspace even after
the close(). Since the release function decrements open_count on close(),
a second process could then call open() and wipe the mapping while it's
still live.
I may be wrong, but please double check the mapping lifecycle here.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM
2026-02-26 18:15 ` Shameer Kolothum Thodi
@ 2026-02-26 18:56 ` Jason Gunthorpe
2026-02-26 19:29 ` Shameer Kolothum Thodi
0 siblings, 1 reply; 42+ messages in thread
From: Jason Gunthorpe @ 2026-02-26 18:56 UTC (permalink / raw)
To: Shameer Kolothum Thodi
Cc: Ankit Agrawal, Vikram Sethi, Matt Ochs, alex@shazbot.org, Neo Jia,
Zhi Wang, Krishnakant Jaju, Yishai Hadas, kevin.tian@intel.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Thu, Feb 26, 2026 at 06:15:33PM +0000, Shameer Kolothum Thodi wrote:
> The mmap mapping stays alive and accessible in userspace even after
> the close(). Since the release function decrements open_count on close(),
> a second process could then call open() and wipe the mapping while it's
> still live.
fops release is not called until the mmap is closed too, the VMA holds
a struct file pointer as well. close does not call release, close
calls fput and fput calls release when the struct file refcount is 0.
Jason
^ permalink raw reply [flat|nested] 42+ messages in thread
* RE: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM
2026-02-26 18:56 ` Jason Gunthorpe
@ 2026-02-26 19:29 ` Shameer Kolothum Thodi
0 siblings, 0 replies; 42+ messages in thread
From: Shameer Kolothum Thodi @ 2026-02-26 19:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ankit Agrawal, Vikram Sethi, Matt Ochs, alex@shazbot.org, Neo Jia,
Zhi Wang, Krishnakant Jaju, Yishai Hadas, kevin.tian@intel.com,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 26 February 2026 18:57
> To: Shameer Kolothum Thodi <skolothumtho@nvidia.com>
> Cc: Ankit Agrawal <ankita@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>;
> Matt Ochs <mochs@nvidia.com>; alex@shazbot.org; Neo Jia
> <cjia@nvidia.com>; Zhi Wang <zhiw@nvidia.com>; Krishnakant Jaju
> <kjaju@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>;
> kevin.tian@intel.com; kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before
> handing out to VM
>
> On Thu, Feb 26, 2026 at 06:15:33PM +0000, Shameer Kolothum Thodi wrote:
> > The mmap mapping stays alive and accessible in userspace even after
> > the close(). Since the release function decrements open_count on close(),
> > a second process could then call open() and wipe the mapping while it's
> > still live.
>
> fops release is not called until the mmap is closed too, the VMA holds
> a struct file pointer as well. close does not call release, close
> calls fput and fput calls release when the struct file refcount is 0.
Ah.. I wasn't sure about the part where release isn't called until the
mmap is closed. Thanks for that explanation.
Shameer
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM
2026-02-23 15:55 ` [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
2026-02-26 18:15 ` Shameer Kolothum Thodi
@ 2026-03-04 22:14 ` Alex Williamson
1 sibling, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 22:14 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:09 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM region is invisible to the host Linux kernel, which does not
> manage the region. The EGM module manages the EGM memory and is thus
> responsible for clearing the region before handing it out to the VM.
> 
> Clear the EGM region on EGM chardev open. To avoid CPU lockup logs,
> zap the region in 1G chunks.
>
> Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 43 ++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 5786ebe374a5..de7771a4145d 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -15,6 +15,7 @@ static DEFINE_XARRAY(egm_chardevs);
> struct chardev {
> struct device device;
> struct cdev cdev;
> + atomic_t open_count;
> };
>
> static struct nvgrace_egm_dev *
> @@ -30,6 +31,42 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
> {
> struct chardev *egm_chardev =
> container_of(inode->i_cdev, struct chardev, cdev);
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + void *memaddr;
> +
> + if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
> + return -EBUSY;
> +
> + /*
> + * nvgrace-egm module is responsible to manage the EGM memory as
> + * the host kernel has no knowledge of it. Clear the region before
> + * handing over to userspace.
> + */
> + memaddr = memremap(egm_dev->egmphys, egm_dev->egmlength, MEMREMAP_WB);
> + if (!memaddr) {
> + atomic_dec(&egm_chardev->open_count);
> + return -ENOMEM;
> + }
> +
> + /*
> + * Clear in chunks of 1G to avoid CPU lockup logs.
> + */
> + {
> + size_t remaining = egm_dev->egmlength;
> + u8 *chunk_addr = (u8 *)memaddr;
> + size_t chunk_size;
Declare at the start of the function and remove this scope hack.
> +
> + while (remaining > 0) {
> + chunk_size = min(remaining, SZ_1G);
min_t(size_t,,);
> + memset(chunk_addr, 0, chunk_size);
> + cond_resched();
> + chunk_addr += chunk_size;
> + remaining -= chunk_size;
> + }
> + }
Aren't we going to want to do this asynchronously or run multiple
threads to avoid stalling VM launch?
> +
> + memunmap(memaddr);
>
> file->private_data = egm_chardev;
>
> @@ -38,8 +75,13 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
>
> static int nvgrace_egm_release(struct inode *inode, struct file *file)
> {
> + struct chardev *egm_chardev =
> + container_of(inode->i_cdev, struct chardev, cdev);
> +
> file->private_data = NULL;
>
> + atomic_dec(&egm_chardev->open_count);
> +
> return 0;
> }
>
> @@ -108,6 +150,7 @@ setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
> egm_chardev->device.parent = &egm_dev->aux_dev.dev;
> cdev_init(&egm_chardev->cdev, &file_ops);
> egm_chardev->cdev.owner = THIS_MODULE;
> + atomic_set(&egm_chardev->open_count, 0);
Already zero from kzalloc. Thanks,
Alex
>
> ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
> if (ret)
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (9 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 22:37 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
` (4 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
It is possible for some system memory pages on the EGM to have
uncorrectable ECC errors. A list of pages known to have such errors
(referred to as retired pages) is maintained by the Host UEFI. The
Host UEFI populates the list in a reserved region and communicates
the SPA of this region through an ACPI DSDT property.
The nvgrace-egm module is responsible for storing the list of retired
page offsets to be made available to usermode processes. The module:
1. Gets the reserved memory region SPA and maps it to fetch
the list of bad pages.
2. Calculates the retired page offsets in the EGM and stores them.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 75 ++++++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 32 +++++++++--
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 5 +-
drivers/vfio/pci/nvgrace-gpu/main.c | 8 +--
include/linux/nvgrace-egm.h | 2 +
5 files changed, 112 insertions(+), 10 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index de7771a4145d..077de3833046 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -8,6 +8,11 @@
#define MAX_EGM_NODES 4
+struct h_node {
+ unsigned long mem_offset;
+ struct hlist_node node;
+};
+
static dev_t dev;
static struct class *class;
static DEFINE_XARRAY(egm_chardevs);
@@ -16,6 +21,7 @@ struct chardev {
struct device device;
struct cdev cdev;
atomic_t open_count;
+ DECLARE_HASHTABLE(htbl, 0x10);
};
static struct nvgrace_egm_dev *
@@ -174,20 +180,88 @@ static void del_egm_chardev(struct chardev *egm_chardev)
put_device(&egm_chardev->device);
}
+static void cleanup_retired_pages(struct chardev *egm_chardev)
+{
+ struct h_node *cur_page;
+ unsigned long bkt;
+ struct hlist_node *temp_node;
+
+ hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page, node) {
+ hash_del(&cur_page->node);
+ kvfree(cur_page);
+ }
+}
+
+static int nvgrace_egm_fetch_retired_pages(struct nvgrace_egm_dev *egm_dev,
+ struct chardev *egm_chardev)
+{
+ u64 count;
+ void *memaddr;
+ int index, ret = 0;
+
+ memaddr = memremap(egm_dev->retiredpagesphys, PAGE_SIZE, MEMREMAP_WB);
+ if (!memaddr)
+ return -ENOMEM;
+
+ count = *(u64 *)memaddr;
+ if (count > PAGE_SIZE / sizeof(count))
+ return -EINVAL;
+
+ for (index = 0; index < count; index++) {
+ struct h_node *retired_page;
+
+ /*
+ * Since the EGM is linearly mapped, the offset in the
+ * carveout is the same offset in the VM system memory.
+ *
+ * Calculate the offset to communicate to the usermode
+ * apps.
+ */
+ retired_page = kzalloc(sizeof(*retired_page), GFP_KERNEL);
+ if (!retired_page) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ retired_page->mem_offset = *((u64 *)memaddr + index + 1) -
+ egm_dev->egmphys;
+ hash_add(egm_chardev->htbl, &retired_page->node,
+ retired_page->mem_offset);
+ }
+
+ memunmap(memaddr);
+
+ if (ret)
+ cleanup_retired_pages(egm_chardev);
+
+ return ret;
+}
+
static int egm_driver_probe(struct auxiliary_device *aux_dev,
const struct auxiliary_device_id *id)
{
struct nvgrace_egm_dev *egm_dev =
container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
struct chardev *egm_chardev;
+ int ret;
egm_chardev = setup_egm_chardev(egm_dev);
if (!egm_chardev)
return -EINVAL;
+ hash_init(egm_chardev->htbl);
+
+ ret = nvgrace_egm_fetch_retired_pages(egm_dev, egm_chardev);
+ if (ret)
+ goto error_exit;
+
xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
return 0;
+
+error_exit:
+ del_egm_chardev(egm_chardev);
+ return ret;
}
static void egm_driver_remove(struct auxiliary_device *aux_dev)
@@ -199,6 +273,7 @@ static void egm_driver_remove(struct auxiliary_device *aux_dev)
if (!egm_chardev)
return;
+ cleanup_retired_pages(egm_chardev);
del_egm_chardev(egm_chardev);
}
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index 20291504aca8..6d716c3a3257 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -18,22 +18,41 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
}
int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
- u64 *pegmlength)
+ u64 *pegmlength, u64 *pretiredpagesphys)
{
int ret;
/*
- * The memory information is present in the system ACPI tables as DSD
- * properties nvidia,egm-base-pa and nvidia,egm-size.
+ * The EGM memory information is present in the system ACPI tables
+ * as DSD properties nvidia,egm-base-pa and nvidia,egm-size.
*/
ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
pegmlength);
if (ret)
- return ret;
+ goto error_exit;
ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
pegmphys);
+ if (ret)
+ goto error_exit;
+
+ /*
+ * SBIOS puts the list of retired pages on a region. The region
+ * SPA is exposed as "nvidia,egm-retired-pages-data-base".
+ */
+ ret = device_property_read_u64(&pdev->dev,
+ "nvidia,egm-retired-pages-data-base",
+ pretiredpagesphys);
+ if (ret)
+ goto error_exit;
+
+ /* Catch firmware bug and avoid a crash */
+ if (*pretiredpagesphys == 0) {
+ dev_err(&pdev->dev, "Retired pages region is not setup\n");
+ ret = -EINVAL;
+ }
+error_exit:
return ret;
}
@@ -74,7 +93,8 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys, u64 egmlength, u64 egmpxm)
+ u64 egmphys, u64 egmlength, u64 egmpxm,
+ u64 retiredpagesphys)
{
struct nvgrace_egm_dev *egm_dev;
int ret;
@@ -86,6 +106,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
egm_dev->egmpxm = egmpxm;
egm_dev->egmphys = egmphys;
egm_dev->egmlength = egmlength;
+ egm_dev->retiredpagesphys = retiredpagesphys;
+
INIT_LIST_HEAD(&egm_dev->gpus);
egm_dev->aux_dev.id = egmpxm;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index 2e1612445898..2f329a05685d 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -16,8 +16,9 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys, u64 egmlength, u64 egmpxm);
+ u64 egmphys, u64 egmlength, u64 egmpxm,
+ u64 retiredpagesphys);
int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
- u64 *pegmlength);
+ u64 *pegmlength, u64 *pretiredpagesphys);
#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 0bb427cca31f..11bbecda1ad2 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -78,7 +78,7 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
struct nvgrace_egm_dev_entry *egm_entry = NULL;
- u64 egmphys, egmlength, egmpxm;
+ u64 egmphys, egmlength, egmpxm, retiredpagesphys;
int ret = 0;
bool is_new_region = false;
@@ -91,7 +91,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
- ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
+ ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength,
+ &retiredpagesphys);
if (ret)
goto exit;
@@ -114,7 +115,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
egm_entry->egm_dev =
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
- egmphys, egmlength, egmpxm);
+ egmphys, egmlength, egmpxm,
+ retiredpagesphys);
if (!egm_entry->egm_dev) {
ret = -EINVAL;
goto free_egm_entry;
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index b9956e7e5a0e..9e0d190c7da0 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -7,6 +7,7 @@
#define NVGRACE_EGM_H
#include <linux/auxiliary_bus.h>
+#include <linux/hashtable.h>
#define NVGRACE_EGM_DEV_NAME "egm"
#define EGM_OFFSET_SHIFT 40
@@ -20,6 +21,7 @@ struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
phys_addr_t egmphys;
size_t egmlength;
+ phys_addr_t retiredpagesphys;
u64 egmpxm;
struct list_head gpus;
};
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list
2026-02-23 15:55 ` [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
@ 2026-03-04 22:37 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 22:37 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:10 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> It is possible for some system memory pages on the EGM to
> have retired pages with uncorrectable ECC errors. A list of
> pages known with such errors (referred as retired pages) are
> maintained by the Host UEFI. The Host UEFI populates such list
> in a reserved region. It communicates the SPA of this region
> through a ACPI DSDT property.
>
> nvgrace-egm module is responsible to store the list of retired page
> offsets to be made available for usermode processes. The module:
> 1. Get the reserved memory region SPA and maps to it to fetch
> the list of bad pages.
> 2. Calculate the retired page offsets in the EGM and stores it.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 75 ++++++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 32 +++++++++--
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 5 +-
> drivers/vfio/pci/nvgrace-gpu/main.c | 8 +--
> include/linux/nvgrace-egm.h | 2 +
> 5 files changed, 112 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index de7771a4145d..077de3833046 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -8,6 +8,11 @@
>
> #define MAX_EGM_NODES 4
>
> +struct h_node {
> + unsigned long mem_offset;
> + struct hlist_node node;
> +};
> +
> static dev_t dev;
> static struct class *class;
> static DEFINE_XARRAY(egm_chardevs);
> @@ -16,6 +21,7 @@ struct chardev {
> struct device device;
> struct cdev cdev;
> atomic_t open_count;
> + DECLARE_HASHTABLE(htbl, 0x10);
> };
>
> static struct nvgrace_egm_dev *
> @@ -174,20 +180,88 @@ static void del_egm_chardev(struct chardev *egm_chardev)
> put_device(&egm_chardev->device);
> }
>
> +static void cleanup_retired_pages(struct chardev *egm_chardev)
> +{
> + struct h_node *cur_page;
> + unsigned long bkt;
> + struct hlist_node *temp_node;
> +
> + hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page, node) {
> + hash_del(&cur_page->node);
> + kvfree(cur_page);
> + }
> +}
> +
> +static int nvgrace_egm_fetch_retired_pages(struct nvgrace_egm_dev *egm_dev,
> + struct chardev *egm_chardev)
> +{
> + u64 count;
> + void *memaddr;
> + int index, ret = 0;
> +
> + memaddr = memremap(egm_dev->retiredpagesphys, PAGE_SIZE, MEMREMAP_WB);
We're reading some data structure in physical memory; how does that
data structure have any relation to the kernel PAGE_SIZE, which might
be 4K or 64K?
> + if (!memaddr)
> + return -ENOMEM;
> +
> + count = *(u64 *)memaddr;
So the first 8-bytes contains a page count.
> + if (count > PAGE_SIZE / sizeof(count))
> + return -EINVAL;
So if it's a 64K table and we're on a 4K host, this can unnecessarily
fail, or fail to incorporate the vast majority of pages.
Also the 0th index is the count itself, so there can only be 511
entries with a 4K page size, not 512. This is an off-by-one, and the
loop below can exceed the mapped range.
AI also tells me that the hash table is vastly oversized for containing
either 511 or 8191 entries.
Also we didn't memunmap on this error condition.
> +
> + for (index = 0; index < count; index++) {
> + struct h_node *retired_page;
> +
> + /*
> + * Since the EGM is linearly mapped, the offset in the
> + * carveout is the same offset in the VM system memory.
> + *
> + * Calculate the offset to communicate to the usermode
> + * apps.
> + */
> + retired_page = kzalloc(sizeof(*retired_page), GFP_KERNEL);
> + if (!retired_page) {
> + ret = -ENOMEM;
> + break;
> + }
> +
> + retired_page->mem_offset = *((u64 *)memaddr + index + 1) -
> + egm_dev->egmphys;
> + hash_add(egm_chardev->htbl, &retired_page->node,
> + retired_page->mem_offset);
> + }
> +
> + memunmap(memaddr);
> +
> + if (ret)
> + cleanup_retired_pages(egm_chardev);
> +
> + return ret;
> +}
> +
> static int egm_driver_probe(struct auxiliary_device *aux_dev,
> const struct auxiliary_device_id *id)
> {
> struct nvgrace_egm_dev *egm_dev =
> container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> struct chardev *egm_chardev;
> + int ret;
>
> egm_chardev = setup_egm_chardev(egm_dev);
> if (!egm_chardev)
> return -EINVAL;
>
> + hash_init(egm_chardev->htbl);
> +
> + ret = nvgrace_egm_fetch_retired_pages(egm_dev, egm_chardev);
> + if (ret)
> + goto error_exit;
> +
> xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
>
> return 0;
> +
> +error_exit:
> + del_egm_chardev(egm_chardev);
> + return ret;
> }
>
> static void egm_driver_remove(struct auxiliary_device *aux_dev)
> @@ -199,6 +273,7 @@ static void egm_driver_remove(struct auxiliary_device *aux_dev)
> if (!egm_chardev)
> return;
>
> + cleanup_retired_pages(egm_chardev);
> del_egm_chardev(egm_chardev);
> }
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index 20291504aca8..6d716c3a3257 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -18,22 +18,41 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> }
>
> int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> - u64 *pegmlength)
> + u64 *pegmlength, u64 *pretiredpagesphys)
> {
> int ret;
>
> /*
> - * The memory information is present in the system ACPI tables as DSD
> - * properties nvidia,egm-base-pa and nvidia,egm-size.
> + * The EGM memory information is present in the system ACPI tables
> + * as DSD properties nvidia,egm-base-pa and nvidia,egm-size.
> */
> ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
> pegmlength);
> if (ret)
> - return ret;
> + goto error_exit;
>
> ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
> pegmphys);
> + if (ret)
> + goto error_exit;
> +
> + /*
> + * SBIOS puts the list of retired pages on a region. The region
> + * SPA is exposed as "nvidia,egm-retired-pages-data-base".
> + */
> + ret = device_property_read_u64(&pdev->dev,
> + "nvidia,egm-retired-pages-data-base",
> + pretiredpagesphys);
> + if (ret)
> + goto error_exit;
> +
> + /* Catch firmware bug and avoid a crash */
> + if (*pretiredpagesphys == 0) {
> + dev_err(&pdev->dev, "Retired pages region is not setup\n");
> + ret = -EINVAL;
> + }
>
> +error_exit:
> return ret;
> }
>
> @@ -74,7 +93,8 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys, u64 egmlength, u64 egmpxm)
> + u64 egmphys, u64 egmlength, u64 egmpxm,
> + u64 retiredpagesphys)
> {
> struct nvgrace_egm_dev *egm_dev;
> int ret;
> @@ -86,6 +106,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> egm_dev->egmpxm = egmpxm;
> egm_dev->egmphys = egmphys;
> egm_dev->egmlength = egmlength;
> + egm_dev->retiredpagesphys = retiredpagesphys;
> +
> INIT_LIST_HEAD(&egm_dev->gpus);
>
> egm_dev->aux_dev.id = egmpxm;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index 2e1612445898..2f329a05685d 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -16,8 +16,9 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys, u64 egmlength, u64 egmpxm);
> + u64 egmphys, u64 egmlength, u64 egmpxm,
> + u64 retiredpagesphys);
>
> int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> - u64 *pegmlength);
> + u64 *pegmlength, u64 *pretiredpagesphys);
> #endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 0bb427cca31f..11bbecda1ad2 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -78,7 +78,7 @@ static struct list_head egm_dev_list;
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> struct nvgrace_egm_dev_entry *egm_entry = NULL;
> - u64 egmphys, egmlength, egmpxm;
> + u64 egmphys, egmlength, egmpxm, retiredpagesphys;
> int ret = 0;
> bool is_new_region = false;
>
> @@ -91,7 +91,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> - ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
> + ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength,
> + &retiredpagesphys);
> if (ret)
> goto exit;
>
> @@ -114,7 +115,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
>
> egm_entry->egm_dev =
> nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
> - egmphys, egmlength, egmpxm);
> + egmphys, egmlength, egmpxm,
> + retiredpagesphys);
> if (!egm_entry->egm_dev) {
> ret = -EINVAL;
> goto free_egm_entry;
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> index b9956e7e5a0e..9e0d190c7da0 100644
> --- a/include/linux/nvgrace-egm.h
> +++ b/include/linux/nvgrace-egm.h
> @@ -7,6 +7,7 @@
> #define NVGRACE_EGM_H
>
> #include <linux/auxiliary_bus.h>
> +#include <linux/hashtable.h>
This is implementation, it should be in the c file not the public
header. Thanks,
Alex
>
> #define NVGRACE_EGM_DEV_NAME "egm"
> #define EGM_OFFSET_SHIFT 40
> @@ -20,6 +21,7 @@ struct nvgrace_egm_dev {
> struct auxiliary_device aux_dev;
> phys_addr_t egmphys;
> size_t egmlength;
> + phys_addr_t retiredpagesphys;
> u64 egmpxm;
> struct list_head gpus;
> };
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (10 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 23:00 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs ankita
` (3 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The nvgrace-egm module stores the list of retired page offsets to be
made available to usermode processes. Introduce an ioctl to share the
information with userspace.
The ioctl is called by usermode apps such as QEMU to get the retired
page offsets. The usermode apps are expected to take appropriate action
to communicate the list to the VM.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 1 +
drivers/vfio/pci/nvgrace-gpu/egm.c | 67 ++++++++++++++++++++++++++++++
include/uapi/linux/egm.h | 28 +++++++++++++
3 files changed, 96 insertions(+)
create mode 100644 include/uapi/linux/egm.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 1fc551d7d667..94cf15a1e82c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27389,6 +27389,7 @@ M: Ankit Agrawal <ankita@nvidia.com>
L: kvm@vger.kernel.org
S: Supported
F: drivers/vfio/pci/nvgrace-gpu/egm.c
+F: include/uapi/linux/egm.h
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 077de3833046..918979d8fcd4 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -5,6 +5,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/nvgrace-egm.h>
+#include <linux/egm.h>
#define MAX_EGM_NODES 4
@@ -119,11 +120,77 @@ static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_page_prot);
}
+static long nvgrace_egm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ unsigned long minsz = offsetofend(struct egm_retired_pages_list, count);
+ struct egm_retired_pages_list info;
+ void __user *uarg = (void __user *)arg;
+ struct chardev *egm_chardev = file->private_data;
+
+ if (copy_from_user(&info, uarg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || !egm_chardev)
+ return -EINVAL;
+
+ switch (cmd) {
+ case EGM_RETIRED_PAGES_LIST:
+ int ret;
+ unsigned long retired_page_struct_size = sizeof(struct egm_retired_pages_info);
+ struct egm_retired_pages_info tmp;
+ struct h_node *cur_page;
+ struct hlist_node *tmp_node;
+ unsigned long bkt;
+ int count = 0, index = 0;
+
+ hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node)
+ count++;
+
+ if (info.argsz < (minsz + count * retired_page_struct_size)) {
+ info.argsz = minsz + count * retired_page_struct_size;
+ info.count = 0;
+ goto done;
+ } else {
+ hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node) {
+ /*
+ * This check fails if there was an ECC error
+ * after the usermode app read the count of
+ * bad pages through this ioctl.
+ */
+ if (minsz + index * retired_page_struct_size >= info.argsz) {
+ info.argsz = minsz + index * retired_page_struct_size;
+ info.count = index;
+ goto done;
+ }
+
+ tmp.offset = cur_page->mem_offset;
+ tmp.size = PAGE_SIZE;
+
+ ret = copy_to_user(uarg + minsz +
+ index * retired_page_struct_size,
+ &tmp, retired_page_struct_size);
+ if (ret)
+ return -EFAULT;
+ index++;
+ }
+
+ info.count = index;
+ }
+ break;
+ default:
+ return -EINVAL;
+ }
+
+done:
+ return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
+}
+
static const struct file_operations file_ops = {
.owner = THIS_MODULE,
.open = nvgrace_egm_open,
.release = nvgrace_egm_release,
.mmap = nvgrace_egm_mmap,
+ .unlocked_ioctl = nvgrace_egm_ioctl,
};
static void egm_chardev_release(struct device *dev)
diff --git a/include/uapi/linux/egm.h b/include/uapi/linux/egm.h
new file mode 100644
index 000000000000..4d3a2304d4f0
--- /dev/null
+++ b/include/uapi/linux/egm.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef _UAPI_LINUX_EGM_H
+#define _UAPI_LINUX_EGM_H
+
+#include <linux/types.h>
+
+#define EGM_TYPE ('E')
+
+struct egm_retired_pages_info {
+ __aligned_u64 offset;
+ __aligned_u64 size;
+};
+
+struct egm_retired_pages_list {
+ __u32 argsz;
+ /* out */
+ __u32 count;
+ /* out */
+ struct egm_retired_pages_info retired_pages[];
+};
+
+#define EGM_RETIRED_PAGES_LIST _IO(EGM_TYPE, 100)
+
+#endif /* _UAPI_LINUX_EGM_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages
2026-02-23 15:55 ` [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
@ 2026-03-04 23:00 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 23:00 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:11 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> nvgrace-egm module stores the list of retired page offsets to be made
> available for usermode processes. Introduce an ioctl to share the
> information with the userspace.
>
> The ioctl is called by usermode apps such as QEMU to get the retired
> page offsets. The usermode apps are expected to take appropriate action
> to communicate the list to the VM.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 1 +
> drivers/vfio/pci/nvgrace-gpu/egm.c | 67 ++++++++++++++++++++++++++++++
> include/uapi/linux/egm.h | 28 +++++++++++++
> 3 files changed, 96 insertions(+)
> create mode 100644 include/uapi/linux/egm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1fc551d7d667..94cf15a1e82c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27389,6 +27389,7 @@ M: Ankit Agrawal <ankita@nvidia.com>
> L: kvm@vger.kernel.org
> S: Supported
> F: drivers/vfio/pci/nvgrace-gpu/egm.c
> +F: include/uapi/linux/egm.h
>
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 077de3833046..918979d8fcd4 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -5,6 +5,7 @@
>
> #include <linux/vfio_pci_core.h>
> #include <linux/nvgrace-egm.h>
> +#include <linux/egm.h>
>
> #define MAX_EGM_NODES 4
>
> @@ -119,11 +120,77 @@ static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
> vma->vm_page_prot);
> }
>
> +static long nvgrace_egm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + unsigned long minsz = offsetofend(struct egm_retired_pages_list, count);
> + struct egm_retired_pages_list info;
> + void __user *uarg = (void __user *)arg;
> + struct chardev *egm_chardev = file->private_data;
> +
> + if (copy_from_user(&info, uarg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz || !egm_chardev)
> + return -EINVAL;
How could we get here with !egm_chardev?
> +
> + switch (cmd) {
> + case EGM_RETIRED_PAGES_LIST:
> + int ret;
> + unsigned long retired_page_struct_size = sizeof(struct egm_retired_pages_info);
> + struct egm_retired_pages_info tmp;
> + struct h_node *cur_page;
> + struct hlist_node *tmp_node;
> + unsigned long bkt;
> + int count = 0, index = 0;
No brackets for inline declarations. Ordering could be improved.
> +
> + hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node)
> + count++;
Why not keep track of the count as they're added?
Neither loop here needs the _safe variant since we're not removing
entries.
> +
> + if (info.argsz < (minsz + count * retired_page_struct_size)) {
> + info.argsz = minsz + count * retired_page_struct_size;
> + info.count = 0;
vfio returns success when there's not enough space, for compatibility
with new capabilities. For a new ioctl just set argsz and count and
return -ENOSPC.
> + goto done;
> + } else {
We don't need an else if the previous branch unconditionally goes
somewhere else.
> + hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node) {
> + /*
> + * This check fails if there was an ECC error
> + * after the usermode app read the count of
> + * bad pages through this ioctl.
> + */
> + if (minsz + index * retired_page_struct_size >= info.argsz) {
> + info.argsz = minsz + index * retired_page_struct_size;
> + info.count = index;
If only we had locking to prevent such races...
> + goto done;
> + }
> +
> + tmp.offset = cur_page->mem_offset;
> + tmp.size = PAGE_SIZE;
Is firmware recording 4K or 64K pages in this table?
The above comment alludes to runtime ECC faults; are those a different
page size from the granularity firmware reports in the table?
> +
> + ret = copy_to_user(uarg + minsz +
> + index * retired_page_struct_size,
> + &tmp, retired_page_struct_size);
> + if (ret)
> + return -EFAULT;
> + index++;
> + }
> +
> + info.count = index;
> + }
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> +done:
> + return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
> +}
> +
> static const struct file_operations file_ops = {
> .owner = THIS_MODULE,
> .open = nvgrace_egm_open,
> .release = nvgrace_egm_release,
> .mmap = nvgrace_egm_mmap,
> + .unlocked_ioctl = nvgrace_egm_ioctl,
> };
>
> static void egm_chardev_release(struct device *dev)
> diff --git a/include/uapi/linux/egm.h b/include/uapi/linux/egm.h
> new file mode 100644
> index 000000000000..4d3a2304d4f0
> --- /dev/null
> +++ b/include/uapi/linux/egm.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
2026
> + */
> +
> +#ifndef _UAPI_LINUX_EGM_H
> +#define _UAPI_LINUX_EGM_H
> +
> +#include <linux/types.h>
> +
> +#define EGM_TYPE ('E')
Arbitrarily chosen? Update ioctl-number.rst?
> +
> +struct egm_retired_pages_info {
> + __aligned_u64 offset;
> + __aligned_u64 size;
> +};
> +
> +struct egm_retired_pages_list {
> + __u32 argsz;
> + /* out */
> + __u32 count;
> + /* out */
> + struct egm_retired_pages_info retired_pages[];
> +};
I imagine you want some uapi description of this ioctl. Thanks,
Alex
> +
> +#define EGM_RETIRED_PAGES_LIST _IO(EGM_TYPE, 100)
> +
> +#endif /* _UAPI_LINUX_EGM_H */
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (11 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 23:22 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM ankita
` (2 subsequent siblings)
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
To allocate the EGM, userspace needs to know its size. Currently,
there is no easy way for userspace to determine that.
Make nvgrace-egm expose the size through sysfs that can be queried
by the userspace from <char_dev_path>/egm_size.
E.g. on a 2-socket system, it is present at
/sys/class/egm/egm4
/sys/class/egm/egm5
It also shows up at <aux_device path>/egm_size.
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4/egm/egm4/egm_size
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5/egm/egm5/egm_size
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 918979d8fcd4..2e4024c25e8a 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -365,6 +365,32 @@ static char *egm_devnode(const struct device *device, umode_t *mode)
return NULL;
}
+static ssize_t egm_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct chardev *egm_chardev = container_of(dev, struct chardev, device);
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+
+ return sysfs_emit(buf, "0x%lx\n", egm_dev->egmlength);
+}
+
+static DEVICE_ATTR_RO(egm_size);
+
+static struct attribute *attrs[] = {
+ &dev_attr_egm_size.attr,
+ NULL,
+};
+
+static struct attribute_group attr_group = {
+ .attrs = attrs,
+};
+
+static const struct attribute_group *attr_groups[2] = {
+ &attr_group,
+ NULL
+};
+
static int __init nvgrace_egm_init(void)
{
int ret;
@@ -386,6 +412,7 @@ static int __init nvgrace_egm_init(void)
}
class->devnode = egm_devnode;
+ class->dev_groups = attr_groups;
ret = auxiliary_driver_register(&egm_driver);
if (!ret)
--
2.34.1
* Re: [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs
2026-02-23 15:55 ` [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs ankita
@ 2026-03-04 23:22 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 23:22 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:12 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> To allocate the EGM, userspace needs to know its size. Currently,
> there is no easy way for userspace to determine that.
>
> Make nvgrace-egm expose the size through sysfs that can be queried
> by the userspace from <char_dev_path>/egm_size.
> E.g. on a 2-socket system, it is present at
> /sys/class/egm/egm4
> /sys/class/egm/egm5
>
> It also shows up at <aux_device path>/egm_size.
> /sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4/egm/egm4/egm_size
> /sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5/egm/egm5/egm_size
>
But we like to de-privilege QEMU and even pass file handles; how does
QEMU know the EGM size without trawling through sysfs? It seems there
needs to be a primary interface to learn the size through the chardev.
Also no Documentation/ABI/testing/ entry.
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 918979d8fcd4..2e4024c25e8a 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -365,6 +365,32 @@ static char *egm_devnode(const struct device *device, umode_t *mode)
> return NULL;
> }
>
> +static ssize_t egm_size_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + struct chardev *egm_chardev = container_of(dev, struct chardev, device);
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> +
> + return sysfs_emit(buf, "0x%lx\n", egm_dev->egmlength);
Should the size be in decimal, %zu?
> +}
> +
> +static DEVICE_ATTR_RO(egm_size);
> +
> +static struct attribute *attrs[] = {
> + &dev_attr_egm_size.attr,
> + NULL,
> +};
> +
> +static struct attribute_group attr_group = {
> + .attrs = attrs,
> +};
const?
> +
> +static const struct attribute_group *attr_groups[2] = {
No need to explicitly size the array, []. Thanks,
Alex
> + &attr_group,
> + NULL
> +};
> +
> static int __init nvgrace_egm_init(void)
> {
> int ret;
> @@ -386,6 +412,7 @@ static int __init nvgrace_egm_init(void)
> }
>
> class->devnode = egm_devnode;
> + class->dev_groups = attr_groups;
>
> ret = auxiliary_driver_register(&egm_driver);
> if (!ret)
* [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (12 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 23:37 ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure ankita
2026-03-05 17:33 ` [PATCH RFC v2 00/15] Add virtualization support for EGM Alex Williamson
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
To replicate the host EGM topology in the VM in terms of
the GPU affinity, userspace needs to be aware of which
GPUs belong to the same socket as the EGM region.
Expose the list of GPUs associated with an EGM region
through sysfs. The list can be queried from the auxiliary
device path.
On a 2-socket, 4 GPU Grace Blackwell setup, the GPUs show
up at /sys/class/egm/egmX.
E.g. ls /sys/class/egm/egm4/
0008:01:00.0 0009:01:00.0 dev device egm_size power subsystem uevent
Suggested-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 47 +++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index 6d716c3a3257..3bdd5bb41e1b 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -56,6 +56,50 @@ int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
return ret;
}
+static struct device *egm_find_chardev(struct nvgrace_egm_dev *egm_dev)
+{
+ char name[32] = { 0 };
+
+ scnprintf(name, sizeof(name), "egm%lld", egm_dev->egmpxm);
+ return device_find_child_by_name(&egm_dev->aux_dev.dev, name);
+}
+
+static int nvgrace_egm_create_gpu_links(struct nvgrace_egm_dev *egm_dev,
+ struct pci_dev *pdev)
+{
+ struct device *chardev_dev = egm_find_chardev(egm_dev);
+ int ret;
+
+ if (!chardev_dev)
+ return 0;
+
+ ret = sysfs_create_link(&chardev_dev->kobj,
+ &pdev->dev.kobj,
+ dev_name(&pdev->dev));
+
+ put_device(chardev_dev);
+
+ if (ret && ret != -EEXIST)
+ return ret;
+
+ return 0;
+}
+
+static void remove_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
+ struct pci_dev *pdev)
+{
+ struct device *chardev_dev;
+
+ chardev_dev = egm_find_chardev(egm_dev);
+ if (!chardev_dev)
+ return;
+
+ sysfs_remove_link(&chardev_dev->kobj,
+ dev_name(&pdev->dev));
+
+ put_device(chardev_dev);
+}
+
int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
{
struct gpu_node *node;
@@ -68,7 +112,7 @@ int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
list_add_tail(&node->list, &egm_dev->gpus);
- return 0;
+ return nvgrace_egm_create_gpu_links(egm_dev, pdev);
}
void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
@@ -77,6 +121,7 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
if (node->pdev == pdev) {
+ remove_egm_symlinks(egm_dev, pdev);
list_del(&node->list);
kfree(node);
}
--
2.34.1
* Re: [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM
2026-02-23 15:55 ` [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM ankita
@ 2026-03-04 23:37 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 23:37 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:13 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> To replicate the host EGM topology in the VM in terms of
> the GPU affinity, userspace needs to be aware of which
> GPUs belong to the same socket as the EGM region.
>
> Expose the list of GPUs associated with an EGM region
> through sysfs. The list can be queried from the auxiliary
> device path.
>
> On a 2-socket, 4 GPU Grace Blackwell setup, the GPUs show
> up at /sys/class/egm/egmX.
>
> E.g. ls /sys/class/egm/egm4/
If we end up with a sysfs representation of the EGM device, why did we
go to the trouble of naming them based on their PXM?
Shouldn't we just have a node association in sysfs rather than the GPUs?
AIUI, the PXM value doesn't necessarily align to the kernel's node
index anyway, so what is the value of exposing the PXM? If the node
association is learned through sysfs, we could just use an ida for
assigning minors and avoid the address space problem of PXM values
aligning to reserved minor numbers.
> 0008:01:00.0 0009:01:00.0 dev device egm_size power subsystem uevent
>
> Suggested-by: Matthew R. Ochs <mochs@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 47 +++++++++++++++++++++++++-
> 1 file changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index 6d716c3a3257..3bdd5bb41e1b 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -56,6 +56,50 @@ int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> return ret;
> }
>
> +static struct device *egm_find_chardev(struct nvgrace_egm_dev *egm_dev)
> +{
> + char name[32] = { 0 };
> +
> + scnprintf(name, sizeof(name), "egm%lld", egm_dev->egmpxm);
%llu
> + return device_find_child_by_name(&egm_dev->aux_dev.dev, name);
> +}
> +
> +static int nvgrace_egm_create_gpu_links(struct nvgrace_egm_dev *egm_dev,
> + struct pci_dev *pdev)
> +{
> + struct device *chardev_dev = egm_find_chardev(egm_dev);
> + int ret;
> +
> + if (!chardev_dev)
> + return 0;
> +
> + ret = sysfs_create_link(&chardev_dev->kobj,
> + &pdev->dev.kobj,
> + dev_name(&pdev->dev));
> +
> + put_device(chardev_dev);
> +
> + if (ret && ret != -EEXIST)
> + return ret;
> +
> + return 0;
> +}
> +
> +static void remove_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
> + struct pci_dev *pdev)
> +{
> + struct device *chardev_dev;
> +
> + chardev_dev = egm_find_chardev(egm_dev);
> + if (!chardev_dev)
> + return;
> +
> + sysfs_remove_link(&chardev_dev->kobj,
> + dev_name(&pdev->dev));
> +
> + put_device(chardev_dev);
> +}
> +
> int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> {
> struct gpu_node *node;
> @@ -68,7 +112,7 @@ int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
>
> list_add_tail(&node->list, &egm_dev->gpus);
>
> - return 0;
> + return nvgrace_egm_create_gpu_links(egm_dev, pdev);
> }
>
> void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> @@ -77,6 +121,7 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
>
> list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
> if (node->pdev == pdev) {
> + remove_egm_symlinks(egm_dev, pdev);
> list_del(&node->list);
> kfree(node);
> }
This is really broken layering for nvgrace-gpu to be adding sysfs
attributes to the chardev devices. Thanks,
Alex
* [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (13 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM ankita
@ 2026-02-23 15:55 ` ankita
2026-03-04 23:48 ` Alex Williamson
2026-03-05 17:33 ` [PATCH RFC v2 00/15] Add virtualization support for EGM Alex Williamson
15 siblings, 1 reply; 42+ messages in thread
From: ankita @ 2026-02-23 15:55 UTC (permalink / raw)
To: ankita, vsethi, jgg, mochs, jgg, skolothumtho, alex
Cc: cjia, zhiw, kjaju, yishaih, kevin.tian, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
EGM carveout memory is mapped directly into userspace (QEMU) and is not
added to the kernel. It is not managed by the kernel page allocator and
has no struct pages. The module can thus utilize the Linux memory manager's
memory_failure mechanism for regions with no struct pages. The Linux MM
code exposes register/unregister APIs allowing modules to register such
memory regions for memory_failure handling.
Register the EGM PFN range with the MM memory_failure infrastructure on
open, and unregister it on the last close. Provide a PFN-to-VMA offset
callback that validates the PFN is within the EGM region and the VMA,
then converts it to a file offset and records the poisoned offset in the
existing hashtable for reporting to userspace.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 100 +++++++++++++++++++++++++++++
1 file changed, 100 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 2e4024c25e8a..5b60db6294a8 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -6,6 +6,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/nvgrace-egm.h>
#include <linux/egm.h>
+#include <linux/memory-failure.h>
#define MAX_EGM_NODES 4
@@ -23,6 +24,7 @@ struct chardev {
struct cdev cdev;
atomic_t open_count;
DECLARE_HASHTABLE(htbl, 0x10);
+ struct pfn_address_space pfn_address_space;
};
static struct nvgrace_egm_dev *
@@ -34,6 +36,94 @@ egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
}
+static int pfn_memregion_offset(struct chardev *egm_chardev,
+ unsigned long pfn,
+ pgoff_t *pfn_offset_in_region)
+{
+ unsigned long start_pfn, num_pages;
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+
+ start_pfn = PHYS_PFN(egm_dev->egmphys);
+ num_pages = egm_dev->egmlength >> PAGE_SHIFT;
+
+ if (pfn < start_pfn || pfn >= start_pfn + num_pages)
+ return -EFAULT;
+
+ *pfn_offset_in_region = pfn - start_pfn;
+
+ return 0;
+}
+
+static int track_ecc_offset(struct chardev *egm_chardev,
+ unsigned long mem_offset)
+{
+ struct h_node *cur_page, *ecc_page;
+
+ hash_for_each_possible(egm_chardev->htbl, cur_page, node, mem_offset) {
+ if (cur_page->mem_offset == mem_offset)
+ return 0;
+ }
+
+ ecc_page = kzalloc(sizeof(*ecc_page), GFP_NOFS);
+ if (!ecc_page)
+ return -ENOMEM;
+
+ ecc_page->mem_offset = mem_offset;
+
+ hash_add(egm_chardev->htbl, &ecc_page->node, ecc_page->mem_offset);
+
+ return 0;
+}
+
+static int nvgrace_egm_pfn_to_vma_pgoff(struct vm_area_struct *vma,
+ unsigned long pfn,
+ pgoff_t *pgoff)
+{
+ struct chardev *egm_chardev = vma->vm_file->private_data;
+ pgoff_t vma_offset_in_region = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ pgoff_t pfn_offset_in_region;
+ int ret;
+
+ ret = pfn_memregion_offset(egm_chardev, pfn, &pfn_offset_in_region);
+ if (ret)
+ return ret;
+
+ /* Ensure PFN is not before VMA's start within the region */
+ if (pfn_offset_in_region < vma_offset_in_region)
+ return -EFAULT;
+
+ /* Calculate offset from VMA start */
+ *pgoff = vma->vm_pgoff +
+ (pfn_offset_in_region - vma_offset_in_region);
+
+ /* Track and save the poisoned offset */
+ return track_ecc_offset(egm_chardev, *pgoff << PAGE_SHIFT);
+}
+
+static int
+nvgrace_egm_vfio_pci_register_pfn_range(struct inode *inode,
+ struct chardev *egm_chardev)
+{
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+ unsigned long pfn, nr_pages;
+ int ret;
+
+ pfn = PHYS_PFN(egm_dev->egmphys);
+ nr_pages = egm_dev->egmlength >> PAGE_SHIFT;
+
+ egm_chardev->pfn_address_space.node.start = pfn;
+ egm_chardev->pfn_address_space.node.last = pfn + nr_pages - 1;
+ egm_chardev->pfn_address_space.mapping = inode->i_mapping;
+ egm_chardev->pfn_address_space.pfn_to_vma_pgoff = nvgrace_egm_pfn_to_vma_pgoff;
+
+ ret = register_pfn_address_space(&egm_chardev->pfn_address_space);
+
+ return ret;
+}
+
static int nvgrace_egm_open(struct inode *inode, struct file *file)
{
struct chardev *egm_chardev =
@@ -41,6 +131,7 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
struct nvgrace_egm_dev *egm_dev =
egm_chardev_to_nvgrace_egm_dev(egm_chardev);
void *memaddr;
+ int ret;
if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
return -EBUSY;
@@ -77,6 +168,13 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
file->private_data = egm_chardev;
+ ret = nvgrace_egm_vfio_pci_register_pfn_range(inode, egm_chardev);
+ if (ret && ret != -EOPNOTSUPP) {
+ file->private_data = NULL;
+ atomic_dec(&egm_chardev->open_count);
+ return ret;
+ }
+
return 0;
}
@@ -85,6 +183,8 @@ static int nvgrace_egm_release(struct inode *inode, struct file *file)
struct chardev *egm_chardev =
container_of(inode->i_cdev, struct chardev, cdev);
+ unregister_pfn_address_space(&egm_chardev->pfn_address_space);
+
file->private_data = NULL;
atomic_dec(&egm_chardev->open_count);
--
2.34.1
* Re: [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure
2026-02-23 15:55 ` [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure ankita
@ 2026-03-04 23:48 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-04 23:48 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:55:14 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> EGM carveout memory is mapped directly into userspace (QEMU) and is not
> added to the kernel. It is not managed by the kernel page allocator and
> has no struct pages. The module can thus utilize the Linux memory manager's
> memory_failure mechanism for regions with no struct pages. The Linux MM
> code exposes register/unregister APIs allowing modules to register such
> memory regions for memory_failure handling.
>
> Register the EGM PFN range with the MM memory_failure infrastructure on
> open, and unregister it on the last close. Provide a PFN-to-VMA offset
> callback that validates the PFN is within the EGM region and the VMA,
> then converts it to a file offset and records the poisoned offset in the
> existing hashtable for reporting to userspace.
So the idea is that we kill the process owning the VMA and add the page
to the hash such that the next user process avoids it, and this is what
encourages userspace to consume the bad page list?
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 100 +++++++++++++++++++++++++++++
> 1 file changed, 100 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 2e4024c25e8a..5b60db6294a8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -6,6 +6,7 @@
> #include <linux/vfio_pci_core.h>
> #include <linux/nvgrace-egm.h>
> #include <linux/egm.h>
> +#include <linux/memory-failure.h>
>
> #define MAX_EGM_NODES 4
>
> @@ -23,6 +24,7 @@ struct chardev {
> struct cdev cdev;
> atomic_t open_count;
> DECLARE_HASHTABLE(htbl, 0x10);
> + struct pfn_address_space pfn_address_space;
> };
>
> static struct nvgrace_egm_dev *
> @@ -34,6 +36,94 @@ egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
> return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> }
>
> +static int pfn_memregion_offset(struct chardev *egm_chardev,
> + unsigned long pfn,
> + pgoff_t *pfn_offset_in_region)
> +{
> + unsigned long start_pfn, num_pages;
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> +
> + start_pfn = PHYS_PFN(egm_dev->egmphys);
> + num_pages = egm_dev->egmlength >> PAGE_SHIFT;
> +
> + if (pfn < start_pfn || pfn >= start_pfn + num_pages)
> + return -EFAULT;
> +
> + *pfn_offset_in_region = pfn - start_pfn;
> +
> + return 0;
> +}
> +
> +static int track_ecc_offset(struct chardev *egm_chardev,
> + unsigned long mem_offset)
> +{
> + struct h_node *cur_page, *ecc_page;
> +
> + hash_for_each_possible(egm_chardev->htbl, cur_page, node, mem_offset) {
> + if (cur_page->mem_offset == mem_offset)
> + return 0;
> + }
> +
> + ecc_page = kzalloc(sizeof(*ecc_page), GFP_NOFS);
> + if (!ecc_page)
> + return -ENOMEM;
> +
> + ecc_page->mem_offset = mem_offset;
> +
> + hash_add(egm_chardev->htbl, &ecc_page->node, ecc_page->mem_offset);
> +
> + return 0;
> +}
How do concurrent faults work? No locking on the hash table.
> +
> +static int nvgrace_egm_pfn_to_vma_pgoff(struct vm_area_struct *vma,
> + unsigned long pfn,
> + pgoff_t *pgoff)
> +{
> + struct chardev *egm_chardev = vma->vm_file->private_data;
> + pgoff_t vma_offset_in_region = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> + pgoff_t pfn_offset_in_region;
> + int ret;
> +
> + ret = pfn_memregion_offset(egm_chardev, pfn, &pfn_offset_in_region);
> + if (ret)
> + return ret;
> +
> + /* Ensure PFN is not before VMA's start within the region */
> + if (pfn_offset_in_region < vma_offset_in_region)
> + return -EFAULT;
> +
> + /* Calculate offset from VMA start */
> + *pgoff = vma->vm_pgoff +
> + (pfn_offset_in_region - vma_offset_in_region);
> +
> + /* Track and save the poisoned offset */
> + return track_ecc_offset(egm_chardev, *pgoff << PAGE_SHIFT);
> +}
> +
> +static int
> +nvgrace_egm_vfio_pci_register_pfn_range(struct inode *inode,
> + struct chardev *egm_chardev)
What does this have to do with vfio-pci? It's not even a device
address space. Thanks,
Alex
> +{
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + unsigned long pfn, nr_pages;
> + int ret;
> +
> + pfn = PHYS_PFN(egm_dev->egmphys);
> + nr_pages = egm_dev->egmlength >> PAGE_SHIFT;
> +
> + egm_chardev->pfn_address_space.node.start = pfn;
> + egm_chardev->pfn_address_space.node.last = pfn + nr_pages - 1;
> + egm_chardev->pfn_address_space.mapping = inode->i_mapping;
> + egm_chardev->pfn_address_space.pfn_to_vma_pgoff = nvgrace_egm_pfn_to_vma_pgoff;
> +
> + ret = register_pfn_address_space(&egm_chardev->pfn_address_space);
> +
> + return ret;
> +}
> +
> static int nvgrace_egm_open(struct inode *inode, struct file *file)
> {
> struct chardev *egm_chardev =
> @@ -41,6 +131,7 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
> struct nvgrace_egm_dev *egm_dev =
> egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> void *memaddr;
> + int ret;
>
> if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
> return -EBUSY;
> @@ -77,6 +168,13 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
>
> file->private_data = egm_chardev;
>
> + ret = nvgrace_egm_vfio_pci_register_pfn_range(inode, egm_chardev);
> + if (ret && ret != -EOPNOTSUPP) {
> + file->private_data = NULL;
> + atomic_dec(&egm_chardev->open_count);
> + return ret;
> + }
> +
> return 0;
> }
>
> @@ -85,6 +183,8 @@ static int nvgrace_egm_release(struct inode *inode, struct file *file)
> struct chardev *egm_chardev =
> container_of(inode->i_cdev, struct chardev, cdev);
>
> + unregister_pfn_address_space(&egm_chardev->pfn_address_space);
> +
> file->private_data = NULL;
>
> atomic_dec(&egm_chardev->open_count);
* Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
` (14 preceding siblings ...)
2026-02-23 15:55 ` [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure ankita
@ 2026-03-05 17:33 ` Alex Williamson
2026-03-11 6:47 ` Ankit Agrawal
15 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2026-03-05 17:33 UTC (permalink / raw)
To: ankita
Cc: vsethi, jgg, mochs, jgg, skolothumtho, cjia, zhiw, kjaju, yishaih,
kevin.tian, kvm, linux-kernel, alex
On Mon, 23 Feb 2026 15:54:59 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Background
> ----------
> Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
> feature that enable the GPU to access the system memory allocations
> within and across nodes through high bandwidth path. This access path
> goes as: GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize the
> system memory located on the same socket or from a different socket
> or even on a different node in a multi-node system [1]. This feature is
> being extended to virtualization.
>
>
> Design Details
> --------------
> When EGM is enabled in the virtualization stack, the host memory
> is partitioned into two parts: one partition for the Host OS usage
> called Hypervisor region, and a second Hypervisor-Invisible (HI) region
> for the VM. Only the hypervisor region is part of the host EFI map
> and is thus visible to the host OS on bootup. Since the entire VM
> sysmem is eligible for EGM allocations within the VM, the HI partition
> is interchangeably called the EGM region in the series. This HI/EGM
> region's base SPA and size are exposed through the ACPI DSDT properties.
>
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
>
> The following figure shows the memory map in the virtualization
> environment.
>
> |---- Sysmem ----| |--- GPU mem ---| VM Memory
> | | | |
> |IPA <-> SPA map | |IPA <-> SPA map|
> | | | |
> |--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
>
> The patch series introduces a new nvgrace-egm auxiliary driver module
> to manage and map the HI/EGM region in the Grace Blackwell systems.
> This binds to the auxiliary device created by the parent
> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> (out-of-tree open source module for SRIOV vGPU) to manage the
> EGM region for the VM. Note that there is a unique EGM region per
> socket and the auxiliary device gets created for every region. The
> parent module fetches the EGM region information from the ACPI
> tables and populates the data structures shared with the auxiliary
> nvgrace-egm module.
>
> nvgrace-egm module handles the following:
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the parent device shared EGM region structure.
> 2. Create a char device that can be used as memory-backend-file by Qemu
> for the VM and implement file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM being mapped, fetched in 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
> 5. Clean up state and destroy the chardev on device unbind.
> 6. Handle presence of retired poisoned pages on the EGM region.
>
> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
> in the same directory.
Pondering this series for a bit, is this auxiliary chardev approach
really the model we should be pursuing?
I know we're trying to disassociate the EGM region from the GPU, and
de-duplicate it between GPUs on the same socket, but is there actually a
use case of the EGM chardev separate from the GPU?
The independent lifecycle of this aux device is troubling and it hasn't
been confirmed whether or not access to the EGM region has some
dependency on the state of the GPU. nvgrace-gpu is manipulating sysfs
on devices owned by nvgrace-egm, we don't have mechanisms to manage the
aux device relative to the state of the GPU, we're trying to add a
driver that can bind to a device created by an out-of-tree driver, and
we're inventing new uAPIs on the chardev for things that already exist
for vfio regions.
Therefore, does it actually make more sense to expose EGM as a device
specific region on the vfio device fd?
For example, nvgrace-gpu might manage the de-duplication by only
exposing this device specific region on the lowest BDF GPU per socket.
The existing REGION_INFO ioctl handles reporting the size to the user.
The direct association to the GPU device handles reporting the node
locality. If necessary, a capability on the region could report the
associated PXM, and maybe even the retired page list.
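As a rough userspace sketch of how such a region capability could be consumed: the capability id and PXM payload below are hypothetical, not existing vfio uAPI; only the chain-walking shape mirrors `struct vfio_info_cap_header`, where `next` offsets are relative to the start of the info buffer returned by `VFIO_DEVICE_GET_REGION_INFO`.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct vfio_info_cap_header from <linux/vfio.h>. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next capability in the buffer, 0 ends the chain */
};

/* Hypothetical capability id and payload for the PXM report suggested above. */
#define CAP_EGM_PXM 9

struct cap_egm_pxm {
	struct cap_header header;
	uint32_t pxm;
};

/*
 * Walk a vfio-style capability chain inside buf, starting at cap_offset,
 * and return the capability with the given id, or NULL if absent.
 */
static const struct cap_header *
find_cap(const void *buf, size_t len, uint32_t cap_offset, uint16_t id)
{
	while (cap_offset && cap_offset + sizeof(struct cap_header) <= len) {
		const struct cap_header *hdr =
			(const struct cap_header *)((const char *)buf + cap_offset);

		if (hdr->id == id)
			return hdr;
		cap_offset = hdr->next;
	}
	return NULL;
}
```

Userspace would then read the `pxm` field out of the returned capability to place the region on the right VM NUMA node.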
All of the lifecycle issues are automatically handled, there's no
separate aux device. If necessary, zapping and faulting across reset
is handled just like a BAR mapping.
If we need to expose the EGM size and GPU association via sysfs for
management tooling, nvgrace-gpu could add an "egm_size" attribute to the
PCI device's sysfs node. This could also avoid the implicit
implementation knowledge about which GPU exposes the EGM device
specific region.
Was such a design considered? It seems much, much simpler and could be
implemented by either nvgrace-gpu or identically by an out-of-tree
driver without references in an in-kernel ID table.
I'd like to understand the pros and cons of such an approach vs the one
presented here. Thanks,
Alex
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
2026-03-05 17:33 ` [PATCH RFC v2 00/15] Add virtualization support for EGM Alex Williamson
@ 2026-03-11 6:47 ` Ankit Agrawal
2026-03-11 20:37 ` Alex Williamson
0 siblings, 1 reply; 42+ messages in thread
From: Ankit Agrawal @ 2026-03-11 6:47 UTC (permalink / raw)
To: Alex Williamson
Cc: Vikram Sethi, Jason Gunthorpe, Matt Ochs, jgg@ziepe.ca,
Shameer Kolothum Thodi, Neo Jia, Zhi Wang, Krishnakant Jaju,
Yishai Hadas, kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
Thanks Alex for the review.
>> The patch series introduce a new nvgrace-egm auxiliary driver module
>> to manage and map the HI/EGM region in the Grace Blackwell systems.
>> This binds to the auxiliary device created by the parent
>> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
>> (out-of-tree open source module for SRIOV vGPU) to manage the
>> EGM region for the VM. Note that there is a unique EGM region per
>> socket and the auxiliary device gets created for every region. The
>> parent module fetches the EGM region information from the ACPI
>> tables and populate to the data structures shared with the auxiliary
>> nvgrace-egm module.
>>
>> nvgrace-egm module handles the following:
>> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
>> from the parent device shared EGM region structure.
>> 2. Create a char device that can be used as memory-backend-file by Qemu
>> for the VM and implement file operations. The char device is /dev/egmX,
>> where X is the PXM node ID of the EGM being mapped fetched in 1.
>> 3. Zero the EGM memory on first device open().
>> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
>> 5. Cleaning up state and destroying the chardev on device unbind.
>> 6. Handle presence of retired poisoned pages on the EGM region.
>>
>> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
>> in the same directory.
>
>
> Pondering this series for a bit, is this auxiliary chardev approach
> really the model we should be pursuing?
>
> I know we're trying to disassociate the EGM region from the GPU, and
> de-duplicate it between GPUs on the same socket, but is there actually a
> use case of the EGM chardev separate from the GPU?
It is not just de-duplication. The EGM is a carveout of system memory,
logically and physically separate and disconnected from the GPU. What is
unique here is that the region's information (SPA, size) is present in
the GPU's ACPI tables.
>
> The independent lifecycle of this aux device is troubling and it hasn't
> been confirmed whether or not access to the EGM region has some
> dependency on the state of the GPU.
The EGM region is independent of the state of the GPU. One can plausibly
boot up the VM with just the EGM memory chardev as the backend file and
no GPU.
> nvgrace-gpu is manipulating sysfs
> on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> aux device relative to the state of the GPU, we're trying to add a
> driver that can bind to device created by an out-of-tree driver, and
> we're inventing new uAPIs on the chardev for things that already exist
> for vfio regions.
Sorry for the confusion. nvgrace-egm would not bind to the device
created by the out-of-tree driver. We would have a separate out-of-tree
equivalent of nvgrace-egm to bind to the device created by the out-of-tree
vfio driver. Maybe we can consider exposing register/unregister APIs from
nvgrace-egm, where a module (in-tree nvgrace-gpu / out-of-tree) can register
a pdev, which nvgrace-egm can use to fetch the region info.
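To make that register/unregister idea concrete, here is a minimal userspace model of such an interface. All names are hypothetical; a real implementation would live in nvgrace-egm, take a `struct pci_dev`, and use proper locking.

```c
#include <stdio.h>

#define MAX_EGM_REGIONS 8

/* One EGM carveout per socket, keyed by ACPI proximity domain. */
struct egm_region {
	unsigned long base_hpa;
	unsigned long length;
	int pxm;
	int refs;	/* number of parent GPUs registered, 0 = free slot */
};

static struct egm_region egm_table[MAX_EGM_REGIONS];

/*
 * A parent module (in-tree nvgrace-gpu or an out-of-tree driver) registers
 * the EGM region it parsed from the GPU's ACPI tables. A second
 * registration for the same PXM (the other GPU on the socket) just takes a
 * reference, which gives the per-socket de-duplication for free.
 */
static int egm_register_region(unsigned long base, unsigned long len, int pxm)
{
	struct egm_region *free_slot = NULL;
	int i;

	for (i = 0; i < MAX_EGM_REGIONS; i++) {
		if (egm_table[i].refs && egm_table[i].pxm == pxm) {
			egm_table[i].refs++;
			return 0;
		}
		if (!egm_table[i].refs && !free_slot)
			free_slot = &egm_table[i];
	}
	if (!free_slot)
		return -1;
	*free_slot = (struct egm_region){ base, len, pxm, 1 };
	/* A real driver would create the aux device / /dev/egm<pxm> here. */
	return 0;
}

/* Drop a parent's reference; the region goes away with the last parent. */
static void egm_unregister_region(int pxm)
{
	int i;

	for (i = 0; i < MAX_EGM_REGIONS; i++)
		if (egm_table[i].refs && egm_table[i].pxm == pxm) {
			egm_table[i].refs--;
			return;
		}
}

static const struct egm_region *egm_lookup(int pxm)
{
	int i;

	for (i = 0; i < MAX_EGM_REGIONS; i++)
		if (egm_table[i].refs && egm_table[i].pxm == pxm)
			return &egm_table[i];
	return NULL;
}
```

The refcount also answers the lifecycle question below: the chardev exists exactly as long as at least one parent GPU driver holds a registration.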
> Therefore, does it actually make more sense to expose EGM as a device
> specific region on the vfio device fd?
>
> For example, nvgrace-gpu might manage the de-duplication by only
> exposing this device specific region on the lowest BDF GPU per socket.
> The existing REGION_INFO ioctl handles reporting the size to the user.
> The direct association to the GPU device handles reporting the node
> locality. If necessary, a capability on the region could report the
> associated PXM, and maybe even the retired page list.
>
> All of the lifecycle issues are automatically handled, there's no
> separate aux device. If necessary, zapping and faulting across reset
> is handled just like a BAR mapping.
The EGM memory (which becomes the system memory of the VM) cannot
be tied to GPU reset, as it is unrelated to the GPU device. We would
not want system memory to be affected by a GPU reset.
> If we need to expose the EGM size and GPU association via sysfs for
> management tooling, nvgrace-gpu could add an "egm_size" attribute to the
> PCI device's sysfs node. This could also avoid the implicit
> implementation knowledge about which GPU exposes the EGM device
> specific region.
>
> Was such a design considered? It seems much, much simpler and could be
> implemented by either nvgrace-gpu or identically by an out-of-tree
> driver without references in an in-kernel ID table.
>
> I'd like to understand the pros and cons of such an approach vs the one
> presented here. Thanks,
We didn't consider it as a separate BAR / region because the EGM memory (part of
the system memory) is unrelated to the GPU device, besides having its
information in the GPU ACPI table, and it becomes the system memory of the VM.
Exposing it as part of the device's BAR / region would tie the lifecycle of the
EGM region to the GPU device. Also, we cannot consider zapping/faulting across
GPU reset as it is system memory of the VM.
Thanks
Ankit Agrawal
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
2026-03-11 6:47 ` Ankit Agrawal
@ 2026-03-11 20:37 ` Alex Williamson
2026-03-12 13:51 ` Ankit Agrawal
0 siblings, 1 reply; 42+ messages in thread
From: Alex Williamson @ 2026-03-11 20:37 UTC (permalink / raw)
To: Ankit Agrawal, Jason Gunthorpe
Cc: Vikram Sethi, Matt Ochs, jgg@ziepe.ca, Shameer Kolothum Thodi,
Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, alex
On Wed, 11 Mar 2026 06:47:12 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:
> Thanks Alex for the review.
>
> >> The patch series introduce a new nvgrace-egm auxiliary driver module
> >> to manage and map the HI/EGM region in the Grace Blackwell systems.
> >> This binds to the auxiliary device created by the parent
> >> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> >> (out-of-tree open source module for SRIOV vGPU) to manage the
> >> EGM region for the VM. Note that there is a unique EGM region per
> >> socket and the auxiliary device gets created for every region. The
> >> parent module fetches the EGM region information from the ACPI
> >> tables and populate to the data structures shared with the auxiliary
> >> nvgrace-egm module.
> >>
> >> nvgrace-egm module handles the following:
> >> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> >> from the parent device shared EGM region structure.
> >> 2. Create a char device that can be used as memory-backend-file by Qemu
> >> for the VM and implement file operations. The char device is /dev/egmX,
> >> where X is the PXM node ID of the EGM being mapped fetched in 1.
> >> 3. Zero the EGM memory on first device open().
> >> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
> >> 5. Cleaning up state and destroying the chardev on device unbind.
> >> 6. Handle presence of retired poisoned pages on the EGM region.
> >>
> >> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
> >> in the same directory.
> >
> >
> > Pondering this series for a bit, is this auxiliary chardev approach
> > really the model we should be pursuing?
> >
> > I know we're trying to disassociate the EGM region from the GPU, and
> > de-duplicate it between GPUs on the same socket, but is there actually a
> > use case of the EGM chardev separate from the GPU?
>
> It is not just de-duplication. The EGM is a carveout of system memory
> logically and physically separate and disconnected from the GPU. The
> uniqueness here is that the information (SPA, size) of the region is present
> on the GPU ACPI tables.
>
> >
> > The independent lifecycle of this aux device is troubling and it hasn't
> > been confirmed whether or not access to the EGM region has some
> > dependency on the state of the GPU.
>
> The EGM region is independent on the state of the GPU. One can plausibly
> bootup the VM with just the EGM memory chardev as the backend file and
> no GPU.
Seems like we have the wrong model then to base the lifecycle of the
aux devices on the state of the PCI driver if EGM is fully independent
of the state of the PCI device.
> > nvgrace-gpu is manipulating sysfs
> > on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> > aux device relative to the state of the GPU, we're trying to add a
> > driver that can bind to device created by an out-of-tree driver, and
> > we're inventing new uAPIs on the chardev for things that already exist
> > for vfio regions.
>
> Sorry for the confusion. The nvgrace-egm would not bind to the device
> created by the out-of-tree driver. We would have a separate out-of-tree
> equivalent of nvgrace-egm to bind to the device by the out-of-tree vfio
> driver. Maybe we can consider exposing a register/unregister APIs from
> nvgrace-egm where a module (in-tree nvgrace / out-of-tree) can register
> a pdev and nvgrace-egm can use to fetch the region info.
Ok, this wasn't clear to me, but does that also mean that if some GPUs
are managed by nvgrace-gpu and others by out-of-tree drivers, the
in-kernel and out-of-tree equivalent drivers are both installing
chardevs as /dev/egmXX? Playing in the same space is ugly, but what
happens when the 2 GPUs per socket are split between drivers and they
both try to add the same chardev?
> > Therefore, does it actually make more sense to expose EGM as a device
> > specific region on the vfio device fd?
> >
> > For example, nvgrace-gpu might manage the de-duplication by only
> > exposing this device specific region on the lowest BDF GPU per socket.
> > The existing REGION_INFO ioctl handles reporting the size to the user.
> > The direct association to the GPU device handles reporting the node
> > locality. If necessary, a capability on the region could report the
> > associated PXM, and maybe even the retired page list.
> >
> > All of the lifecycle issues are automatically handled, there's no
> > separate aux device. If necessary, zapping and faulting across reset
> > is handled just like a BAR mapping.
>
> The EGM memory (which becomes the system memory of the VM) cannot
> be connected to the GPU reset as it is unrelated to the GPU device. We would
> not want that to happen to system memory on GPU reset.
It's not the state of the EGM/system memory that I'm concerned about,
it's the fact that the routing to access that memory traverses two
GPUs and both the backplane and c2c NVLink connections. If access
through that channel is 100% independent of the state of either GPU
then GPU resets are irrelevant.
However, I'd then ask the question of why we're associating EGM with the
GPU PCI driver at all. For instance, why should nvgrace-gpu spawn aux
devices to feed into an nvgrace-egm driver, and duplicate that whole
thing in an out-of-tree driver, when we could just have one in-kernel
platform(?) driver walk ACPI, find these ranges, and expose them as
chardevs entirely independent of the PCI driver bound to the GPU?
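As a sketch of that alternative: the entry list below stands in for the properties such a platform driver would read out of the GPUs' ACPI nodes, and the names mirror the /dev/egmX convention in the series. Everything here is illustrative, not a proposed API.

```c
#include <stdio.h>
#include <string.h>

/*
 * What a platform driver might recover per GPU from ACPI: each GPU on a
 * socket describes the same EGM carveout, tagged with the socket's
 * proximity domain.
 */
struct egm_acpi_entry {
	int pxm;
	unsigned long base_hpa;
	unsigned long length;
};

/*
 * Collapse the per-GPU descriptions into one chardev name per unique PXM.
 * Returns the number of unique regions; names[] receives one "egm<pxm>"
 * string each.
 */
static int egm_unique_regions(const struct egm_acpi_entry *entries, int n,
			      char names[][16], int max_names)
{
	int count = 0;
	int i, j;

	for (i = 0; i < n; i++) {
		char name[16];
		int seen = 0;

		snprintf(name, sizeof(name), "egm%d", entries[i].pxm);
		for (j = 0; j < count; j++)
			if (strcmp(names[j], name) == 0)
				seen = 1;
		if (!seen && count < max_names)
			strcpy(names[count++], name);
	}
	return count;
}
```

With this shape, the chardev set is a pure function of firmware data, independent of which (if any) vfio driver is bound to the GPUs.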
> > If we need to expose the EGM size and GPU association via sysfs for
> > management tooling, nvgrace-gpu could add an "egm_size" attribute to the
> > PCI device's sysfs node. This could also avoid the implicit
> > implementation knowledge about which GPU exposes the EGM device
> > specific region.
> >
> > Was such a design considered? It seems much, much simpler and could be
> > implemented by either nvgrace-gpu or identically by an out-of-tree
> > driver without references in an in-kernel ID table.
> >
> > I'd like to understand the pros and cons of such an approach vs the one
> > presented here. Thanks,
>
> We didn't consider it as a separate BAR / region as the EGM memory (part of the
> system memory) is unrelated to the GPU device besides having its information
> in the GPU ACPI table and becomes the system memory of the VM. Considering
> it as part of the device BAR / region would connect the lifecyle of the EGM region
> on the GPU device. Also we cannot consider zapping/faulting across GPU reset
> as it is system memory of the VM.
It's curious why the EGM description is associated to the GPU ACPI
object if it really is fully independent. It seems like perhaps it
should be a unique ACPI object in that case, which would make claiming
it via a platform driver easier. Maybe we don't need to be tied to
that firmware decision in the design of the software driver though.
Thanks,
Alex
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
2026-03-11 20:37 ` Alex Williamson
@ 2026-03-12 13:51 ` Ankit Agrawal
2026-03-12 14:59 ` Alex Williamson
0 siblings, 1 reply; 42+ messages in thread
From: Ankit Agrawal @ 2026-03-12 13:51 UTC (permalink / raw)
To: Alex Williamson, Jason Gunthorpe
Cc: Vikram Sethi, Matt Ochs, jgg@ziepe.ca, Shameer Kolothum Thodi,
Neo Jia, Zhi Wang, Krishnakant Jaju, Yishai Hadas,
kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
>> > nvgrace-gpu is manipulating sysfs
>> > on devices owned by nvgrace-egm, we don't have mechanisms to manage the
>> > aux device relative to the state of the GPU, we're trying to add a
>> > driver that can bind to device created by an out-of-tree driver, and
>> > we're inventing new uAPIs on the chardev for things that already exist
>> > for vfio regions.
>>
>> Sorry for the confusion. The nvgrace-egm would not bind to the device
>> created by the out-of-tree driver. We would have a separate out-of-tree
>> equivalent of nvgrace-egm to bind to the device by the out-of-tree vfio
>> driver. Maybe we can consider exposing a register/unregister APIs from
>> nvgrace-egm where a module (in-tree nvgrace / out-of-tree) can register
>> a pdev and nvgrace-egm can use to fetch the region info.
>
> Ok, this wasn't clear to me, but does that also mean that if some GPUs
> are managed by nvgrace-gpu and others by out-of-tree drivers that the
> in-kernel and out-of-tree equivalent drivers are both installing
> chardevs as /dev/egmXX? Playing in the same space is ugly, but what
> happens when the 2 GPUs per socket are split between drivers and they
> both try to added the same chardev?
But that would be an unsupported configuration. It is expected that all the
GPUs on the system and the EGM char devices are attached to the same
VM for full functionality. So either all the devices (GPU and EGM chardev)
would be bound to nvgrace-gpu, or all to the out-of-tree module. Please refer
to sec 8.1 of
https://docs.nvidia.com/multi-node-nvlink-systems/partition-guide-v1-2.pdf
Perhaps I should add this information to the commit message.
> However, I'd then ask the question why we're associating EGM to the GPU
> PCI driver at all. For instance, why should nvgrace-gpu spawn aux
> devices to feed into an nvgrace-egm driver, and duplicate that whole
> thing in an out-of-tree driver, when we could just have one in-kernel
> platform(?) driver walk ACPI, find these ranges, and expose them as
> chardev entirely independent of the PCI driver bound to the GPU?
So a new platform driver to walk through the ACPI tables, look for EGM
properties, and create EGM char devs?
Maybe it is okay, but given that all 4 EGM properties are under the GPU's
ACPI node and there is no independent ACPI _HID device identity, it sounds
a bit off to me. Do we have a precedent like that?
But as I mentioned above, the expectation is that the EGM devices and the GPU
devices are assigned to the same VM. So would it not make sense to
keep the association between the EGM devices and the GPU devices?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
2026-03-12 13:51 ` Ankit Agrawal
@ 2026-03-12 14:59 ` Alex Williamson
0 siblings, 0 replies; 42+ messages in thread
From: Alex Williamson @ 2026-03-12 14:59 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Vikram Sethi, Matt Ochs, jgg@ziepe.ca,
Shameer Kolothum Thodi, Neo Jia, Zhi Wang, Krishnakant Jaju,
Yishai Hadas, kevin.tian@intel.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, alex
On Thu, 12 Mar 2026 13:51:20 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:
> >> > nvgrace-gpu is manipulating sysfs
> >> > on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> >> > aux device relative to the state of the GPU, we're trying to add a
> >> > driver that can bind to device created by an out-of-tree driver, and
> >> > we're inventing new uAPIs on the chardev for things that already exist
> >> > for vfio regions.
> >>
> >> Sorry for the confusion. The nvgrace-egm would not bind to the device
> >> created by the out-of-tree driver. We would have a separate out-of-tree
> >> equivalent of nvgrace-egm to bind to the device by the out-of-tree vfio
> >> driver. Maybe we can consider exposing a register/unregister APIs from
> >> nvgrace-egm where a module (in-tree nvgrace / out-of-tree) can register
> >> a pdev and nvgrace-egm can use to fetch the region info.
> >
> > Ok, this wasn't clear to me, but does that also mean that if some GPUs
> > are managed by nvgrace-gpu and others by out-of-tree drivers that the
> > in-kernel and out-of-tree equivalent drivers are both installing
> > chardevs as /dev/egmXX? Playing in the same space is ugly, but what
> > happens when the 2 GPUs per socket are split between drivers and they
> > both try to added the same chardev?
>
> But that would be an unsupported configuration. It is expected that all the
> GPUs on the system and the EGM char devices to be attached to the same
> VM for full functionality. So either all the devices (GPU and EGM chardev)
> would be bound to nvgrace or to the out-of-tree module. Please refer sec 8.1
> https://docs.nvidia.com/multi-node-nvlink-systems/partition-guide-v1-2.pdf
> Perhaps I should add this information in the commit message.
Just because it can be documented as a policy doesn't make it an
agreeable architecture.
> > However, I'd then ask the question why we're associating EGM to the GPU
> > PCI driver at all. For instance, why should nvgrace-gpu spawn aux
> > devices to feed into an nvgrace-egm driver, and duplicate that whole
> > thing in an out-of-tree driver, when we could just have one in-kernel
> > platform(?) driver walk ACPI, find these ranges, and expose them as
> > chardev entirely independent of the PCI driver bound to the GPU?
>
> So a new platform driver to walk through the ACPI and look for EGM properties
> and create EGM char devs?
>
> Maybe it is okay, but given that all the 4 EGM properties are under the GPU's
> ACPI node and there being no independent ACPI _HID device identity, it sounds
> a bit off to me. Do we have a precedent like that?
>
> But as I mentioned above, the expectation is that the EGM devices and the GPU
> devices to be assigned to the same VM. So would it not make sense that we
> keep the association between the EGM devices and the GPU devices?
You're telling me that the EGM access is 100% independent of any state
related to the GPU, so why would we tie the lifecycle of these aux
devices to any particular driver for the GPU or re-implement it across
multiple drivers? That doesn't make sense to me. Thanks,
Alex
^ permalink raw reply [flat|nested] 42+ messages in thread