* [RFC 00/14] cover-letter: Add virtualization support for EGM
@ 2025-09-04 4:08 ankita
2025-09-04 4:08 ` [RFC 01/14] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
` (13 more replies)
0 siblings, 14 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Background
----------
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
feature, which enables the GPU to access system memory allocations
within and across nodes through a high-bandwidth path. This access path
goes as: GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize
system memory located on the same socket, on a different socket, or
even on a different node in a multi-node system [1]. This feature is
being extended to virtualization.
Design Details
--------------
When EGM is enabled in the virtualization stack, the host memory
is partitioned into two parts: one partition for Host OS usage,
called the Hypervisor region, and a second Hypervisor-Invisible (HI)
region for the VM. Only the Hypervisor region is part of the host EFI
map and is thus visible to the host OS on bootup. Since the entire VM
sysmem is eligible for EGM allocations within the VM, the HI partition
is interchangeably called the EGM region in this series. The base SPA
and size of the HI/EGM region range are exposed through ACPI DSDT
properties.
While the EGM region is accessible on the host, it is not added to
the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
to the SPA using remap_pfn_range().
The following figure shows the memory map in the virtualization
environment.
|---- Sysmem ----| |--- GPU mem ---| VM Memory
| | | |
|IPA <-> SPA map | |IPA <-> SPA map|
| | | |
|--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
The patch series introduces a new nvgrace-egm auxiliary driver module
to manage and map the HI/EGM region on Grace Blackwell systems.
It binds to the auxiliary device created by the parent
nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
(out-of-tree open source module for SRIOV vGPU) to manage the
EGM region for the VM. Note that there is a unique EGM region per
socket, and an auxiliary device gets created for every region. The
parent module fetches the EGM region information from the ACPI
tables and populates the data structures shared with the auxiliary
nvgrace-egm module.
The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
from the EGM region structure shared by the parent device.
2. Create a char device that can be used as memory-backend-file by QEMU
for the VM and implement file operations. The char device is /dev/egmX,
where X is the PXM node ID of the EGM being mapped, fetched in step 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range().
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired ECC pages in the EGM region.
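The chardev naming in step 2 and the one-time zeroing in step 3 can be
modeled in plain userspace C; the struct and field names below are
illustrative, not the driver's:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy model of an EGM region: exposed as /dev/egmX, X = PXM node ID. */
struct egm_cdev {
	unsigned long pxm;	/* proximity domain from ACPI */
	bool opened_once;	/* tracks first open() for one-time zeroing */
	char mem[16];		/* stand-in for the HI/EGM memory range */
};

static void egm_devname(const struct egm_cdev *c, char *buf, size_t len)
{
	snprintf(buf, len, "/dev/egm%lu", c->pxm);
}

/* On the first open, the region is zeroed before being handed to the VM. */
static void egm_open(struct egm_cdev *c)
{
	if (!c->opened_once) {
		memset(c->mem, 0, sizeof(c->mem));
		c->opened_once = true;
	}
}
```

Subsequent open() calls skip the memset, matching the "first open()"
wording above.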
Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
in the same directory.
Implementation
--------------
Patches 1-4 make changes to the nvgrace-gpu module to fetch the
EGM information, create the auxiliary device and save the EGM region
information in the shared structures.
Patches 5-10 introduce the new nvgrace-egm module to manage the EGM
region. The module implements a char device to expose the EGM to
usermode apps such as QEMU. The module maps the
QEMU VMA to the EGM SPA using remap_pfn_range().
Patches 11-12 fetch the list of pages on EGM with known ECC errors.
Patches 13-14 expose the EGM topology and size through sysfs.
Enablement
----------
The EGM mode is enabled through a flag in the SBIOS. The size of
the Hypervisor region is configurable through a second SBIOS
parameter. All the remaining system memory on the host is
invisible to the hypervisor.
Verification
------------
The series applies over v6.17-rc4 and uses the QEMU repository [3].
Tested on the Grace Blackwell platform by booting a VM, loading the
NVIDIA module [2] and running nvidia-smi in the VM to check for the
presence of the EGM capability.
To run CUDA workloads, there is a dependency on the Nested Page Table
patches being worked on separately by Shameer Kolothum
(skolothumtho@nvidia.com).
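For reference, a VM could plausibly consume the chardev as a file-backed
memory backend along the following lines; the object ids, sizes, machine
options and BDF below are illustrative assumptions, and the exact syntax
depends on the QEMU branch in [3]:

```shell
# Hypothetical invocation: back a guest NUMA node with the EGM chardev
# exported by nvgrace-egm (/dev/egm0 here).
qemu-system-aarch64 \
    -machine virt -m 64G \
    -object memory-backend-file,id=egm0,mem-path=/dev/egm0,size=64G,share=on \
    -numa node,nodeid=0,memdev=egm0 \
    -device vfio-pci,host=0009:01:00.0
```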
Recognitions
------------
Many thanks to Jason Gunthorpe, Vikram Sethi and Aniket Agashe for design
suggestions, and to Matt Ochs, Andy Currid, Neo Jia, Kirti Wankhede among
others for the review feedback.
Links
-----
Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/ankita-nv/nicolinc-qemu/tree/iommufd_veventq-v9-egm-0903 [3]
Ankit Agrawal (14):
vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
vfio/nvgrace-gpu: Create auxiliary device for EGM
vfio/nvgrace-gpu: track GPUs associated with the EGM regions
vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
vfio/nvgrace-egm: Introduce module to manage EGM
vfio/nvgrace-egm: Introduce egm class and register char device numbers
vfio/nvgrace-egm: Register auxiliary driver ops
vfio/nvgrace-egm: Expose EGM region as char device
vfio/nvgrace-egm: Add chardev ops for EGM management
vfio/nvgrace-egm: Clear Memory before handing out to VM
vfio/nvgrace-egm: Fetch EGM region retired pages list
vfio/nvgrace-egm: Introduce ioctl to share retired pages
vfio/nvgrace-egm: expose the egm size through sysfs
vfio/nvgrace-gpu: Add link from pci to EGM
MAINTAINERS | 12 +-
drivers/vfio/pci/nvgrace-gpu/Kconfig | 11 +
drivers/vfio/pci/nvgrace-gpu/Makefile | 5 +-
drivers/vfio/pci/nvgrace-gpu/egm.c | 418 +++++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 174 ++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 24 ++
drivers/vfio/pci/nvgrace-gpu/main.c | 117 ++++++-
include/linux/nvgrace-egm.h | 33 ++
include/uapi/linux/egm.h | 26 ++
9 files changed, 816 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
create mode 100644 include/linux/nvgrace-egm.h
create mode 100644 include/uapi/linux/egm.h
--
2.34.1
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 01/14] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
@ 2025-09-04 4:08 ` ankita
2025-09-04 4:08 ` [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
` (12 subsequent siblings)
13 siblings, 0 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Allow custom changes to the nvgrace-gpu module init functions by
open-coding the module_pci_driver() helper.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index d95761dcdd58..72e7ac1fa309 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -1009,7 +1009,17 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
.driver_managed_dma = true,
};
-module_pci_driver(nvgrace_gpu_vfio_pci_driver);
+static int __init nvgrace_gpu_vfio_pci_init(void)
+{
+ return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
+}
+module_init(nvgrace_gpu_vfio_pci_init);
+
+static void __exit nvgrace_gpu_vfio_pci_cleanup(void)
+{
+ pci_unregister_driver(&nvgrace_gpu_vfio_pci_driver);
+}
+module_exit(nvgrace_gpu_vfio_pci_cleanup);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
2025-09-04 4:08 ` [RFC 01/14] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
@ 2025-09-04 4:08 ` ankita
2025-09-15 6:56 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
` (11 subsequent siblings)
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The Extended GPU Memory (EGM) feature enables GPU access to
system memory across sockets and physical systems on
Grace Hopper and Grace Blackwell systems. When the feature is
enabled through SBIOS, part of the system memory is made available
to the GPU for access through the EGM path.
The EGM functionality is separate and largely independent from the
core GPU device functionality. However, the EGM region information,
base SPA and size, is associated with the GPU in the ACPI tables.
Representing EGM as an auxiliary device suits this
architecture well.
The parent GPU device creates an EGM auxiliary device to be managed
independently by an auxiliary EGM driver. The EGM region information
is kept as part of the shared struct nvgrace_egm_dev along with the
auxiliary device handle.
Each socket has a separate EGM region, and hence a multi-socket system
has multiple EGM regions. Each EGM region has a separate nvgrace_egm_dev,
and nvgrace-gpu keeps the EGM regions as part of a list.
Note that EGM is an optional feature enabled through SBIOS. The EGM
properties are only populated in the ACPI tables if the feature is enabled;
they are absent otherwise. The absence of the properties is thus not
considered fatal. An improper set of values, however, is
considered fatal.
It is also noteworthy that there may be multiple GPUs per
socket, each carrying duplicate EGM region information. Make sure
the duplicate data does not get added.
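The dedup requirement above amounts to keying the region list by
proximity domain. A minimal userspace sketch of that invariant, with
names that are illustrative rather than the driver's:

```c
#include <stdlib.h>

/* Regions are keyed by proximity domain (pxm); a second GPU reporting
 * the same pxm must reuse the existing entry, never create another. */
struct egm_region {
	unsigned long pxm;
	struct egm_region *next;
};

static struct egm_region *egm_find(struct egm_region *head, unsigned long pxm)
{
	for (; head; head = head->next)
		if (head->pxm == pxm)
			return head;
	return NULL;
}

/* Add a region only if no entry for this pxm exists yet. */
static struct egm_region *egm_add_unique(struct egm_region **head,
					 unsigned long pxm)
{
	struct egm_region *r = egm_find(*head, pxm);

	if (r)
		return r;	/* duplicate: reuse the existing region */

	r = calloc(1, sizeof(*r));
	if (!r)
		return NULL;
	r->pxm = pxm;
	r->next = *head;
	*head = r;
	return r;
}
```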
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 5 +-
drivers/vfio/pci/nvgrace-gpu/Makefile | 2 +-
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 ++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 +++++++
drivers/vfio/pci/nvgrace-gpu/main.c | 70 +++++++++++++++++++++++++-
include/linux/nvgrace-egm.h | 23 +++++++++
6 files changed, 175 insertions(+), 3 deletions(-)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
create mode 100644 include/linux/nvgrace-egm.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 6dcfbd11efef..dd7df834b70b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26471,7 +26471,10 @@ VFIO NVIDIA GRACE GPU DRIVER
M: Ankit Agrawal <ankita@nvidia.com>
L: kvm@vger.kernel.org
S: Supported
-F: drivers/vfio/pci/nvgrace-gpu/
+F: drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+F: drivers/vfio/pci/nvgrace-gpu/main.c
+F: include/linux/nvgrace-egm.h
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
index 3ca8c187897a..e72cc6739ef8 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Makefile
+++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
-nvgrace-gpu-vfio-pci-y := main.o
+nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
new file mode 100644
index 000000000000..f4e27dadf1ef
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/vfio_pci_core.h>
+#include "egm_dev.h"
+
+/*
+ * Determine if the EGM feature is enabled. If disabled, there
+ * will be no EGM properties populated in the ACPI tables and this
+ * fetch would fail.
+ */
+int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
+{
+ return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
+ pegmpxm);
+}
+
+static void nvgrace_gpu_release_aux_device(struct device *device)
+{
+ struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
+ struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+
+ kvfree(egm_dev);
+}
+
+struct nvgrace_egm_dev *
+nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
+ u64 egmpxm)
+{
+ struct nvgrace_egm_dev *egm_dev;
+ int ret;
+
+ egm_dev = kvzalloc(sizeof(*egm_dev), GFP_KERNEL);
+ if (!egm_dev)
+ goto create_err;
+
+ egm_dev->egmpxm = egmpxm;
+ egm_dev->aux_dev.id = egmpxm;
+ egm_dev->aux_dev.name = name;
+ egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
+ egm_dev->aux_dev.dev.parent = &pdev->dev;
+
+ ret = auxiliary_device_init(&egm_dev->aux_dev);
+ if (ret)
+ goto free_dev;
+
+ ret = auxiliary_device_add(&egm_dev->aux_dev);
+ if (ret) {
+ auxiliary_device_uninit(&egm_dev->aux_dev);
+ goto create_err;
+ }
+
+ return egm_dev;
+
+free_dev:
+ kvfree(egm_dev);
+create_err:
+ return NULL;
+}
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
new file mode 100644
index 000000000000..c00f5288f4e7
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef EGM_DEV_H
+#define EGM_DEV_H
+
+#include <linux/nvgrace-egm.h>
+
+int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
+
+struct nvgrace_egm_dev *
+nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
+ u64 egmphys);
+
+#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 72e7ac1fa309..2cf851492990 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -7,6 +7,8 @@
#include <linux/vfio_pci_core.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
+#include <linux/nvgrace-egm.h>
+#include "egm_dev.h"
/*
* The device memory usable to the workloads running in the VM is cached
@@ -60,6 +62,63 @@ struct nvgrace_gpu_pci_core_device {
bool has_mig_hw_bug;
};
+static struct list_head egm_dev_list;
+
+static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
+{
+ struct nvgrace_egm_dev_entry *egm_entry;
+ u64 egmpxm;
+ int ret = 0;
+
+ /*
+ * EGM is an optional feature enabled in SBIOS. If disabled, there
+ * will be no EGM properties populated in the ACPI tables and this
+ * fetch would fail. Treat this failure as non-fatal and return
+ * early.
+ */
+ if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
+ goto exit;
+
+ egm_entry = kvzalloc(sizeof(*egm_entry), GFP_KERNEL);
+ if (!egm_entry)
+ return -ENOMEM;
+
+ egm_entry->egm_dev =
+ nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
+ egmpxm);
+ if (!egm_entry->egm_dev) {
+ kvfree(egm_entry);
+ ret = -EINVAL;
+ goto exit;
+ }
+
+ list_add_tail(&egm_entry->list, &egm_dev_list);
+
+exit:
+ return ret;
+}
+
+static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
+{
+ struct nvgrace_egm_dev_entry *egm_entry, *temp_egm_entry;
+ u64 egmpxm;
+
+ if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
+ return;
+
+ list_for_each_entry_safe(egm_entry, temp_egm_entry, &egm_dev_list, list) {
+ /*
+ * Free the EGM region corresponding to the input GPU
+ * device.
+ */
+ if (egm_entry->egm_dev->egmpxm == egmpxm) {
+ auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
+ list_del(&egm_entry->list);
+ kvfree(egm_entry);
+ }
+ }
+}
+
static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
{
struct nvgrace_gpu_pci_core_device *nvdev =
@@ -965,14 +1024,20 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
memphys, memlength);
if (ret)
goto out_put_vdev;
+
+ ret = nvgrace_gpu_create_egm_aux_device(pdev);
+ if (ret)
+ goto out_put_vdev;
}
ret = vfio_pci_core_register_device(&nvdev->core_device);
if (ret)
- goto out_put_vdev;
+ goto out_reg;
return ret;
+out_reg:
+ nvgrace_gpu_destroy_egm_aux_device(pdev);
out_put_vdev:
vfio_put_device(&nvdev->core_device.vdev);
return ret;
@@ -982,6 +1047,7 @@ static void nvgrace_gpu_remove(struct pci_dev *pdev)
{
struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+ nvgrace_gpu_destroy_egm_aux_device(pdev);
vfio_pci_core_unregister_device(core_device);
vfio_put_device(&core_device->vdev);
}
@@ -1011,6 +1077,8 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
static int __init nvgrace_gpu_vfio_pci_init(void)
{
+ INIT_LIST_HEAD(&egm_dev_list);
+
return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
}
module_init(nvgrace_gpu_vfio_pci_init);
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
new file mode 100644
index 000000000000..9575d4ad4338
--- /dev/null
+++ b/include/linux/nvgrace-egm.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef NVGRACE_EGM_H
+#define NVGRACE_EGM_H
+
+#include <linux/auxiliary_bus.h>
+
+#define NVGRACE_EGM_DEV_NAME "egm"
+
+struct nvgrace_egm_dev {
+ struct auxiliary_device aux_dev;
+ u64 egmpxm;
+};
+
+struct nvgrace_egm_dev_entry {
+ struct list_head list;
+ struct nvgrace_egm_dev *egm_dev;
+};
+
+#endif /* NVGRACE_EGM_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
2025-09-04 4:08 ` [RFC 01/14] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
2025-09-04 4:08 ` [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
@ 2025-09-04 4:08 ` ankita
2025-09-15 7:19 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 04/14] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
` (10 subsequent siblings)
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Grace Blackwell systems can have multiple GPUs on a socket, all
associated with the EGM region for that socket. Track the GPUs
as a list.
On device probe, the device's pci_dev struct is added to a
linked list on the appropriate EGM region.
Similarly, on device remove, the pci_dev struct for the GPU
is removed from the EGM region.
Since the GPUs on a socket share the same EGM region, they have
the same set of EGM region information. Skip the EGM region
information fetch if it has already been done through a different
GPU on the same socket.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 29 ++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++
drivers/vfio/pci/nvgrace-gpu/main.c | 34 +++++++++++++++++++++++---
include/linux/nvgrace-egm.h | 6 +++++
4 files changed, 70 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index f4e27dadf1ef..28cfd29eda56 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -17,6 +17,33 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
pegmpxm);
}
+int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
+{
+ struct gpu_node *node;
+
+ node = kvzalloc(sizeof(*node), GFP_KERNEL);
+ if (!node)
+ return -ENOMEM;
+
+ node->pdev = pdev;
+
+ list_add_tail(&node->list, &egm_dev->gpus);
+
+ return 0;
+}
+
+void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
+{
+ struct gpu_node *node, *tmp;
+
+ list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
+ if (node->pdev == pdev) {
+ list_del(&node->list);
+ kvfree(node);
+ }
+ }
+}
+
static void nvgrace_gpu_release_aux_device(struct device *device)
{
struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
@@ -37,6 +64,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
goto create_err;
egm_dev->egmpxm = egmpxm;
+ INIT_LIST_HEAD(&egm_dev->gpus);
+
egm_dev->aux_dev.id = egmpxm;
egm_dev->aux_dev.name = name;
egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index c00f5288f4e7..1635753c9e50 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -10,6 +10,10 @@
int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
+int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
+
+void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
+
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
u64 egmphys);
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 2cf851492990..436f0ac17332 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -66,9 +66,10 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
- struct nvgrace_egm_dev_entry *egm_entry;
+ struct nvgrace_egm_dev_entry *egm_entry = NULL;
u64 egmpxm;
int ret = 0;
+ bool is_new_region = false;
/*
* EGM is an optional feature enabled in SBIOS. If disabled, there
@@ -79,6 +80,19 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
+ list_for_each_entry(egm_entry, &egm_dev_list, list) {
+ /*
+ * A system could have multiple GPUs associated with an
+ * EGM region and will have the same set of EGM region
+ * information. Skip the EGM region information fetch if
+ * already done through a differnt GPU on the same socket.
+ */
+ if (egm_entry->egm_dev->egmpxm == egmpxm)
+ goto add_gpu;
+ }
+
+ is_new_region = true;
+
egm_entry = kvzalloc(sizeof(*egm_entry), GFP_KERNEL);
if (!egm_entry)
return -ENOMEM;
@@ -87,13 +101,23 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
egmpxm);
if (!egm_entry->egm_dev) {
- kvfree(egm_entry);
ret = -EINVAL;
+ goto free_egm_entry;
+ }
+
+add_gpu:
+ ret = add_gpu(egm_entry->egm_dev, pdev);
+ if (!ret) {
+ if (is_new_region)
+ list_add_tail(&egm_entry->list, &egm_dev_list);
goto exit;
}
- list_add_tail(&egm_entry->list, &egm_dev_list);
+ if (is_new_region)
+ auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
+free_egm_entry:
+ kvfree(egm_entry);
exit:
return ret;
}
@@ -112,6 +136,10 @@ static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
* device.
*/
if (egm_entry->egm_dev->egmpxm == egmpxm) {
+ remove_gpu(egm_entry->egm_dev, pdev);
+ if (!list_empty(&egm_entry->egm_dev->gpus))
+ break;
+
auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
list_del(&egm_entry->list);
kvfree(egm_entry);
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index 9575d4ad4338..e42494a2b1a6 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -10,9 +10,15 @@
#define NVGRACE_EGM_DEV_NAME "egm"
+struct gpu_node {
+ struct list_head list;
+ struct pci_dev *pdev;
+};
+
struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
u64 egmpxm;
+ struct list_head gpus;
};
struct nvgrace_egm_dev_entry {
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 04/14] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (2 preceding siblings ...)
2025-09-04 4:08 ` [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
@ 2025-09-04 4:08 ` ankita
2025-09-04 4:08 ` [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM ankita
` (9 subsequent siblings)
13 siblings, 0 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The nvgrace-gpu module tracks the various EGM regions on the system.
The EGM region information, base SPA and size, is part of the ACPI
tables and can be fetched from the DSD properties using the GPU handle.
When the GPUs are bound to the nvgrace-gpu module, it fetches the EGM
region information from the ACPI tables using the GPU's pci_dev. The
EGM regions are tracked in a list, and the per-region information is
maintained in nvgrace_egm_dev.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 24 +++++++++++++++++++++++-
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++-
drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++++++--
include/linux/nvgrace-egm.h | 2 ++
4 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index 28cfd29eda56..ca50bc1f67a0 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -17,6 +17,26 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
pegmpxm);
}
+int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
+ u64 *pegmlength)
+{
+ int ret;
+
+ /*
+ * The memory information is present in the system ACPI tables as DSD
+ * properties nvidia,egm-base-pa and nvidia,egm-size.
+ */
+ ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
+ pegmlength);
+ if (ret)
+ return ret;
+
+ ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
+ pegmphys);
+
+ return ret;
+}
+
int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
{
struct gpu_node *node;
@@ -54,7 +74,7 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmpxm)
+ u64 egmphys, u64 egmlength, u64 egmpxm)
{
struct nvgrace_egm_dev *egm_dev;
int ret;
@@ -64,6 +84,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
goto create_err;
egm_dev->egmpxm = egmpxm;
+ egm_dev->egmphys = egmphys;
+ egm_dev->egmlength = egmlength;
INIT_LIST_HEAD(&egm_dev->gpus);
egm_dev->aux_dev.id = egmpxm;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index 1635753c9e50..2e1612445898 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -16,6 +16,8 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys);
+ u64 egmphys, u64 egmlength, u64 egmpxm);
+int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
+ u64 *pegmlength);
#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 436f0ac17332..7486a1b49275 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -67,7 +67,7 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
struct nvgrace_egm_dev_entry *egm_entry = NULL;
- u64 egmpxm;
+ u64 egmphys, egmlength, egmpxm;
int ret = 0;
bool is_new_region = false;
@@ -80,6 +80,10 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
+ ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
+ if (ret)
+ goto exit;
+
list_for_each_entry(egm_entry, &egm_dev_list, list) {
/*
* A system could have multiple GPUs associated with an
@@ -99,7 +103,7 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
egm_entry->egm_dev =
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
- egmpxm);
+ egmphys, egmlength, egmpxm);
if (!egm_entry->egm_dev) {
ret = -EINVAL;
goto free_egm_entry;
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index e42494a2b1a6..a66906753267 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -17,6 +17,8 @@ struct gpu_node {
struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
+ phys_addr_t egmphys;
+ size_t egmlength;
u64 egmpxm;
struct list_head gpus;
};
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (3 preceding siblings ...)
2025-09-04 4:08 ` [RFC 04/14] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:26 ` Jason Gunthorpe
2025-09-15 7:47 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 06/14] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
` (8 subsequent siblings)
13 siblings, 2 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The Extended GPU Memory (EGM) feature enables the GPU to access
system memory allocations within and across nodes through a high
bandwidth path on Grace-based systems. The GPU can utilize
system memory located on the same socket, on a different socket,
or even on a different node in a multi-node system [1].
When the EGM mode is enabled through SBIOS, the host system memory is
partitioned into two parts: one partition for Host OS usage,
called the Hypervisor region, and a second Hypervisor-Invisible (HI)
region for the VM. Only the Hypervisor region is part of the host EFI
map and is thus visible to the host OS on bootup. Since the entire VM
sysmem is eligible for EGM allocations within the VM, the HI partition
is interchangeably called the EGM region in this series. The base SPA
and size of the HI/EGM region range are exposed through ACPI DSDT
properties.
While the EGM region is accessible on the host, it is not added to
the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
to the SPA using remap_pfn_range().
The following figure shows the memory map in the virtualization
environment.
|---- Sysmem ----| |--- GPU mem ---| VM Memory
| | | |
|IPA <-> SPA map | |IPA <-> SPA map|
| | | |
|--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
Introduce a new nvgrace-egm auxiliary driver module to manage and
map the HI/EGM region on Grace Blackwell systems. This binds to
the auxiliary device created by the parent nvgrace-gpu (in-tree
module for device assignment) / nvidia-vgpu-vfio (out-of-tree open
source module for SRIOV vGPU) to manage the EGM region for the VM.
Note that there is a unique EGM region per socket and an auxiliary
device gets created for every region. The parent module fetches the
EGM region information from the ACPI tables and populates the data
structures shared with the auxiliary nvgrace-egm module.
The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
from the parent device's shared EGM region structure.
2. Create a char device that QEMU can use as a memory-backend-file
for the VM, and implement its file operations. The char device is
/dev/egmX, where X is the PXM node ID of the EGM region fetched in
step 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range().
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired ECC pages in the EGM region.
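As a rough illustration of step 2, a VM could consume the chardev through
QEMU's memory-backend-file object. The options below are an assumption for
illustration only; this series does not prescribe a QEMU command line, and
sizes/paths are made up:

```shell
# Hypothetical invocation: back a single-node guest's memory with the
# EGM chardev for proximity domain 0. Illustrative only.
qemu-system-aarch64 \
    -m 64G \
    -object memory-backend-file,id=egm0,mem-path=/dev/egm0,size=64G,share=on \
    -numa node,nodeid=0,memdev=egm0
```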
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 6 ++++++
drivers/vfio/pci/nvgrace-gpu/Kconfig | 11 +++++++++++
drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +++
drivers/vfio/pci/nvgrace-gpu/egm.c | 22 ++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/main.c | 1 +
5 files changed, 43 insertions(+)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
diff --git a/MAINTAINERS b/MAINTAINERS
index dd7df834b70b..ec6bc10f346d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26476,6 +26476,12 @@ F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
F: drivers/vfio/pci/nvgrace-gpu/main.c
F: include/linux/nvgrace-egm.h
+VFIO NVIDIA GRACE EGM DRIVER
+M: Ankit Agrawal <ankita@nvidia.com>
+L: kvm@vger.kernel.org
+S: Supported
+F: drivers/vfio/pci/nvgrace-gpu/egm.c
+
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
R: Yishai Hadas <yishaih@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
index a7f624b37e41..d5773bbd22f5 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Kconfig
+++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
@@ -1,8 +1,19 @@
# SPDX-License-Identifier: GPL-2.0-only
+config NVGRACE_EGM
+ tristate "EGM driver for NVIDIA Grace Hopper and Blackwell Superchip"
+ depends on ARM64 || (COMPILE_TEST && 64BIT)
+ help
+ Extended GPU Memory (EGM) support for the GPU in the NVIDIA Grace
+ based chips, required to make CPU memory available as additional
+ cross-node/cross-socket memory for the GPU using KVM/qemu.
+
+ If you don't know what to do here, say N.
+
config NVGRACE_GPU_VFIO_PCI
tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
depends on ARM64 || (COMPILE_TEST && 64BIT)
select VFIO_PCI_CORE
+ select NVGRACE_EGM
help
VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
required to assign the GPU device to userspace using KVM/qemu/etc.
diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
index e72cc6739ef8..d0d191be56b9 100644
--- a/drivers/vfio/pci/nvgrace-gpu/Makefile
+++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
@@ -1,3 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
+
+obj-$(CONFIG_NVGRACE_EGM) += nvgrace-egm.o
+nvgrace-egm-y := egm.o
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
new file mode 100644
index 000000000000..999808807019
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/vfio_pci_core.h>
+
+static int __init nvgrace_egm_init(void)
+{
+ return 0;
+}
+
+static void __exit nvgrace_egm_cleanup(void)
+{
+}
+
+module_init(nvgrace_egm_init);
+module_exit(nvgrace_egm_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
+MODULE_DESCRIPTION("NVGRACE EGM - Module to support Extended GPU Memory on NVIDIA Grace Based systems");
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 7486a1b49275..b1ccd1ac2e0a 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -1125,3 +1125,4 @@ MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
MODULE_DESCRIPTION("VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
+MODULE_SOFTDEP("pre: nvgrace-egm");
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 06/14] vfio/nvgrace-egm: Introduce egm class and register char device numbers
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (4 preceding siblings ...)
2025-09-04 4:08 ` [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM ankita
@ 2025-09-04 4:08 ` ankita
2025-09-04 4:08 ` [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops ankita
` (7 subsequent siblings)
13 siblings, 0 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM regions are exposed to userspace as char devices. A unique
char device with a distinct minor number is assigned to the EGM
region of each Grace socket.
Add a new egm class and register a range of char device numbers for
it.
Set MAX_EGM_NODES to 4, since 4 sockets is the largest configuration
on Grace based systems.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 36 ++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 999808807019..6bab4d94cb99 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -4,14 +4,50 @@
*/
#include <linux/vfio_pci_core.h>
+#include <linux/nvgrace-egm.h>
+
+#define MAX_EGM_NODES 4
+
+static dev_t dev;
+static struct class *class;
+
+static char *egm_devnode(const struct device *device, umode_t *mode)
+{
+ if (mode)
+ *mode = 0600;
+
+ return NULL;
+}
static int __init nvgrace_egm_init(void)
{
+ int ret;
+
+ /*
+ * Each EGM region on a system is represented with a unique
+ * char device with a different minor number. Allow a range
+ * of char device creation.
+ */
+ ret = alloc_chrdev_region(&dev, 0, MAX_EGM_NODES,
+ NVGRACE_EGM_DEV_NAME);
+ if (ret < 0)
+ return ret;
+
+ class = class_create(NVGRACE_EGM_DEV_NAME);
+ if (IS_ERR(class)) {
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
+ return PTR_ERR(class);
+ }
+
+ class->devnode = egm_devnode;
+
return 0;
}
static void __exit nvgrace_egm_cleanup(void)
{
+ class_destroy(class);
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
}
module_init(nvgrace_egm_init);
--
2.34.1
* [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (5 preceding siblings ...)
2025-09-04 4:08 ` [RFC 06/14] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:31 ` Jason Gunthorpe
2025-09-04 4:08 ` [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device ankita
` (6 subsequent siblings)
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Set up dummy auxiliary driver ops so that the nvgrace-egm auxiliary
driver can probe the auxiliary devices.
Both nvgrace-gpu and the out-of-tree nvidia-vgpu-vfio will make
use of EGM, for the device assignment and the SRIOV vGPU
virtualization solutions respectively. Hence allow auxiliary device
probing for both.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 39 +++++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 6bab4d94cb99..12d4e6e83fff 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -11,6 +11,30 @@
static dev_t dev;
static struct class *class;
+static int egm_driver_probe(struct auxiliary_device *aux_dev,
+ const struct auxiliary_device_id *id)
+{
+ return 0;
+}
+
+static void egm_driver_remove(struct auxiliary_device *aux_dev)
+{
+}
+
+static const struct auxiliary_device_id egm_id_table[] = {
+ { .name = "nvgrace_gpu_vfio_pci.egm" },
+ { .name = "nvidia_vgpu_vfio.egm" },
+ { },
+};
+MODULE_DEVICE_TABLE(auxiliary, egm_id_table);
+
+static struct auxiliary_driver egm_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = egm_id_table,
+ .probe = egm_driver_probe,
+ .remove = egm_driver_remove,
+};
+
static char *egm_devnode(const struct device *device, umode_t *mode)
{
if (mode)
@@ -35,19 +59,28 @@ static int __init nvgrace_egm_init(void)
class = class_create(NVGRACE_EGM_DEV_NAME);
if (IS_ERR(class)) {
- unregister_chrdev_region(dev, MAX_EGM_NODES);
- return PTR_ERR(class);
+ ret = PTR_ERR(class);
+ goto unregister_chrdev;
}
class->devnode = egm_devnode;
- return 0;
+ ret = auxiliary_driver_register(&egm_driver);
+ if (!ret)
+ goto fn_exit;
+
+ class_destroy(class);
+unregister_chrdev:
+ unregister_chrdev_region(dev, MAX_EGM_NODES);
+fn_exit:
+ return ret;
}
static void __exit nvgrace_egm_cleanup(void)
{
class_destroy(class);
unregister_chrdev_region(dev, MAX_EGM_NODES);
+ auxiliary_driver_unregister(&egm_driver);
}
module_init(nvgrace_egm_init);
--
2.34.1
* [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (6 preceding siblings ...)
2025-09-04 4:08 ` [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:34 ` Jason Gunthorpe
2025-09-15 8:36 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
` (5 subsequent siblings)
13 siblings, 2 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM module exposes the EGM regions as char devices. A
usermode app such as QEMU may mmap a region and use it as VM sysmem.
Each EGM region is represented by a unique char device /dev/egmX
bearing a distinct minor number.
The EGM module implements the mmap file_ops to manage the usermode
app's VMA mapping to the EGM region. The appropriate region is
determined from the minor number.
Note that the EGM memory region is invisible to the host kernel, as it
is not present in the host EFI map. The host Linux MM thus cannot manage
the memory, even though it is accessible through the host SPA. The EGM
module therefore uses remap_pfn_range() to map the VMA to the EGM region.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 99 ++++++++++++++++++++++++++++++
1 file changed, 99 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 12d4e6e83fff..c2dce5fa797a 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -10,15 +10,114 @@
static dev_t dev;
static struct class *class;
+static DEFINE_XARRAY(egm_chardevs);
+
+struct chardev {
+ struct device device;
+ struct cdev cdev;
+};
+
+static int nvgrace_egm_open(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int nvgrace_egm_release(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return 0;
+}
+
+static const struct file_operations file_ops = {
+ .owner = THIS_MODULE,
+ .open = nvgrace_egm_open,
+ .release = nvgrace_egm_release,
+ .mmap = nvgrace_egm_mmap,
+};
+
+static void egm_chardev_release(struct device *dev)
+{
+ struct chardev *egm_chardev = container_of(dev, struct chardev, device);
+
+ kvfree(egm_chardev);
+}
+
+static struct chardev *
+setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
+{
+ struct chardev *egm_chardev;
+ int ret;
+
+ egm_chardev = kvzalloc(sizeof(*egm_chardev), GFP_KERNEL);
+ if (!egm_chardev)
+ goto create_err;
+
+ device_initialize(&egm_chardev->device);
+
+ /*
+ * Use the proximity domain number as the device minor
+ * number. So the EGM corresponding to node X would be
+ * /dev/egmX.
+ */
+ egm_chardev->device.devt = MKDEV(MAJOR(dev), egm_dev->egmpxm);
+ egm_chardev->device.class = class;
+ egm_chardev->device.release = egm_chardev_release;
+ egm_chardev->device.parent = &egm_dev->aux_dev.dev;
+ cdev_init(&egm_chardev->cdev, &file_ops);
+ egm_chardev->cdev.owner = THIS_MODULE;
+
+ ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
+ if (ret)
+ goto error_exit;
+
+ ret = cdev_device_add(&egm_chardev->cdev, &egm_chardev->device);
+ if (ret)
+ goto error_exit;
+
+ return egm_chardev;
+
+error_exit:
+ kvfree(egm_chardev);
+create_err:
+ return NULL;
+}
+
+static void del_egm_chardev(struct chardev *egm_chardev)
+{
+ cdev_device_del(&egm_chardev->cdev, &egm_chardev->device);
+ put_device(&egm_chardev->device);
+}
static int egm_driver_probe(struct auxiliary_device *aux_dev,
const struct auxiliary_device_id *id)
{
+ struct nvgrace_egm_dev *egm_dev =
+ container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+ struct chardev *egm_chardev;
+
+ egm_chardev = setup_egm_chardev(egm_dev);
+ if (!egm_chardev)
+ return -EINVAL;
+
+ xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
+
return 0;
}
static void egm_driver_remove(struct auxiliary_device *aux_dev)
{
+ struct nvgrace_egm_dev *egm_dev =
+ container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+ struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
+
+ if (!egm_chardev)
+ return;
+
+ del_egm_chardev(egm_chardev);
}
static const struct auxiliary_device_id egm_id_table[] = {
--
2.34.1
* [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (7 preceding siblings ...)
2025-09-04 4:08 ` [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:36 ` Jason Gunthorpe
2025-09-04 4:08 ` [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
` (4 subsequent siblings)
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM module implements the mmap file_ops to manage the usermode
app's VMA mapping to the EGM region. The appropriate region is
determined from the minor number.
Note that the EGM memory region is invisible to the host kernel, as it
is not present in the host EFI map. The host Linux MM thus cannot manage
the memory, even though it is accessible through the host SPA. The EGM
module therefore uses remap_pfn_range() to map the VMA to the EGM region.
Suggested-by: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index c2dce5fa797a..7bf6a05aa967 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -17,19 +17,46 @@ struct chardev {
struct cdev cdev;
};
+static struct nvgrace_egm_dev *
+egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
+{
+ struct auxiliary_device *aux_dev =
+ container_of(egm_chardev->device.parent, struct auxiliary_device, dev);
+
+ return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
+}
+
static int nvgrace_egm_open(struct inode *inode, struct file *file)
{
+ struct chardev *egm_chardev =
+ container_of(inode->i_cdev, struct chardev, cdev);
+
+ file->private_data = egm_chardev;
+
return 0;
}
static int nvgrace_egm_release(struct inode *inode, struct file *file)
{
+ file->private_data = NULL;
+
return 0;
}
static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
{
- return 0;
+ struct chardev *egm_chardev = file->private_data;
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+
+ /*
+ * EGM memory is invisible to the host kernel and is not managed
+ * by it. Map the usermode VMA to the EGM region.
+ */
+ return remap_pfn_range(vma, vma->vm_start,
+ PHYS_PFN(egm_dev->egmphys),
+ (vma->vm_end - vma->vm_start),
+ vma->vm_page_prot);
}
static const struct file_operations file_ops = {
--
2.34.1
* [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (8 preceding siblings ...)
2025-09-04 4:08 ` [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:39 ` Jason Gunthorpe
2025-09-15 8:45 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
` (3 subsequent siblings)
13 siblings, 2 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The EGM region is invisible to the host Linux kernel, which does not
manage the region. The EGM module manages the EGM memory and is thus
responsible for clearing the region before handing it out to the VM.
Clear the EGM region on EGM chardev open. Tools such as kvmtool may
trigger open multiple times, so ensure the region is cleared only on
the first open.
Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 7bf6a05aa967..bf1241ed1d60 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -15,6 +15,7 @@ static DEFINE_XARRAY(egm_chardevs);
struct chardev {
struct device device;
struct cdev cdev;
+ atomic_t open_count;
};
static struct nvgrace_egm_dev *
@@ -30,6 +31,26 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
{
struct chardev *egm_chardev =
container_of(inode->i_cdev, struct chardev, cdev);
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+ void *memaddr;
+
+ if (atomic_inc_return(&egm_chardev->open_count) > 1)
+ return 0;
+
+ /*
+ * nvgrace-egm module is responsible to manage the EGM memory as
+ * the host kernel has no knowledge of it. Clear the region before
+ * handing over to userspace.
+ */
+ memaddr = memremap(egm_dev->egmphys, egm_dev->egmlength, MEMREMAP_WB);
+ if (!memaddr) {
+ atomic_dec(&egm_chardev->open_count);
+ return -EINVAL;
+ }
+
+ memset((u8 *)memaddr, 0, egm_dev->egmlength);
+ memunmap(memaddr);
file->private_data = egm_chardev;
@@ -38,7 +59,11 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
static int nvgrace_egm_release(struct inode *inode, struct file *file)
{
- file->private_data = NULL;
+ struct chardev *egm_chardev =
+ container_of(inode->i_cdev, struct chardev, cdev);
+
+ if (atomic_dec_and_test(&egm_chardev->open_count))
+ file->private_data = NULL;
return 0;
}
@@ -96,6 +121,7 @@ setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
egm_chardev->device.parent = &egm_dev->aux_dev.dev;
cdev_init(&egm_chardev->cdev, &file_ops);
egm_chardev->cdev.owner = THIS_MODULE;
+ atomic_set(&egm_chardev->open_count, 0);
ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
if (ret)
--
2.34.1
* [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (9 preceding siblings ...)
2025-09-04 4:08 ` [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
@ 2025-09-04 4:08 ` ankita
2025-09-15 9:21 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 12/14] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
` (2 subsequent siblings)
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Some system memory pages in the EGM region may have uncorrectable
ECC errors. A list of pages known to have such errors (referred to
as retired pages) is maintained by the Host UEFI, which populates
the list in a reserved region and communicates the SPA of that
region through an ACPI DSDT property.
The nvgrace-egm module is responsible for storing the list of retired
page offsets to be made available to usermode processes. The module:
1. Gets the reserved memory region SPA and maps it to fetch the list
of bad pages.
2. Calculates the retired page offsets in the EGM region and stores
them.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 81 ++++++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 32 ++++++++--
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 5 +-
drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++-
include/linux/nvgrace-egm.h | 2 +
5 files changed, 118 insertions(+), 10 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index bf1241ed1d60..7a026b4d98f7 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -8,6 +8,11 @@
#define MAX_EGM_NODES 4
+struct h_node {
+ unsigned long mem_offset;
+ struct hlist_node node;
+};
+
static dev_t dev;
static struct class *class;
static DEFINE_XARRAY(egm_chardevs);
@@ -16,6 +21,7 @@ struct chardev {
struct device device;
struct cdev cdev;
atomic_t open_count;
+ DECLARE_HASHTABLE(htbl, 0x10);
};
static struct nvgrace_egm_dev *
@@ -145,20 +151,86 @@ static void del_egm_chardev(struct chardev *egm_chardev)
put_device(&egm_chardev->device);
}
+static void cleanup_retired_pages(struct chardev *egm_chardev)
+{
+ struct h_node *cur_page;
+ unsigned long bkt;
+ struct hlist_node *temp_node;
+
+ hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page, node) {
+ hash_del(&cur_page->node);
+ kvfree(cur_page);
+ }
+}
+
+static int nvgrace_egm_fetch_retired_pages(struct nvgrace_egm_dev *egm_dev,
+ struct chardev *egm_chardev)
+{
+ u64 count;
+ void *memaddr;
+ int index, ret = 0;
+
+ memaddr = memremap(egm_dev->retiredpagesphys, PAGE_SIZE, MEMREMAP_WB);
+ if (!memaddr)
+ return -ENOMEM;
+
+ count = *(u64 *)memaddr;
+
+ for (index = 0; index < count; index++) {
+ struct h_node *retired_page;
+
+ /*
+ * Since the EGM is linearly mapped, the offset in the
+ * carveout is the same offset in the VM system memory.
+ *
+ * Calculate the offset to communicate to the usermode
+ * apps.
+ */
+ retired_page = kvzalloc(sizeof(*retired_page), GFP_KERNEL);
+ if (!retired_page) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ retired_page->mem_offset = *((u64 *)memaddr + index + 1) -
+ egm_dev->egmphys;
+ hash_add(egm_chardev->htbl, &retired_page->node,
+ retired_page->mem_offset);
+ }
+
+ memunmap(memaddr);
+
+ if (ret)
+ cleanup_retired_pages(egm_chardev);
+
+ return ret;
+}
+
static int egm_driver_probe(struct auxiliary_device *aux_dev,
const struct auxiliary_device_id *id)
{
struct nvgrace_egm_dev *egm_dev =
container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
struct chardev *egm_chardev;
+ int ret;
egm_chardev = setup_egm_chardev(egm_dev);
if (!egm_chardev)
return -EINVAL;
+ hash_init(egm_chardev->htbl);
+
+ ret = nvgrace_egm_fetch_retired_pages(egm_dev, egm_chardev);
+ if (ret)
+ goto error_exit;
+
xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev, GFP_KERNEL);
return 0;
+
+error_exit:
+ del_egm_chardev(egm_chardev);
+ return ret;
}
static void egm_driver_remove(struct auxiliary_device *aux_dev)
@@ -166,10 +238,19 @@ static void egm_driver_remove(struct auxiliary_device *aux_dev)
struct nvgrace_egm_dev *egm_dev =
container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
+ struct h_node *cur_page;
+ unsigned long bkt;
+ struct hlist_node *temp_node;
if (!egm_chardev)
return;
+ hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page, node) {
+ hash_del(&cur_page->node);
+ kvfree(cur_page);
+ }
+
+ cleanup_retired_pages(egm_chardev);
del_egm_chardev(egm_chardev);
}
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index ca50bc1f67a0..b8e143542bce 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -18,22 +18,41 @@ int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
}
int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
- u64 *pegmlength)
+ u64 *pegmlength, u64 *pretiredpagesphys)
{
int ret;
/*
- * The memory information is present in the system ACPI tables as DSD
- * properties nvidia,egm-base-pa and nvidia,egm-size.
+ * The EGM memory information is present in the system ACPI tables
+ * as DSD properties nvidia,egm-base-pa and nvidia,egm-size.
*/
ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
pegmlength);
if (ret)
- return ret;
+ goto error_exit;
ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
pegmphys);
+ if (ret)
+ goto error_exit;
+
+ /*
+ * SBIOS puts the list of retired pages on a region. The region
+ * SPA is exposed as "nvidia,egm-retired-pages-data-base".
+ */
+ ret = device_property_read_u64(&pdev->dev,
+ "nvidia,egm-retired-pages-data-base",
+ pretiredpagesphys);
+ if (ret)
+ goto error_exit;
+
+ /* Catch firmware bug and avoid a crash */
+ if (*pretiredpagesphys == 0) {
+ dev_err(&pdev->dev, "Retired pages region is not setup\n");
+ ret = -EINVAL;
+ }
+error_exit:
return ret;
}
@@ -74,7 +93,8 @@ static void nvgrace_gpu_release_aux_device(struct device *device)
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys, u64 egmlength, u64 egmpxm)
+ u64 egmphys, u64 egmlength, u64 egmpxm,
+ u64 retiredpagesphys)
{
struct nvgrace_egm_dev *egm_dev;
int ret;
@@ -86,6 +106,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
egm_dev->egmpxm = egmpxm;
egm_dev->egmphys = egmphys;
egm_dev->egmlength = egmlength;
+ egm_dev->retiredpagesphys = retiredpagesphys;
+
INIT_LIST_HEAD(&egm_dev->gpus);
egm_dev->aux_dev.id = egmpxm;
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
index 2e1612445898..2f329a05685d 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
@@ -16,8 +16,9 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
struct nvgrace_egm_dev *
nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
- u64 egmphys, u64 egmlength, u64 egmpxm);
+ u64 egmphys, u64 egmlength, u64 egmpxm,
+ u64 retiredpagesphys);
int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
- u64 *pegmlength);
+ u64 *pegmlength, u64 *pretiredpagesphys);
#endif /* EGM_DEV_H */
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index b1ccd1ac2e0a..534dc3ee6113 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -67,7 +67,7 @@ static struct list_head egm_dev_list;
static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
{
struct nvgrace_egm_dev_entry *egm_entry = NULL;
- u64 egmphys, egmlength, egmpxm;
+ u64 egmphys, egmlength, egmpxm, retiredpagesphys;
int ret = 0;
bool is_new_region = false;
@@ -80,7 +80,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
goto exit;
- ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength);
+ ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys, &egmlength,
+ &retiredpagesphys);
if (ret)
goto exit;
@@ -103,7 +104,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
egm_entry->egm_dev =
nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
- egmphys, egmlength, egmpxm);
+ egmphys, egmlength, egmpxm,
+ retiredpagesphys);
if (!egm_entry->egm_dev) {
ret = -EINVAL;
goto free_egm_entry;
diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
index a66906753267..197255c2a3b7 100644
--- a/include/linux/nvgrace-egm.h
+++ b/include/linux/nvgrace-egm.h
@@ -7,6 +7,7 @@
#define NVGRACE_EGM_H
#include <linux/auxiliary_bus.h>
+#include <linux/hashtable.h>
#define NVGRACE_EGM_DEV_NAME "egm"
@@ -19,6 +20,7 @@ struct nvgrace_egm_dev {
struct auxiliary_device aux_dev;
phys_addr_t egmphys;
size_t egmlength;
+ phys_addr_t retiredpagesphys;
u64 egmpxm;
struct list_head gpus;
};
--
2.34.1
* [RFC 12/14] vfio/nvgrace-egm: Introduce ioctl to share retired pages
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (10 preceding siblings ...)
2025-09-04 4:08 ` [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
@ 2025-09-04 4:08 ` ankita
2025-09-04 4:08 ` [RFC 13/14] vfio/nvgrace-egm: expose the egm size through sysfs ankita
2025-09-04 4:08 ` [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM ankita
13 siblings, 0 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
The nvgrace-egm module stores the list of retired page offsets to be
made available to usermode processes. Introduce an ioctl to share this
information with userspace.
The ioctl is called by usermode apps such as QEMU to get the retired
page offsets. The apps are expected to take appropriate action to
communicate the list to the VM.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
MAINTAINERS | 1 +
drivers/vfio/pci/nvgrace-gpu/egm.c | 67 ++++++++++++++++++++++++++++++
include/uapi/linux/egm.h | 26 ++++++++++++
3 files changed, 94 insertions(+)
create mode 100644 include/uapi/linux/egm.h
diff --git a/MAINTAINERS b/MAINTAINERS
index ec6bc10f346d..bd2d2d309d92 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26481,6 +26481,7 @@ M: Ankit Agrawal <ankita@nvidia.com>
L: kvm@vger.kernel.org
S: Supported
F: drivers/vfio/pci/nvgrace-gpu/egm.c
+F: include/uapi/linux/egm.h
VFIO PCI DEVICE SPECIFIC DRIVERS
R: Jason Gunthorpe <jgg@nvidia.com>
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 7a026b4d98f7..2cb100e39c4b 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -5,6 +5,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/nvgrace-egm.h>
+#include <linux/egm.h>
#define MAX_EGM_NODES 4
@@ -90,11 +91,77 @@ static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_page_prot);
}
+static long nvgrace_egm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ unsigned long minsz = offsetofend(struct egm_retired_pages_list, count);
+ struct egm_retired_pages_list info;
+ void __user *uarg = (void __user *)arg;
+ struct chardev *egm_chardev = file->private_data;
+
+ if (copy_from_user(&info, uarg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || !egm_chardev)
+ return -EINVAL;
+
+ switch (cmd) {
+ case EGM_RETIRED_PAGES_LIST:
+ int ret;
+ unsigned long retired_page_struct_size = sizeof(struct egm_retired_pages_info);
+ struct egm_retired_pages_info tmp;
+ struct h_node *cur_page;
+ struct hlist_node *tmp_node;
+ unsigned long bkt;
+ int count = 0, index = 0;
+
+ hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node)
+ count++;
+
+ if (info.argsz < (minsz + count * retired_page_struct_size)) {
+ info.argsz = minsz + count * retired_page_struct_size;
+ info.count = 0;
+ goto done;
+ } else {
+ hash_for_each_safe(egm_chardev->htbl, bkt, tmp_node, cur_page, node) {
+ /*
+ * This check fails if there was an ECC error
+ * after the usermode app read the count of
+ * bad pages through this ioctl.
+ */
+ if (minsz + index * retired_page_struct_size >= info.argsz) {
+ info.argsz = minsz + index * retired_page_struct_size;
+ info.count = index;
+ goto done;
+ }
+
+ tmp.offset = cur_page->mem_offset;
+ tmp.size = PAGE_SIZE;
+
+ ret = copy_to_user(uarg + minsz +
+ index * retired_page_struct_size,
+ &tmp, retired_page_struct_size);
+ if (ret)
+ return -EFAULT;
+ index++;
+ }
+
+ info.count = index;
+ }
+ break;
+ default:
+ return -EINVAL;
+ }
+
+done:
+ return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
+}
+
static const struct file_operations file_ops = {
.owner = THIS_MODULE,
.open = nvgrace_egm_open,
.release = nvgrace_egm_release,
.mmap = nvgrace_egm_mmap,
+ .unlocked_ioctl = nvgrace_egm_ioctl,
};
static void egm_chardev_release(struct device *dev)
diff --git a/include/uapi/linux/egm.h b/include/uapi/linux/egm.h
new file mode 100644
index 000000000000..d157fbb5e305
--- /dev/null
+++ b/include/uapi/linux/egm.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef _UAPIEGM_H
+#define _UAPIEGM_H
+
+#define EGM_TYPE ('E')
+
+struct egm_retired_pages_info {
+ __aligned_u64 offset;
+ __aligned_u64 size;
+};
+
+struct egm_retired_pages_list {
+ __u32 argsz;
+ /* out */
+ __u32 count;
+ /* out */
+ struct egm_retired_pages_info retired_pages[];
+};
+
+#define EGM_RETIRED_PAGES_LIST _IO(EGM_TYPE, 100)
+
+#endif /* _UAPIEGM_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 13/14] vfio/nvgrace-egm: expose the egm size through sysfs
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (11 preceding siblings ...)
2025-09-04 4:08 ` [RFC 12/14] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
@ 2025-09-04 4:08 ` ankita
2025-09-04 4:08 ` [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM ankita
13 siblings, 0 replies; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
To allocate the EGM, userspace needs to know its size. Currently,
there is no easy way for userspace to determine that.
Make nvgrace-egm expose the size through sysfs so that it can be
queried by userspace from <aux_dev_path>/egm_size.
On a 2-socket, 4 GPU Grace Blackwell setup, it shows up as:
Socket0:
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4/egm/egm4/egm_size
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4/egm/egm4/egm_size
Socket1:
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5/egm/egm5/egm_size
/sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5/egm/egm5/egm_size
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
index 2cb100e39c4b..346607eeb0f9 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
@@ -343,6 +343,32 @@ static char *egm_devnode(const struct device *device, umode_t *mode)
return NULL;
}
+static ssize_t egm_size_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct chardev *egm_chardev = container_of(dev, struct chardev, device);
+ struct nvgrace_egm_dev *egm_dev =
+ egm_chardev_to_nvgrace_egm_dev(egm_chardev);
+
+ return sysfs_emit(buf, "0x%lx\n", egm_dev->egmlength);
+}
+
+static DEVICE_ATTR_RO(egm_size);
+
+static struct attribute *attrs[] = {
+ &dev_attr_egm_size.attr,
+ NULL,
+};
+
+static struct attribute_group attr_group = {
+ .attrs = attrs,
+};
+
+static const struct attribute_group *attr_groups[2] = {
+ &attr_group,
+ NULL
+};
+
static int __init nvgrace_egm_init(void)
{
int ret;
@@ -364,6 +390,7 @@ static int __init nvgrace_egm_init(void)
}
class->devnode = egm_devnode;
+ class->dev_groups = attr_groups;
ret = auxiliary_driver_register(&egm_driver);
if (!ret)
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
` (12 preceding siblings ...)
2025-09-04 4:08 ` [RFC 13/14] vfio/nvgrace-egm: expose the egm size through sysfs ankita
@ 2025-09-04 4:08 ` ankita
2025-09-05 13:42 ` Jason Gunthorpe
13 siblings, 1 reply; 27+ messages in thread
From: ankita @ 2025-09-04 4:08 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, skolothumtho, kevin.tian,
yi.l.liu, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
To replicate the host EGM topology in the VM in terms of
GPU affinity, userspace needs to be aware of which
GPUs belong to the same socket as the EGM region.
Expose the list of GPUs associated with an EGM region
through sysfs. The list can be queried from the auxiliary
device path.
On a 2-socket, 4 GPU Grace Blackwell setup, it shows up as the following:
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
pointing to egm4.
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
/sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
pointing to egm5.
Moreover
/sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
/sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
lists links to both the 0008:01:00.0 & 0009:01:00.0 GPU devices.
and
/sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
/sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
lists links to both the 0018:01:00.0 & 0019:01:00.0 GPU devices.
Suggested-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 42 +++++++++++++++++++++++++-
1 file changed, 41 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
index b8e143542bce..20e9213aa0ac 100644
--- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
+++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
@@ -56,6 +56,36 @@ int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
return ret;
}
+static int create_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
+ struct pci_dev *pdev)
+{
+ int ret_l1, ret_l2;
+
+ ret_l1 = sysfs_create_link_nowarn(&pdev->dev.kobj,
+ &egm_dev->aux_dev.dev.kobj,
+ dev_name(&egm_dev->aux_dev.dev));
+
+ /*
+ * Allow if Link already exists - created since GPU is the auxiliary
+ * device's parent; flag the error otherwise.
+ */
+ if (ret_l1 && ret_l1 != -EEXIST)
+ return ret_l1;
+
+ ret_l2 = sysfs_create_link(&egm_dev->aux_dev.dev.kobj,
+ &pdev->dev.kobj,
+ dev_name(&pdev->dev));
+
+ /*
+ * Remove the aux dev link only if it wasn't already present.
+ */
+ if (ret_l2 && !ret_l1)
+ sysfs_remove_link(&pdev->dev.kobj,
+ dev_name(&egm_dev->aux_dev.dev));
+
+ return ret_l2;
+}
+
int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
{
struct gpu_node *node;
@@ -68,7 +98,16 @@ int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
list_add_tail(&node->list, &egm_dev->gpus);
- return 0;
+ return create_egm_symlinks(egm_dev, pdev);
+}
+
+static void remove_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
+ struct pci_dev *pdev)
+{
+ sysfs_remove_link(&pdev->dev.kobj,
+ dev_name(&egm_dev->aux_dev.dev));
+ sysfs_remove_link(&egm_dev->aux_dev.dev.kobj,
+ dev_name(&pdev->dev));
}
void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
@@ -77,6 +116,7 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
if (node->pdev == pdev) {
+ remove_egm_symlinks(egm_dev, pdev);
list_del(&node->list);
kvfree(node);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM
2025-09-04 4:08 ` [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM ankita
@ 2025-09-05 13:26 ` Jason Gunthorpe
2025-09-15 7:47 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:26 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:19AM +0000, ankita@nvidia.com wrote:
> @@ -1125,3 +1125,4 @@ MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> MODULE_DESCRIPTION("VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
> +MODULE_SOFTDEP("pre: nvgrace-egm");
There shouldn't be softdeps; automatic struct device based probing
should be sufficient.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops
2025-09-04 4:08 ` [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops ankita
@ 2025-09-05 13:31 ` Jason Gunthorpe
0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:31 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:21AM +0000, ankita@nvidia.com wrote:
> +static const struct auxiliary_device_id egm_id_table[] = {
> + { .name = "nvgrace_gpu_vfio_pci.egm" },
> + { .name = "nvidia_vgpu_vfio.egm" },
Not in tree
> static char *egm_devnode(const struct device *device, umode_t *mode)
> {
> if (mode)
> @@ -35,19 +59,28 @@ static int __init nvgrace_egm_init(void)
>
> class = class_create(NVGRACE_EGM_DEV_NAME);
> if (IS_ERR(class)) {
> - unregister_chrdev_region(dev, MAX_EGM_NODES);
> - return PTR_ERR(class);
> + ret = PTR_ERR(class);
> + goto unregister_chrdev;
> }
>
> class->devnode = egm_devnode;
>
> - return 0;
> + ret = auxiliary_driver_register(&egm_driver);
> + if (!ret)
> + goto fn_exit;
> +
> + class_destroy(class);
> +unregister_chrdev:
> + unregister_chrdev_region(dev, MAX_EGM_NODES);
> +fn_exit:
> + return ret;
> }
>
> static void __exit nvgrace_egm_cleanup(void)
> {
> class_destroy(class);
> unregister_chrdev_region(dev, MAX_EGM_NODES);
> + auxiliary_driver_unregister(&egm_driver);
> }
Out of order; the order should be the reverse of init. This will UAF
the class as-is.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device
2025-09-04 4:08 ` [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device ankita
@ 2025-09-05 13:34 ` Jason Gunthorpe
2025-09-15 8:36 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:34 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:22AM +0000, ankita@nvidia.com wrote:
> +static struct chardev *
> +setup_egm_chardev(struct nvgrace_egm_dev *egm_dev)
> +{
> + struct chardev *egm_chardev;
> + int ret;
> +
> + egm_chardev = kvzalloc(sizeof(*egm_chardev), GFP_KERNEL);
> + if (!egm_chardev)
> + goto create_err;
> +
> + device_initialize(&egm_chardev->device);
> +
> + /*
> + * Use the proximity domain number as the device minor
> + * number. So the EGM corresponding to node X would be
> + * /dev/egmX.
> + */
> + egm_chardev->device.devt = MKDEV(MAJOR(dev), egm_dev->egmpxm);
> + egm_chardev->device.class = class;
> + egm_chardev->device.release = egm_chardev_release;
> + egm_chardev->device.parent = &egm_dev->aux_dev.dev;
> + cdev_init(&egm_chardev->cdev, &file_ops);
> + egm_chardev->cdev.owner = THIS_MODULE;
> +
> + ret = dev_set_name(&egm_chardev->device, "egm%lld", egm_dev->egmpxm);
> + if (ret)
> + goto error_exit;
> +
> + ret = cdev_device_add(&egm_chardev->cdev, &egm_chardev->device);
> + if (ret)
> + goto error_exit;
> +
> + return egm_chardev;
> +
> +error_exit:
> + kvfree(egm_chardev);
After calling device_initialize() you have to use put_device(), not kvfree().
Why kvzalloc anyhow? struct chardev is not big.
> static void egm_driver_remove(struct auxiliary_device *aux_dev)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev->egmpxm);
> +
> + if (!egm_chardev)
> + return;
> +
> + del_egm_chardev(egm_chardev);
> }
This proceeds even if files are left open, which is not going to be any
good.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management
2025-09-04 4:08 ` [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
@ 2025-09-05 13:36 ` Jason Gunthorpe
0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:36 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:23AM +0000, ankita@nvidia.com wrote:
> static int nvgrace_egm_mmap(struct file *file, struct vm_area_struct *vma)
> {
> - return 0;
> + struct chardev *egm_chardev = file->private_data;
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> +
> + /*
> + * EGM memory is invisible to the host kernel and is not managed
> + * by it. Map the usermode VMA to the EGM region.
> + */
> + return remap_pfn_range(vma, vma->vm_start,
> + PHYS_PFN(egm_dev->egmphys),
> + (vma->vm_end - vma->vm_start),
> + vma->vm_page_prot);
This needs to handle vm_pgoff and sanity check end - start!!
It should also reject !MAP_SHARED
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM
2025-09-04 4:08 ` [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
@ 2025-09-05 13:39 ` Jason Gunthorpe
2025-09-15 8:45 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:39 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:24AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The EGM region is invisible to the host Linux kernel and it does not
> manage the region. The EGM module manages the EGM memory and thus is
> responsible to clear out the region before handing out to the VM.
>
> Clear EGM region on EGM chardev open. It is possible to trigger open
> multiple times by tools such as kvmtool. Thus ensure the region is
> cleared only on the first open.
It would be cleaner not to support multi-open. Why is kvmtool doing
this?
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM
2025-09-04 4:08 ` [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM ankita
@ 2025-09-05 13:42 ` Jason Gunthorpe
0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2025-09-05 13:42 UTC (permalink / raw)
To: ankita
Cc: alex.williamson, yishaih, skolothumtho, kevin.tian, yi.l.liu,
zhiw, aniketa, cjia, kwankhede, targupta, vsethi, acurrid,
apopple, jhubbard, danw, anuaggarwal, mochs, kjaju, dnigam, kvm,
linux-kernel
On Thu, Sep 04, 2025 at 04:08:28AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> To replicate the host EGM topology in the VM in terms of
> the GPU affinity, the userspace need to be aware of which
> GPUs belong to the same socket as the EGM region.
>
> Expose the list of GPUs associated with an EGM region
> through sysfs. The list can be queried from the auxiliary
> device path.
>
> On a 2-socket, 4 GPU Grace Blackwell setup, it shows up as the following:
> /sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> pointing to egm4.
>
> /sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> /sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> pointing to egm5.
>
> Moreover
> /sys/devices/pci0008:00/0008:00:00.0/0008:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0/nvgrace_gpu_vfio_pci.egm.4
> lists links to both the 0008:01:00.0 & 0009:01:00.0 GPU devices.
>
> and
> /sys/devices/pci0018:00/0018:00:00.0/0018:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> /sys/devices/pci0019:00/0019:00:00.0/0019:01:00.0/nvgrace_gpu_vfio_pci.egm.5
> lists links to both the 0018:01:00.0 & 0019:01:00.0.
This seems backwards. I would rather the EGM chardev itself have a
directory of links to the PCI devices than have EGM manipulate the
sysfs belonging to some other driver and subsystem.
Jason
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM
2025-09-04 4:08 ` [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
@ 2025-09-15 6:56 ` Shameer Kolothum
0 siblings, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 6:56 UTC (permalink / raw)
To: Ankit Agrawal, Jason Gunthorpe, alex.williamson@redhat.com,
Yishai Hadas, kevin.tian@intel.com, yi.l.liu@intel.com, Zhi Wang
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs, Krishnakant Jaju,
Dheeraj Nigam, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Hi Ankit,
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The Extended GPU Memory (EGM) feature enables GPU access to
> system memory across sockets and physical systems on the
> Grace Hopper and Grace Blackwell systems. When the feature is
> enabled through the SBIOS, part of the system memory is made
> available to the GPU for access through the EGM path.
>
> The EGM functionality is separate and largely independent from the
> core GPU device functionality. However, the EGM region information
> of base SPA and size is associated with the GPU in the ACPI tables.
> An architecture with EGM represented as an auxiliary device suits
> this context well.
>
> The parent GPU device creates an EGM auxiliary device to be managed
> independently by an auxiliary EGM driver. The EGM region information
> is kept as part of the shared struct nvgrace_egm_dev along with the
> auxiliary device handle.
>
> Each socket has a separate EGM region, and hence a multi-socket
> system has multiple EGM regions. Each EGM region has a separate
> nvgrace_egm_dev, and nvgrace-gpu keeps the EGM regions in a list.
>
> Note that EGM is an optional feature enabled through SBIOS. The EGM
> properties are only populated in ACPI tables if the feature is enabled;
> they are absent otherwise. The absence of the properties is thus not
> considered fatal. The presence of an improper set of values, however,
> is considered fatal.
>
> It is also noteworthy that there may be multiple GPUs present per
> socket, each carrying duplicate EGM region information. Make sure
> the duplicate data does not get added.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 5 +-
> drivers/vfio/pci/nvgrace-gpu/Makefile | 2 +-
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 ++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 +++++++
> drivers/vfio/pci/nvgrace-gpu/main.c | 70 +++++++++++++++++++++++++-
> include/linux/nvgrace-egm.h | 23 +++++++++
> 6 files changed, 175 insertions(+), 3 deletions(-)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> create mode 100644 include/linux/nvgrace-egm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 6dcfbd11efef..dd7df834b70b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26471,7 +26471,10 @@ VFIO NVIDIA GRACE GPU DRIVER
> M: Ankit Agrawal <ankita@nvidia.com>
> L: kvm@vger.kernel.org
> S: Supported
> -F: drivers/vfio/pci/nvgrace-gpu/
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +F: drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +F: drivers/vfio/pci/nvgrace-gpu/main.c
> +F: include/linux/nvgrace-egm.h
>
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-
> gpu/Makefile
> index 3ca8c187897a..e72cc6739ef8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> -nvgrace-gpu-vfio-pci-y := main.o
> +nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-
> gpu/egm_dev.c
> new file mode 100644
> index 000000000000..f4e27dadf1ef
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include <linux/vfio_pci_core.h>
> +#include "egm_dev.h"
> +
> +/*
> + * Determine if the EGM feature is enabled. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail.
> + */
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> +{
> + return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
> + pegmpxm);
> +}
> +
> +static void nvgrace_gpu_release_aux_device(struct device *device)
> +{
> + struct auxiliary_device *aux_dev = container_of(device, struct
> auxiliary_device, dev);
> + struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct
> nvgrace_egm_dev, aux_dev);
> +
> + kvfree(egm_dev);
> +}
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmpxm)
> +{
> + struct nvgrace_egm_dev *egm_dev;
> + int ret;
> +
> + egm_dev = kvzalloc(sizeof(*egm_dev), GFP_KERNEL);
Do you really need this kvzalloc variant here? Looks like kzalloc will do.
> + if (!egm_dev)
> + goto create_err;
> +
> + egm_dev->egmpxm = egmpxm;
> + egm_dev->aux_dev.id = egmpxm;
> + egm_dev->aux_dev.name = name;
> + egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> + egm_dev->aux_dev.dev.parent = &pdev->dev;
> +
> + ret = auxiliary_device_init(&egm_dev->aux_dev);
> + if (ret)
> + goto free_dev;
> +
> + ret = auxiliary_device_add(&egm_dev->aux_dev);
> + if (ret) {
> + auxiliary_device_uninit(&egm_dev->aux_dev);
> + goto create_err;
Should be free_dev to free the allocated memory.
> + }
> +
> + return egm_dev;
> +
> +free_dev:
> + kvfree(egm_dev);
> +create_err:
> + return NULL;
> +}
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-
> gpu/egm_dev.h
> new file mode 100644
> index 000000000000..c00f5288f4e7
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#ifndef EGM_DEV_H
> +#define EGM_DEV_H
> +
> +#include <linux/nvgrace-egm.h>
> +
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> + u64 egmphys);
> +
> +#endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 72e7ac1fa309..2cf851492990 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -7,6 +7,8 @@
> #include <linux/vfio_pci_core.h>
> #include <linux/delay.h>
> #include <linux/jiffies.h>
> +#include <linux/nvgrace-egm.h>
> +#include "egm_dev.h"
>
> /*
> * The device memory usable to the workloads running in the VM is cached
> @@ -60,6 +62,63 @@ struct nvgrace_gpu_pci_core_device {
> bool has_mig_hw_bug;
> };
>
> +static struct list_head egm_dev_list;
It's not clear to me why this list is a global. Isn't the EGM per device?
A comment to explain this would be useful. Also, do you need a lock
to protect it?
> +
> +static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> +{
> + struct nvgrace_egm_dev_entry *egm_entry;
> + u64 egmpxm;
> + int ret = 0;
> +
> + /*
> + * EGM is an optional feature enabled in SBIOS. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail. Treat this failure as non-fatal and return
> + * early.
> + */
> + if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> + goto exit;
> +
> + egm_entry = kvzalloc(sizeof(*egm_entry), GFP_KERNEL);
kzalloc() good enough?
Thanks,
Shameer
> + if (!egm_entry)
> + return -ENOMEM;
> +
> + egm_entry->egm_dev =
> + nvgrace_gpu_create_aux_device(pdev,
> NVGRACE_EGM_DEV_NAME,
> + egmpxm);
> + if (!egm_entry->egm_dev) {
> + kvfree(egm_entry);
> + ret = -EINVAL;
> + goto exit;
> + }
> +
> + list_add_tail(&egm_entry->list, &egm_dev_list);
> +
> +exit:
> + return ret;
> +}
> +
> +static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
> +{
> + struct nvgrace_egm_dev_entry *egm_entry, *temp_egm_entry;
> + u64 egmpxm;
> +
> + if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> + return;
> +
> + list_for_each_entry_safe(egm_entry, temp_egm_entry,
> &egm_dev_list, list) {
> + /*
> + * Free the EGM region corresponding to the input GPU
> + * device.
> + */
> + if (egm_entry->egm_dev->egmpxm == egmpxm) {
> + auxiliary_device_destroy(&egm_entry->egm_dev-
> >aux_dev);
> + list_del(&egm_entry->list);
> + kvfree(egm_entry);
> + }
> + }
> +}
> +
> static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device
> *core_vdev)
> {
> struct nvgrace_gpu_pci_core_device *nvdev =
> @@ -965,14 +1024,20 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
> memphys, memlength);
> if (ret)
> goto out_put_vdev;
> +
> + ret = nvgrace_gpu_create_egm_aux_device(pdev);
> + if (ret)
> + goto out_put_vdev;
> }
>
> ret = vfio_pci_core_register_device(&nvdev->core_device);
> if (ret)
> - goto out_put_vdev;
> + goto out_reg;
>
> return ret;
>
> +out_reg:
> + nvgrace_gpu_destroy_egm_aux_device(pdev);
> out_put_vdev:
> vfio_put_device(&nvdev->core_device.vdev);
> return ret;
> @@ -982,6 +1047,7 @@ static void nvgrace_gpu_remove(struct pci_dev
> *pdev)
> {
> struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev-
> >dev);
>
> + nvgrace_gpu_destroy_egm_aux_device(pdev);
> vfio_pci_core_unregister_device(core_device);
> vfio_put_device(&core_device->vdev);
> }
> @@ -1011,6 +1077,8 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver
> = {
>
> static int __init nvgrace_gpu_vfio_pci_init(void)
> {
> + INIT_LIST_HEAD(&egm_dev_list);
> +
> return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
> }
> module_init(nvgrace_gpu_vfio_pci_init);
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> new file mode 100644
> index 000000000000..9575d4ad4338
> --- /dev/null
> +++ b/include/linux/nvgrace-egm.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#ifndef NVGRACE_EGM_H
> +#define NVGRACE_EGM_H
> +
> +#include <linux/auxiliary_bus.h>
> +
> +#define NVGRACE_EGM_DEV_NAME "egm"
> +
> +struct nvgrace_egm_dev {
> + struct auxiliary_device aux_dev;
> + u64 egmpxm;
> +};
> +
> +struct nvgrace_egm_dev_entry {
> + struct list_head list;
> + struct nvgrace_egm_dev *egm_dev;
> +};
> +
> +#endif /* NVGRACE_EGM_H */
> --
> 2.34.1
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions
2025-09-04 4:08 ` [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
@ 2025-09-15 7:19 ` Shameer Kolothum
0 siblings, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 7:19 UTC (permalink / raw)
To: Ankit Agrawal, Jason Gunthorpe, alex.williamson@redhat.com,
Yishai Hadas, kevin.tian@intel.com, yi.l.liu@intel.com, Zhi Wang
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs, Krishnakant Jaju,
Dheeraj Nigam, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 04 September 2025 05:08
> To: Ankit Agrawal <ankita@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>;
> alex.williamson@redhat.com; Yishai Hadas <yishaih@nvidia.com>; Shameer
> Kolothum <skolothumtho@nvidia.com>; kevin.tian@intel.com;
> yi.l.liu@intel.com; Zhi Wang <zhiw@nvidia.com>
> Cc: Aniket Agashe <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti
> Wankhede <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU)
> <targupta@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Andy Currid
> <acurrid@nvidia.com>; Alistair Popple <apopple@nvidia.com>; John Hubbard
> <jhubbard@nvidia.com>; Dan Williams <danw@nvidia.com>; Anuj Aggarwal
> (SW-GPU) <anuaggarwal@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> Krishnakant Jaju <kjaju@nvidia.com>; Dheeraj Nigam <dnigam@nvidia.com>;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM
> regions
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Grace Blackwell systems can have multiple GPUs on a socket, each
> associated with the EGM region for that socket. Track these GPUs
> in a list.
>
> On the device probe, the device pci_dev struct is added to a
> linked list of the appropriate EGM region.
>
> Similarly on device remove, the pci_dev struct for the GPU
> is removed from the EGM region.
>
> Since the GPUs on a socket share the same EGM region, they have
> the same set of EGM region information. Skip the EGM region
> information fetch if it was already done through a different GPU
> on the same socket.
Ok. This is probably why you are keeping egm_dev_list as global.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 29 ++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 4 +++
> drivers/vfio/pci/nvgrace-gpu/main.c | 34 +++++++++++++++++++++++---
> include/linux/nvgrace-egm.h | 6 +++++
> 4 files changed, 70 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index f4e27dadf1ef..28cfd29eda56 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -17,6 +17,33 @@ int nvgrace_gpu_has_egm_property(struct pci_dev
> *pdev, u64 *pegmpxm)
> pegmpxm);
> }
>
> +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> +{
> + struct gpu_node *node;
> +
> + node = kvzalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> +
> + node->pdev = pdev;
> +
> + list_add_tail(&node->list, &egm_dev->gpus);
> +
> + return 0;
> +}
> +
> +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> +{
> + struct gpu_node *node, *tmp;
> +
> + list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
> + if (node->pdev == pdev) {
> + list_del(&node->list);
> + kvfree(node);
> + }
> + }
> +}
> +
> static void nvgrace_gpu_release_aux_device(struct device *device)
> {
> struct auxiliary_device *aux_dev = container_of(device, struct
> auxiliary_device, dev);
> @@ -37,6 +64,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev,
> const char *name,
> goto create_err;
>
> egm_dev->egmpxm = egmpxm;
> + INIT_LIST_HEAD(&egm_dev->gpus);
> +
> egm_dev->aux_dev.id = egmpxm;
> egm_dev->aux_dev.name = name;
> egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index c00f5288f4e7..1635753c9e50 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -10,6 +10,10 @@
>
> int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
>
> +int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> +
> +void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev);
> +
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> u64 egmphys);
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 2cf851492990..436f0ac17332 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -66,9 +66,10 @@ static struct list_head egm_dev_list;
>
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> - struct nvgrace_egm_dev_entry *egm_entry;
> + struct nvgrace_egm_dev_entry *egm_entry = NULL;
> u64 egmpxm;
> int ret = 0;
> + bool is_new_region = false;
>
> /*
> * EGM is an optional feature enabled in SBIOS. If disabled, there
> @@ -79,6 +80,19 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> + list_for_each_entry(egm_entry, &egm_dev_list, list) {
> + /*
> + * A system could have multiple GPUs associated with an
> + * EGM region and will have the same set of EGM region
> + * information. Skip the EGM region information fetch if
> +	 * already done through a different GPU on the same socket.
> + */
> + if (egm_entry->egm_dev->egmpxm == egmpxm)
> + goto add_gpu;
> + }
> +
> + is_new_region = true;
> +
> egm_entry = kvzalloc(sizeof(*egm_entry), GFP_KERNEL);
> if (!egm_entry)
> return -ENOMEM;
> @@ -87,13 +101,23 @@ static int
> nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> nvgrace_gpu_create_aux_device(pdev,
> NVGRACE_EGM_DEV_NAME,
> egmpxm);
> if (!egm_entry->egm_dev) {
> - kvfree(egm_entry);
> ret = -EINVAL;
> + goto free_egm_entry;
> + }
> +
> +add_gpu:
> + ret = add_gpu(egm_entry->egm_dev, pdev);
> + if (!ret) {
> + if (is_new_region)
> + list_add_tail(&egm_entry->list, &egm_dev_list);
> goto exit;
> }
>
> - list_add_tail(&egm_entry->list, &egm_dev_list);
> + if (is_new_region)
> + auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
Maybe it is easier to read if you flip the ret check above.
Something like below,
add_gpu:
ret = add_gpu(egm_entry->egm_dev, pdev);
if (ret) {
goto free_dev;
}
if (is_new_region)
list_add_tail(&egm_entry->list, &egm_dev_list);
return 0;
free_dev:
if (is_new_region)
auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
....
Thanks,
Shameer
>
> +free_egm_entry:
> + kvfree(egm_entry);
> exit:
> return ret;
> }
> @@ -112,6 +136,10 @@ static void
> nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
> * device.
> */
> if (egm_entry->egm_dev->egmpxm == egmpxm) {
> + remove_gpu(egm_entry->egm_dev, pdev);
> + if (!list_empty(&egm_entry->egm_dev->gpus))
> + break;
> +
> auxiliary_device_destroy(&egm_entry->egm_dev-
> >aux_dev);
> list_del(&egm_entry->list);
> kvfree(egm_entry);
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> index 9575d4ad4338..e42494a2b1a6 100644
> --- a/include/linux/nvgrace-egm.h
> +++ b/include/linux/nvgrace-egm.h
> @@ -10,9 +10,15 @@
>
> #define NVGRACE_EGM_DEV_NAME "egm"
>
> +struct gpu_node {
> + struct list_head list;
> + struct pci_dev *pdev;
> +};
> +
> struct nvgrace_egm_dev {
> struct auxiliary_device aux_dev;
> u64 egmpxm;
> + struct list_head gpus;
> };
>
> struct nvgrace_egm_dev_entry {
> --
> 2.34.1
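The per-socket sharing described in this patch can be modeled in a few
lines of Python; the names below (EgmRegion, probe_gpu, remove_gpu) are
illustrative stand-ins for the kernel code, not its API. A region is
created on the first probe for a proximity domain, reused by later GPUs
on the same socket, and torn down only when its last GPU is removed:

```python
class EgmRegion:
    """Stand-in for the per-socket nvgrace_egm_dev."""
    def __init__(self, pxm):
        self.pxm = pxm
        self.gpus = []      # mirrors egm_dev->gpus

regions = {}  # pxm -> EgmRegion, mirrors the global egm_dev_list

def probe_gpu(pxm, gpu):
    # Reuse the region if another GPU on the same socket created it.
    region = regions.get(pxm)
    if region is None:
        region = EgmRegion(pxm)
        regions[pxm] = region
    region.gpus.append(gpu)
    return region

def remove_gpu(pxm, gpu):
    region = regions[pxm]
    region.gpus.remove(gpu)
    # Destroy the region only when no GPU references it.
    if not region.gpus:
        del regions[pxm]

r = probe_gpu(0, "gpu0")
assert probe_gpu(0, "gpu1") is r    # same socket -> same region
remove_gpu(0, "gpu0")
assert 0 in regions                  # still referenced by gpu1
remove_gpu(0, "gpu1")
assert 0 not in regions              # last GPU gone -> region destroyed
```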
* RE: [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM
2025-09-04 4:08 ` [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM ankita
2025-09-05 13:26 ` Jason Gunthorpe
@ 2025-09-15 7:47 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 7:47 UTC (permalink / raw)
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> The Extended GPU Memory (EGM) feature enables the GPU to access
> system memory allocations within and across nodes through a high
> bandwidth path on Grace based systems. The GPU can utilize system
> memory located on the same socket, on a different socket, or even
> on a different node in a multi-node system [1].
>
> When the EGM mode is enabled through SBIOS, the host system memory is
> partitioned into 2 parts: One partition for the Host OS usage
> called Hypervisor region, and a second Hypervisor-Invisible (HI) region
> for the VM. Only the hypervisor region is part of the host EFI map
> and is thus visible to the host OS on bootup. Since the entire VM
> sysmem is eligible for EGM allocations within the VM, the HI partition
> is interchangeably called the EGM region in this series. The base
> SPA and size of this HI/EGM region are exposed through the ACPI
> DSDT properties.
>
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
>
> The following figure shows the memory map in the virtualization
> environment.
>
> |---- Sysmem ----| |--- GPU mem ---| VM Memory
> | | | |
> |IPA <-> SPA map | |IPA <-> SPA map|
> | | | |
> |--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
>
> Introduce a new nvgrace-egm auxiliary driver module to manage and
> map the HI/EGM region in the Grace Blackwell systems. This binds to
> the auxiliary device created by the parent nvgrace-gpu (in-tree
> module for device assignment) / nvidia-vgpu-vfio (out-of-tree open
> source module for SRIOV vGPU) to manage the EGM region for the VM.
> Note that there is a unique EGM region per socket and the auxiliary
> device gets created for every region. The parent module fetches the
> EGM region information from the ACPI tables and populates the data
> structures shared with the auxiliary nvgrace-egm module.
>
> nvgrace-egm module handles the following:
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the parent device shared EGM region structure.
> 2. Create a char device that can be used as memory-backend-file by Qemu
> for the VM and implement file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM region fetched in step 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
> 5. Clean up state and destroy the chardev on device unbind.
> 6. Handle presence of retired ECC pages on the EGM region.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> MAINTAINERS | 6 ++++++
> drivers/vfio/pci/nvgrace-gpu/Kconfig | 11 +++++++++++
> drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +++
> drivers/vfio/pci/nvgrace-gpu/egm.c | 22 ++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/main.c | 1 +
> 5 files changed, 43 insertions(+)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dd7df834b70b..ec6bc10f346d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -26476,6 +26476,12 @@ F: drivers/vfio/pci/nvgrace-
> gpu/egm_dev.h
> F: drivers/vfio/pci/nvgrace-gpu/main.c
> F: include/linux/nvgrace-egm.h
>
> +VFIO NVIDIA GRACE EGM DRIVER
> +M: Ankit Agrawal <ankita@nvidia.com>
> +L: kvm@vger.kernel.org
> +S: Supported
> +F: drivers/vfio/pci/nvgrace-gpu/egm.c
> +
> VFIO PCI DEVICE SPECIFIC DRIVERS
> R: Jason Gunthorpe <jgg@nvidia.com>
> R: Yishai Hadas <yishaih@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-
> gpu/Kconfig
> index a7f624b37e41..d5773bbd22f5 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Kconfig
> +++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> @@ -1,8 +1,19 @@
> # SPDX-License-Identifier: GPL-2.0-only
> +config NVGRACE_EGM
> +	tristate "EGM driver for NVIDIA Grace Hopper and Blackwell Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
Should it depend on NVGRACE_GPU_VFIO_PCI as well?
Thanks,
Shameer
> + help
> +	  Extended GPU Memory (EGM) support for the GPU in NVIDIA Grace
> +	  based chips, required to make the CPU memory available as
> +	  additional cross-node/cross-socket memory for the GPU using
> +	  KVM/qemu.
> +
> + If you don't know what to do here, say N.
> +
> config NVGRACE_GPU_VFIO_PCI
> tristate "VFIO support for the GPU in the NVIDIA Grace Hopper
> Superchip"
> depends on ARM64 || (COMPILE_TEST && 64BIT)
> select VFIO_PCI_CORE
> + select NVGRACE_EGM
> help
> VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> required to assign the GPU device to userspace using
> KVM/qemu/etc.
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-
> gpu/Makefile
> index e72cc6739ef8..d0d191be56b9 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,6 @@
> # SPDX-License-Identifier: GPL-2.0-only
> obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> +
> +obj-$(CONFIG_NVGRACE_EGM) += nvgrace-egm.o
> +nvgrace-egm-y := egm.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-
> gpu/egm.c
> new file mode 100644
> index 000000000000..999808807019
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved
> + */
> +
> +#include <linux/vfio_pci_core.h>
> +
> +static int __init nvgrace_egm_init(void)
> +{
> + return 0;
> +}
> +
> +static void __exit nvgrace_egm_cleanup(void)
> +{
> +}
> +
> +module_init(nvgrace_egm_init);
> +module_exit(nvgrace_egm_cleanup);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> +MODULE_DESCRIPTION("NVGRACE EGM - Module to support Extended GPU
> Memory on NVIDIA Grace Based systems");
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index 7486a1b49275..b1ccd1ac2e0a 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -1125,3 +1125,4 @@ MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Ankit Agrawal <ankita@nvidia.com>");
> MODULE_AUTHOR("Aniket Agashe <aniketa@nvidia.com>");
> MODULE_DESCRIPTION("VFIO NVGRACE GPU PF - User Level driver for
> NVIDIA devices with CPU coherently accessible device memory");
> +MODULE_SOFTDEP("pre: nvgrace-egm");
> --
> 2.34.1
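For orientation, the commit message above says the char device is meant
to back VM memory via QEMU's memory-backend-file. A hypothetical
invocation might look like the following; the /dev/egm0 path and the
64G size are illustrative assumptions, not taken from the series:

```shell
# Back a NUMA node of the guest with the EGM char device (illustrative).
qemu-system-aarch64 \
    -object memory-backend-file,id=egm0,mem-path=/dev/egm0,size=64G,share=on \
    -numa node,memdev=egm0
```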
* RE: [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device
2025-09-04 4:08 ` [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device ankita
2025-09-05 13:34 ` Jason Gunthorpe
@ 2025-09-15 8:36 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 8:36 UTC (permalink / raw)
>
[...]
> static int egm_driver_probe(struct auxiliary_device *aux_dev,
> const struct auxiliary_device_id *id)
> {
> + struct nvgrace_egm_dev *egm_dev =
> + container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> + struct chardev *egm_chardev;
> +
> + egm_chardev = setup_egm_chardev(egm_dev);
> + if (!egm_chardev)
> + return -EINVAL;
> +
> + xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev,
> GFP_KERNEL);
Nit: maybe better to check the return value of xa_store here.
Thanks,
Shameer
* RE: [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM
2025-09-04 4:08 ` [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
2025-09-05 13:39 ` Jason Gunthorpe
@ 2025-09-15 8:45 ` Shameer Kolothum
1 sibling, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 8:45 UTC (permalink / raw)
>
[...]
> static struct nvgrace_egm_dev *
> @@ -30,6 +31,26 @@ static int nvgrace_egm_open(struct inode *inode,
> struct file *file)
> {
> struct chardev *egm_chardev =
> container_of(inode->i_cdev, struct chardev, cdev);
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + void *memaddr;
> +
> + if (atomic_inc_return(&egm_chardev->open_count) > 1)
> + return 0;
> +
> + /*
> + * nvgrace-egm module is responsible to manage the EGM memory as
> + * the host kernel has no knowledge of it. Clear the region before
> + * handing over to userspace.
> + */
> + memaddr = memremap(egm_dev->egmphys, egm_dev->egmlength,
> MEMREMAP_WB);
> + if (!memaddr) {
> + atomic_dec(&egm_chardev->open_count);
> + return -EINVAL;
Nit: maybe better to return -ENOMEM here.
Thanks,
Shameer
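The open() semantics in the quoted hunk (scrub the region only on the
transition from zero to one opens, tracked with an atomic count) can be
modeled with a small sketch; the class and field names are made up, and
a plain counter stands in for the kernel's atomic_inc_return():

```python
class EgmChardev:
    """Toy model of the chardev state in the patch."""
    def __init__(self, length):
        self.open_count = 0
        self.mem = bytearray(b"\xff" * length)  # stale host data

    def open(self):
        self.open_count += 1
        if self.open_count > 1:
            return          # not the first opener, nothing to do
        # First open: clear the region before userspace can map it,
        # since the host kernel never managed this memory.
        for i in range(len(self.mem)):
            self.mem[i] = 0

dev = EgmChardev(8)
dev.open()
assert dev.mem == bytearray(8)       # zeroed on first open
dev.mem[0] = 0xAB                    # the VM writes something
dev.open()
assert dev.mem[0] == 0xAB            # a second open does not re-clear
```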
* RE: [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list
2025-09-04 4:08 ` [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
@ 2025-09-15 9:21 ` Shameer Kolothum
0 siblings, 0 replies; 27+ messages in thread
From: Shameer Kolothum @ 2025-09-15 9:21 UTC (permalink / raw)
>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Some system memory pages in the EGM region may have been retired
> due to uncorrectable ECC errors. A list of pages known to have
> such errors (referred to as retired pages) is maintained by the
> Host UEFI. The Host UEFI populates this list in a reserved region
> and communicates the SPA of that region through an ACPI DSDT
> property.
>
> The nvgrace-egm module is responsible for storing the list of
> retired page offsets to be made available to usermode processes.
> The module:
> 1. Gets the reserved memory region SPA and maps it to fetch the
> list of bad pages.
> 2. Calculates the retired page offsets in the EGM and stores them.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 81 ++++++++++++++++++++++++++
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 32 ++++++++--
> drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 5 +-
> drivers/vfio/pci/nvgrace-gpu/main.c | 8 ++-
> include/linux/nvgrace-egm.h | 2 +
> 5 files changed, 118 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-
> gpu/egm.c
> index bf1241ed1d60..7a026b4d98f7 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -8,6 +8,11 @@
>
> #define MAX_EGM_NODES 4
>
> +struct h_node {
> + unsigned long mem_offset;
> + struct hlist_node node;
> +};
> +
> static dev_t dev;
> static struct class *class;
> static DEFINE_XARRAY(egm_chardevs);
> @@ -16,6 +21,7 @@ struct chardev {
> struct device device;
> struct cdev cdev;
> atomic_t open_count;
> + DECLARE_HASHTABLE(htbl, 0x10);
> };
>
> static struct nvgrace_egm_dev *
> @@ -145,20 +151,86 @@ static void del_egm_chardev(struct chardev
> *egm_chardev)
> put_device(&egm_chardev->device);
> }
>
> +static void cleanup_retired_pages(struct chardev *egm_chardev)
> +{
> + struct h_node *cur_page;
> + unsigned long bkt;
> + struct hlist_node *temp_node;
> +
> + hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page,
> node) {
> + hash_del(&cur_page->node);
> + kvfree(cur_page);
> + }
> +}
> +
> +static int nvgrace_egm_fetch_retired_pages(struct nvgrace_egm_dev
> *egm_dev,
> + struct chardev *egm_chardev)
> +{
> + u64 count;
> + void *memaddr;
> + int index, ret = 0;
> +
> + memaddr = memremap(egm_dev->retiredpagesphys, PAGE_SIZE,
> MEMREMAP_WB);
> + if (!memaddr)
> + return -ENOMEM;
> +
> + count = *(u64 *)memaddr;
> +
> + for (index = 0; index < count; index++) {
> + struct h_node *retired_page;
> +
> + /*
> + * Since the EGM is linearly mapped, the offset in the
> + * carveout is the same offset in the VM system memory.
> + *
> + * Calculate the offset to communicate to the usermode
> + * apps.
> + */
> + retired_page = kvzalloc(sizeof(*retired_page), GFP_KERNEL);
> + if (!retired_page) {
> + ret = -ENOMEM;
> + break;
> + }
> +
> + retired_page->mem_offset = *((u64 *)memaddr + index + 1) -
> + egm_dev->egmphys;
The mapping above is only PAGE_SIZE and there is no check on count, so
the access here could go beyond the mapped area. Please check.
> + hash_add(egm_chardev->htbl, &retired_page->node,
> + retired_page->mem_offset);
> + }
> +
> + memunmap(memaddr);
> +
> + if (ret)
> + cleanup_retired_pages(egm_chardev);
> +
> + return ret;
> +}
> +
> static int egm_driver_probe(struct auxiliary_device *aux_dev,
> const struct auxiliary_device_id *id)
> {
> struct nvgrace_egm_dev *egm_dev =
> container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> struct chardev *egm_chardev;
> + int ret;
>
> egm_chardev = setup_egm_chardev(egm_dev);
> if (!egm_chardev)
> return -EINVAL;
>
> + hash_init(egm_chardev->htbl);
> +
> + ret = nvgrace_egm_fetch_retired_pages(egm_dev, egm_chardev);
> + if (ret)
> + goto error_exit;
> +
> xa_store(&egm_chardevs, egm_dev->egmpxm, egm_chardev,
> GFP_KERNEL);
>
> return 0;
> +
> +error_exit:
> + del_egm_chardev(egm_chardev);
> + return ret;
> }
>
> static void egm_driver_remove(struct auxiliary_device *aux_dev)
> @@ -166,10 +238,19 @@ static void egm_driver_remove(struct
> auxiliary_device *aux_dev)
> struct nvgrace_egm_dev *egm_dev =
> container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> struct chardev *egm_chardev = xa_erase(&egm_chardevs, egm_dev-
> >egmpxm);
> + struct h_node *cur_page;
> + unsigned long bkt;
> + struct hlist_node *temp_node;
>
> if (!egm_chardev)
> return;
>
> + hash_for_each_safe(egm_chardev->htbl, bkt, temp_node, cur_page,
> node) {
> + hash_del(&cur_page->node);
> + kvfree(cur_page);
> + }
The above is not required, as the cleanup below does the same thing.
Also, do we really need a hash table here? Since this is just storing page info
and returning it via an IOCTL, a simple array or linked list would suffice.
Or is there any plan to use it later for lookups?
Thanks,
Shameer
> + cleanup_retired_pages(egm_chardev);
> del_egm_chardev(egm_chardev);
> }
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index ca50bc1f67a0..b8e143542bce 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -18,22 +18,41 @@ int nvgrace_gpu_has_egm_property(struct pci_dev
> *pdev, u64 *pegmpxm)
> }
>
> int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> - u64 *pegmlength)
> + u64 *pegmlength, u64 *pretiredpagesphys)
> {
> int ret;
>
> /*
> - * The memory information is present in the system ACPI tables as DSD
> - * properties nvidia,egm-base-pa and nvidia,egm-size.
> + * The EGM memory information is present in the system ACPI tables
> + * as DSD properties nvidia,egm-base-pa and nvidia,egm-size.
> */
> ret = device_property_read_u64(&pdev->dev, "nvidia,egm-size",
> pegmlength);
> if (ret)
> - return ret;
> + goto error_exit;
>
> ret = device_property_read_u64(&pdev->dev, "nvidia,egm-base-pa",
> pegmphys);
> + if (ret)
> + goto error_exit;
> +
> + /*
> + * SBIOS puts the list of retired pages on a region. The region
> + * SPA is exposed as "nvidia,egm-retired-pages-data-base".
> + */
> + ret = device_property_read_u64(&pdev->dev,
> + "nvidia,egm-retired-pages-data-base",
> + pretiredpagesphys);
> + if (ret)
> + goto error_exit;
> +
> + /* Catch firmware bug and avoid a crash */
> + if (*pretiredpagesphys == 0) {
> + dev_err(&pdev->dev, "Retired pages region is not setup\n");
> + ret = -EINVAL;
> + }
>
> +error_exit:
> return ret;
> }
>
> @@ -74,7 +93,8 @@ static void nvgrace_gpu_release_aux_device(struct
> device *device)
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys, u64 egmlength, u64 egmpxm)
> + u64 egmphys, u64 egmlength, u64 egmpxm,
> + u64 retiredpagesphys)
> {
> struct nvgrace_egm_dev *egm_dev;
> int ret;
> @@ -86,6 +106,8 @@ nvgrace_gpu_create_aux_device(struct pci_dev *pdev,
> const char *name,
> egm_dev->egmpxm = egmpxm;
> egm_dev->egmphys = egmphys;
> egm_dev->egmlength = egmlength;
> + egm_dev->retiredpagesphys = retiredpagesphys;
> +
> INIT_LIST_HEAD(&egm_dev->gpus);
>
> egm_dev->aux_dev.id = egmpxm;
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> index 2e1612445898..2f329a05685d 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -16,8 +16,9 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev,
> struct pci_dev *pdev);
>
> struct nvgrace_egm_dev *
> nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> - u64 egmphys, u64 egmlength, u64 egmpxm);
> + u64 egmphys, u64 egmlength, u64 egmpxm,
> + u64 retiredpagesphys);
>
> int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> - u64 *pegmlength);
> + u64 *pegmlength, u64 *pretiredpagesphys);
> #endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index b1ccd1ac2e0a..534dc3ee6113 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -67,7 +67,7 @@ static struct list_head egm_dev_list;
> static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> {
> struct nvgrace_egm_dev_entry *egm_entry = NULL;
> - u64 egmphys, egmlength, egmpxm;
> + u64 egmphys, egmlength, egmpxm, retiredpagesphys;
> int ret = 0;
> bool is_new_region = false;
>
> @@ -80,7 +80,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> pci_dev *pdev)
> if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> goto exit;
>
> - ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys,
> &egmlength);
> + ret = nvgrace_gpu_fetch_egm_property(pdev, &egmphys,
> &egmlength,
> + &retiredpagesphys);
> if (ret)
> goto exit;
>
> @@ -103,7 +104,8 @@ static int nvgrace_gpu_create_egm_aux_device(struct
> pci_dev *pdev)
>
> egm_entry->egm_dev =
> nvgrace_gpu_create_aux_device(pdev,
> NVGRACE_EGM_DEV_NAME,
> - egmphys, egmlength, egmpxm);
> + egmphys, egmlength, egmpxm,
> + retiredpagesphys);
> if (!egm_entry->egm_dev) {
> ret = -EINVAL;
> goto free_egm_entry;
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> index a66906753267..197255c2a3b7 100644
> --- a/include/linux/nvgrace-egm.h
> +++ b/include/linux/nvgrace-egm.h
> @@ -7,6 +7,7 @@
> #define NVGRACE_EGM_H
>
> #include <linux/auxiliary_bus.h>
> +#include <linux/hashtable.h>
>
> #define NVGRACE_EGM_DEV_NAME "egm"
>
> @@ -19,6 +20,7 @@ struct nvgrace_egm_dev {
> struct auxiliary_device aux_dev;
> phys_addr_t egmphys;
> size_t egmlength;
> + phys_addr_t retiredpagesphys;
> u64 egmpxm;
> struct list_head gpus;
> };
> --
> 2.34.1
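The table layout the quoted code reads (a u64 count followed by `count`
u64 SPAs) and the bounds check requested in review can be modeled as
follows; the field layout comes from the quoted hunk, while the function
name and everything else are illustrative:

```python
import struct

def parse_retired_pages(blob, egmphys, mapped_len):
    # Layout per the quoted code: u64 count, then `count` u64 SPAs.
    (count,) = struct.unpack_from("<Q", blob, 0)
    # Bounds check suggested in review: the whole table must fit in
    # the mapped window (the patch maps only PAGE_SIZE bytes).
    if 8 + count * 8 > mapped_len:
        raise ValueError("retired page table exceeds mapped region")
    offsets = []
    for i in range(count):
        (spa,) = struct.unpack_from("<Q", blob, 8 + i * 8)
        # EGM is linearly mapped, so the offset in the carveout is
        # the same offset in VM system memory.
        offsets.append(spa - egmphys)
    return offsets

blob = struct.pack("<QQQ", 2, 0x10002000, 0x10005000)
assert parse_retired_pages(blob, 0x10000000, len(blob)) == [0x2000, 0x5000]
```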
end of thread, other threads:[~2025-09-15 9:21 UTC | newest]
Thread overview: 27+ messages
2025-09-04 4:08 [RFC 00/14] cover-letter: Add virtualization support for EGM ankita
2025-09-04 4:08 ` [RFC 01/14] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
2025-09-04 4:08 ` [RFC 02/14] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
2025-09-15 6:56 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 03/14] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
2025-09-15 7:19 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 04/14] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
2025-09-04 4:08 ` [RFC 05/14] vfio/nvgrace-egm: Introduce module to manage EGM ankita
2025-09-05 13:26 ` Jason Gunthorpe
2025-09-15 7:47 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 06/14] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
2025-09-04 4:08 ` [RFC 07/14] vfio/nvgrace-egm: Register auxiliary driver ops ankita
2025-09-05 13:31 ` Jason Gunthorpe
2025-09-04 4:08 ` [RFC 08/14] vfio/nvgrace-egm: Expose EGM region as char device ankita
2025-09-05 13:34 ` Jason Gunthorpe
2025-09-15 8:36 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 09/14] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
2025-09-05 13:36 ` Jason Gunthorpe
2025-09-04 4:08 ` [RFC 10/14] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
2025-09-05 13:39 ` Jason Gunthorpe
2025-09-15 8:45 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 11/14] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
2025-09-15 9:21 ` Shameer Kolothum
2025-09-04 4:08 ` [RFC 12/14] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
2025-09-04 4:08 ` [RFC 13/14] vfio/nvgrace-egm: expose the egm size through sysfs ankita
2025-09-04 4:08 ` [RFC 14/14] vfio/nvgrace-gpu: Add link from pci to EGM ankita
2025-09-05 13:42 ` Jason Gunthorpe