From: <ankita@nvidia.com>
To: <ankita@nvidia.com>, <jgg@nvidia.com>,
<alex.williamson@redhat.com>, <yishaih@nvidia.com>,
<skolothumtho@nvidia.com>, <kevin.tian@intel.com>,
<yi.l.liu@intel.com>, <zhiw@nvidia.com>
Cc: <aniketa@nvidia.com>, <cjia@nvidia.com>, <kwankhede@nvidia.com>,
<targupta@nvidia.com>, <vsethi@nvidia.com>, <acurrid@nvidia.com>,
<apopple@nvidia.com>, <jhubbard@nvidia.com>, <danw@nvidia.com>,
<anuaggarwal@nvidia.com>, <mochs@nvidia.com>, <kjaju@nvidia.com>,
<dnigam@nvidia.com>, <kvm@vger.kernel.org>,
<linux-kernel@vger.kernel.org>
Subject: [RFC 00/14] cover-letter: Add virtualization support for EGM
Date: Thu, 4 Sep 2025 04:08:14 +0000
Message-ID: <20250904040828.319452-1-ankita@nvidia.com>
From: Ankit Agrawal <ankita@nvidia.com>
Background
----------
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
feature, which enables the GPU to access system memory allocations
within and across nodes through a high-bandwidth path. This access path
goes: GPU <--> NVSwitch <--> GPU <--> CPU. The GPU can utilize system
memory located on the same socket, on a different socket, or even on a
different node in a multi-node system [1]. This feature is being
extended to virtualization.
Design Details
--------------
When EGM is enabled in the virtualization stack, the host memory
is partitioned into two parts: one partition for host OS usage,
called the Hypervisor region, and a second Hypervisor-Invisible (HI)
region for the VM. Only the Hypervisor region is part of the host EFI
map and is thus visible to the host OS on bootup. Since the entire VM
sysmem is eligible for EGM allocations within the VM, the HI partition
is referred to interchangeably as the EGM region in this series. The
base SPA and size of the HI/EGM region are exposed through ACPI DSDT
properties. While the EGM region is accessible on the host, it is not
added to the kernel. The HI region is assigned to a VM by mapping the
QEMU VMA to the SPA using remap_pfn_range().
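As a rough illustration (not the actual series code), the chardev mmap
handler could map the EGM physical range into the QEMU VMA along these
lines, where egm->base_pa and egm->size are assumed fields describing
the region:

```
/*
 * Hypothetical sketch of an mmap fop that maps the EGM SPA range
 * into the caller's (QEMU's) VMA with remap_pfn_range().
 */
static int egm_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct egm_region *egm = file->private_data;
	unsigned long len = vma->vm_end - vma->vm_start;

	/* Reject mappings that extend past the EGM region. */
	if (vma->vm_pgoff + PHYS_PFN(len) > PHYS_PFN(egm->size))
		return -EINVAL;

	return remap_pfn_range(vma, vma->vm_start,
			       PHYS_PFN(egm->base_pa) + vma->vm_pgoff,
			       len, vma->vm_page_prot);
}
```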
The following figure shows the memory map in the virtualization
environment.
    |---- Sysmem ----|                  |--- GPU mem ---|   VM Memory
    |                |                  |               |
    |IPA <-> SPA map |                  |IPA <-> SPA map|
    |                |                  |               |
    |--- HI / EGM ---|-- Host Mem --|   |--- GPU mem ---|   Host Memory
The patch series introduces a new nvgrace-egm auxiliary driver module
to manage and map the HI/EGM region on Grace Blackwell systems.
It binds to the auxiliary device created by the parent
nvgrace-gpu (in-tree module for device assignment) or nvidia-vgpu-vfio
(out-of-tree open source module for SR-IOV vGPU) module to manage the
EGM region for the VM. Note that there is a unique EGM region per
socket, and an auxiliary device is created for every region. The
parent module fetches the EGM region information from the ACPI
tables and populates the data structures shared with the auxiliary
nvgrace-egm module.
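For illustration only, the parent module's ACPI lookup could resemble
the following sketch; the property names here are hypothetical and not
necessarily those used by the series:

```
/*
 * Illustrative sketch: fetch the EGM base SPA and size exposed as
 * ACPI DSDT device properties on the parent device.
 */
static int egm_get_region(struct device *dev, u64 *base, u64 *size)
{
	int ret;

	ret = device_property_read_u64(dev, "nvidia,egm-base-pa", base);
	if (ret)
		return ret;

	return device_property_read_u64(dev, "nvidia,egm-size", size);
}
```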
The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain)
   from the EGM region structure shared by the parent device.
2. Create a char device that QEMU can use as a memory-backend-file for
   the VM, and implement its file operations. The char device is
   /dev/egmX, where X is the PXM node ID of the EGM region fetched in 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range().
5. Clean up state and destroy the chardev on device unbind.
6. Handle the presence of retired ECC pages in the EGM region.
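On the usermode side, QEMU could consume the chardev roughly as below.
This is a hypothetical invocation: the node ID, size, and the rest of
the command line are placeholders, not values mandated by the series:

```
# Back a VM NUMA node with the EGM chardev exported by nvgrace-egm
# (/dev/egm0 here; the X in /dev/egmX matches the PXM node ID).
qemu-system-aarch64 ... \
    -object memory-backend-file,id=egmmem0,mem-path=/dev/egm0,share=on,size=64G \
    -numa node,nodeid=0,memdev=egmmem0 \
    ...
```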
Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept
in the same directory.
Implementation
--------------
Patches 1-4 make changes to the nvgrace-gpu module to fetch the
EGM information, create the auxiliary device, and save the EGM region
information in the shared structures.
Patches 5-10 introduce the new nvgrace-egm module to manage the EGM
region. The module implements a char device to expose the EGM to
usermode apps such as QEMU, and maps the QEMU VMA to the EGM SPA
using remap_pfn_range().
Patches 11-12 fetch the list of pages on EGM with known ECC errors.
Patches 13-14 expose the EGM topology and size through sysfs.
Enablement
----------
EGM mode is enabled through a flag in the SBIOS. The size of the
Hypervisor region is configurable through a second SBIOS parameter.
All remaining system memory on the host is invisible to the
hypervisor.
Verification
------------
Applied over v6.17-rc4 and used with the QEMU repository [3]. Tested on
the Grace Blackwell platform by booting a VM, loading the NVIDIA module
[2], and running nvidia-smi in the VM to check for the presence of the
EGM capability. Running CUDA workloads depends on the Nested Page Table
patches being worked on separately by Shameer Kolothum
(skolothumtho@nvidia.com).
Recognitions
------------
Many thanks to Jason Gunthorpe, Vikram Sethi, and Aniket Agashe for
design suggestions, and to Matt Ochs, Andy Currid, Neo Jia, and Kirti
Wankhede, among others, for their review feedback.
Links
-----
Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [2]
Link: https://github.com/ankita-nv/nicolinc-qemu/tree/iommufd_veventq-v9-egm-0903 [3]
Ankit Agrawal (14):
vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init
vfio/nvgrace-gpu: Create auxiliary device for EGM
vfio/nvgrace-gpu: track GPUs associated with the EGM regions
vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info
vfio/nvgrace-egm: Introduce module to manage EGM
vfio/nvgrace-egm: Introduce egm class and register char device numbers
vfio/nvgrace-egm: Register auxiliary driver ops
vfio/nvgrace-egm: Expose EGM region as char device
vfio/nvgrace-egm: Add chardev ops for EGM management
vfio/nvgrace-egm: Clear Memory before handing out to VM
vfio/nvgrace-egm: Fetch EGM region retired pages list
vfio/nvgrace-egm: Introduce ioctl to share retired pages
vfio/nvgrace-egm: expose the egm size through sysfs
vfio/nvgrace-gpu: Add link from pci to EGM
MAINTAINERS | 12 +-
drivers/vfio/pci/nvgrace-gpu/Kconfig | 11 +
drivers/vfio/pci/nvgrace-gpu/Makefile | 5 +-
drivers/vfio/pci/nvgrace-gpu/egm.c | 418 +++++++++++++++++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 174 ++++++++++
drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 24 ++
drivers/vfio/pci/nvgrace-gpu/main.c | 117 ++++++-
include/linux/nvgrace-egm.h | 33 ++
include/uapi/linux/egm.h | 26 ++
9 files changed, 816 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
create mode 100644 include/linux/nvgrace-egm.h
create mode 100644 include/uapi/linux/egm.h
--
2.34.1