From: Alex Williamson <alex@shazbot.org>
To: Ankit Agrawal <ankita@nvidia.com>, Jason Gunthorpe <jgg@nvidia.com>
Cc: Vikram Sethi <vsethi@nvidia.com>, Matt Ochs <mochs@nvidia.com>,
	"jgg@ziepe.ca" <jgg@ziepe.ca>,
	Shameer Kolothum Thodi <skolothumtho@nvidia.com>,
	Neo Jia <cjia@nvidia.com>, Zhi Wang <zhiw@nvidia.com>,
	Krishnakant Jaju <kjaju@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	"kevin.tian@intel.com" <kevin.tian@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	alex@shazbot.org
Subject: Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
Date: Wed, 11 Mar 2026 14:37:06 -0600	[thread overview]
Message-ID: <20260311143706.2095a547@shazbot.org> (raw)
In-Reply-To: <SA1PR12MB719904F9ED153FF4BA4EFA28B047A@SA1PR12MB7199.namprd12.prod.outlook.com>

On Wed, 11 Mar 2026 06:47:12 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> Thanks Alex for the review.
> 
> >> The patch series introduces a new nvgrace-egm auxiliary driver module
> >> to manage and map the HI/EGM region in Grace Blackwell systems.
> >> This binds to the auxiliary device created by the parent
> >> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> >> (out-of-tree open-source module for SR-IOV vGPU) to manage the
> >> EGM region for the VM. Note that there is a unique EGM region per
> >> socket and the auxiliary device gets created for every region. The
> >> parent module fetches the EGM region information from the ACPI
> >> tables and populates the data structures shared with the auxiliary
> >> nvgrace-egm module.
> >>
> >> The nvgrace-egm module handles the following:
> >> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> >> from the parent device's shared EGM region structure.
> >> 2. Create a char device that can be used as a memory-backend-file by QEMU
> >> for the VM and implement file operations. The char device is /dev/egmX,
> >> where X is the PXM node ID of the EGM region fetched in step 1.
> >> 3. Zero the EGM memory on first device open().
> >> 4. Map the QEMU VMA to the EGM region using remap_pfn_range().
> >> 5. Clean up state and destroy the chardev on device unbind.
> >> 6. Handle the presence of retired (poisoned) pages in the EGM region.
> >>
> >> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
> >> in the same directory.  
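[Editor's note: the chardev mmap described in items 2 and 4 of the list above
follows the standard remap_pfn_range() pattern. A minimal sketch, assuming a
hypothetical egm_region struct stashed in the file's private_data at open()
time; names here are illustrative, not the actual series code:]

```c
/*
 * Sketch of items 2 and 4 above: a chardev whose mmap() backs the QEMU
 * VMA with the EGM physical range via remap_pfn_range().  The egm_region
 * struct and its fields are illustrative placeholders.
 */
#include <linux/fs.h>
#include <linux/mm.h>

struct egm_region {
	phys_addr_t base;	/* EGM base HPA from the parent's ACPI info */
	size_t len;
};

static int egm_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct egm_region *egm = filp->private_data; /* set in open() */
	size_t size = vma->vm_end - vma->vm_start;

	/* Reject mappings that run past the end of the EGM carveout. */
	if (vma->vm_pgoff + (size >> PAGE_SHIFT) > (egm->len >> PAGE_SHIFT))
		return -EINVAL;

	/* Map the QEMU VMA straight onto the EGM physical range (item 4). */
	return remap_pfn_range(vma, vma->vm_start,
			       PHYS_PFN(egm->base) + vma->vm_pgoff,
			       size, vma->vm_page_prot);
}

static const struct file_operations egm_fops = {
	.owner = THIS_MODULE,
	.mmap = egm_mmap,
	/* open() would zero the region on first use (item 3). */
};
```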
> >
> >
> > Pondering this series for a bit, is this auxiliary chardev approach
> > really the model we should be pursuing?
> >
> > I know we're trying to disassociate the EGM region from the GPU, and
> > de-duplicate it between GPUs on the same socket, but is there actually a
> > use case of the EGM chardev separate from the GPU?  
> 
> It is not just de-duplication. The EGM is a carveout of system memory,
> logically and physically separate and disconnected from the GPU. The
> unique aspect here is that the information (SPA, size) of the region is
> present in the GPU ACPI tables.
> 
> >
> > The independent lifecycle of this aux device is troubling and it hasn't
> > been confirmed whether or not access to the EGM region has some
> >dependency on the state of the GPU.   
> 
> The EGM region is independent of the state of the GPU. One can plausibly
> boot up the VM with just the EGM memory chardev as the backing file and
> no GPU.

Seems like we have the wrong model then, basing the lifecycle of the
aux devices on the state of the PCI driver, if EGM is fully independent
of the state of the PCI device.
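[Editor's note: for context, the lifecycle coupling being questioned here
comes from the usual auxiliary-bus pattern, where the aux device is
registered from the PCI driver's probe() and torn down in remove(). A
sketch of that pattern, with illustrative names rather than the actual
series code:]

```c
/*
 * Illustrative sketch only -- the standard auxiliary-device pattern.
 * The aux device's lifetime is bounded by the parent PCI driver's
 * probe()/remove(), which is exactly the coupling being questioned.
 */
#include <linux/auxiliary_bus.h>
#include <linux/pci.h>
#include <linux/slab.h>

struct egm_aux_dev {
	struct auxiliary_device adev;
	/* EGM region info (base HPA, length, PXM) would live here */
};

static void egm_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct egm_aux_dev, adev.dev));
}

static int gpu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct egm_aux_dev *egm = kzalloc(sizeof(*egm), GFP_KERNEL);
	int ret;

	if (!egm)
		return -ENOMEM;

	egm->adev.name = "egm";		/* matched by the aux driver */
	egm->adev.dev.parent = &pdev->dev;
	egm->adev.dev.release = egm_adev_release;

	ret = auxiliary_device_init(&egm->adev);
	if (ret)
		return ret;

	ret = auxiliary_device_add(&egm->adev);	/* aux driver binds here */
	if (ret)
		auxiliary_device_uninit(&egm->adev);
	return ret;
}

/*
 * The matching remove() must auxiliary_device_delete()/uninit(), so
 * unbinding the PCI driver always tears down the EGM chardev -- even
 * though EGM itself does not depend on the GPU.
 */
```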

> > nvgrace-gpu is manipulating sysfs
> > on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> > aux device relative to the state of the GPU, we're trying to add a
> > driver that can bind to device created by an out-of-tree driver, and
> > we're inventing new uAPIs on the chardev for things that already exist
> > for vfio regions.  
> 
> Sorry for the confusion. nvgrace-egm would not bind to the device
> created by the out-of-tree driver. We would have a separate out-of-tree
> equivalent of nvgrace-egm to bind to the device created by the out-of-tree
> vfio driver. Maybe we can consider exposing register/unregister APIs from
> nvgrace-egm that a module (in-tree nvgrace-gpu / out-of-tree) can use to
> register a pdev, which nvgrace-egm then uses to fetch the region info.

Ok, this wasn't clear to me, but does that also mean that if some GPUs
are managed by nvgrace-gpu and others by out-of-tree drivers, the
in-kernel and out-of-tree equivalent drivers are both installing
chardevs as /dev/egmXX?  Playing in the same namespace is ugly, but what
happens when the two GPUs per socket are split between drivers and they
both try to add the same chardev?

> > Therefore, does it actually make more sense to expose EGM as a device
> > specific region on the vfio device fd?
> >
> > For example, nvgrace-gpu might manage the de-duplication by only
> > exposing this device specific region on the lowest BDF GPU per socket.
> > The existing REGION_INFO ioctl handles reporting the size to the user.
> > The direct association to the GPU device handles reporting the node
> > locality.  If necessary, a capability on the region could report the
> > associated PXM, and maybe even the retired page list.
> >
> > All of the lifecycle issues are automatically handled, there's no
> > separate aux device.  If necessary, zapping and faulting across reset
> > is handled just like a BAR mapping.  
> 
> The EGM memory (which becomes the system memory of the VM) cannot
> be tied to GPU reset as it is unrelated to the GPU device. We would
> not want that to happen to system memory on GPU reset.

It's not the state of the EGM/system memory that I'm concerned about,
it's the fact that the routing to access that memory traverses two
GPUs and both the backplane and C2C NVLink connections.  If access
through that channel is 100% independent of the state of either GPU,
then GPU resets are irrelevant.
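[Editor's note: the region-based alternative quoted above would be
discoverable through the existing vfio uAPI rather than a new chardev. A
rough userspace sketch; the device-specific region index and any NVIDIA
type/subtype values are hypothetical, since nothing is defined today:]

```c
/*
 * Userspace sketch of the alternative quoted above: EGM exposed as a
 * device-specific region on the vfio device fd, discovered with the
 * existing VFIO_DEVICE_GET_REGION_INFO ioctl.  The caller would find the
 * region index by walking the device's regions; that part is elided.
 */
#include <linux/vfio.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

void *map_egm_region(int device_fd, unsigned int index, size_t *len)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return NULL;

	/*
	 * Size, offset, and mmap-ability all come from REGION_INFO; a
	 * capability chain (info.cap_offset) could carry the PXM and the
	 * retired page list, per the suggestion quoted above.
	 */
	if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
		return NULL;

	*len = info.size;
	return mmap(NULL, info.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    device_fd, info.offset);
}
```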

However, I'd then ask why we're associating EGM with the GPU PCI driver
at all.  For instance, why should nvgrace-gpu spawn aux devices to feed
an nvgrace-egm driver, and duplicate that whole arrangement in an
out-of-tree driver, when we could just have one in-kernel platform(?)
driver walk ACPI, find these ranges, and expose them as chardevs
entirely independent of the PCI driver bound to the GPU?
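[Editor's note: a hedged sketch of what that single in-kernel driver could
look like.  The ACPI HID is invented for illustration; as noted later in
this mail, firmware would have to describe EGM as its own ACPI object for
this to work at all:]

```c
/*
 * Hypothetical sketch of the platform-driver alternative suggested
 * above.  "NVDA1234" is an invented HID; nothing like it exists today.
 */
#include <linux/acpi.h>
#include <linux/mod_devicetable.h>
#include <linux/module.h>
#include <linux/platform_device.h>

static int egm_plat_probe(struct platform_device *pdev)
{
	struct resource *res = platform_get_resource(pdev, IORESOURCE_MEM, 0);

	if (!res)
		return -ENODEV;

	/*
	 * res->start / resource_size(res) give the EGM base HPA and length
	 * straight from the ACPI description; the chardev would be created
	 * here, with no dependency on any GPU PCI driver.
	 */
	return 0;
}

static const struct acpi_device_id egm_acpi_ids[] = {
	{ "NVDA1234" },			/* invented HID, for illustration */
	{ }
};
MODULE_DEVICE_TABLE(acpi, egm_acpi_ids);

static struct platform_driver egm_plat_driver = {
	.probe = egm_plat_probe,
	.driver = {
		.name = "egm",
		.acpi_match_table = egm_acpi_ids,
	},
};
module_platform_driver(egm_plat_driver);
```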
 
> > If we need to expose the EGM size and GPU association via sysfs for
> > management tooling, nvgrace-gpu could add an "egm_size" attribute to the
> > PCI device's sysfs node.  This could also avoid the implicit
> >  implementation knowledge about which GPU exposes the EGM device
> > specific region.
> >
> > Was such a design considered?  It seems much, much simpler and could be
> > implemented by either nvgrace-gpu or identically by an out-of-tree
> > driver without references in an in-kernel ID table.
> >
> > I'd like to understand the pros and cons of such an approach vs the one
> > presented here.  Thanks,  
> 
> We didn't consider it as a separate BAR / region because the EGM memory (part of
> the system memory) is unrelated to the GPU device besides having its information
> in the GPU ACPI table, and it becomes the system memory of the VM. Treating
> it as part of the device BAR / region would tie the lifecycle of the EGM region
> to the GPU device. Also, we cannot consider zapping/faulting across GPU reset
> as it is system memory of the VM.

It's curious why the EGM description is associated with the GPU ACPI
object if it really is fully independent.  It seems like perhaps it
should be a unique ACPI object in that case, which would make claiming
it via a platform driver easier.  Maybe we don't need to be tied to
that firmware decision in the design of the software driver though.
Thanks,

Alex

