public inbox for linux-kernel@vger.kernel.org
From: Alex Williamson <alex@shazbot.org>
To: <ankita@nvidia.com>
Cc: <vsethi@nvidia.com>, <jgg@nvidia.com>, <mochs@nvidia.com>,
	<jgg@ziepe.ca>, <skolothumtho@nvidia.com>, <cjia@nvidia.com>,
	<zhiw@nvidia.com>, <kjaju@nvidia.com>, <yishaih@nvidia.com>,
	<kevin.tian@intel.com>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>,
	alex@shazbot.org
Subject: Re: [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM
Date: Tue, 3 Mar 2026 17:13:49 -0700	[thread overview]
Message-ID: <20260303171349.36be0589@shazbot.org> (raw)
In-Reply-To: <20260223155514.152435-3-ankita@nvidia.com>

On Mon, 23 Feb 2026 15:55:01 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The Extended GPU Memory (EGM) feature enables the GPU access to
> the system memory across sockets and physical systems on the
> Grace Hopper and Grace Blackwell systems. When the feature is
> enabled through SBIOS, part of the system memory is made available
> to the GPU for access through EGM path.
> 
> The EGM functionality is separate and largely independent from the

"largely independent" — what happens to accesses to the remote memory
through the GPU during reset?

In your KVM Forum presentation you show a remote CPU accessing EGM
memory through a local GPU, through the NVLink, through a remote GPU,
to the remote CPU memory.  Does this only work if all the GPUs in the
path are bound to nvgrace-gpu?

The ownership of these egm devices vs the vfio device seems dubious.

> core GPU device functionality. However, the EGM region information
> of base SPA and size is associated with the GPU on the ACPI tables.
> An architecture wih EGM represented as an auxiliary device suits well

s/wih/with/

> in this context.
> 
> The parent GPU device creates an EGM auxiliary device to be managed
> independently by an auxiliary EGM driver. The EGM region information
> is kept as part of the shared struct nvgrace_egm_dev along with the
> auxiliary device handle.
> 
> Each socket has a separate EGM region and hence a multi-socket system
> have multiple EGM regions. Each EGM region has a separate nvgrace_egm_dev
> and the nvgrace-gpu keeps the EGM regions as part of a list.
> 
> Note that EGM is an optional feature enabled through SBIOS. The EGM
> properties are only populated in ACPI tables if the feature is enabled;
> they are absent otherwise. The absence of the properties is thus not
> considered fatal. The presence of improper set of values however are
> considered fatal.
> 
> It is also noteworthy that there may also be multiple GPUs present per
> socket and have duplicate EGM region information with them. Make sure
> the duplicate data does not get added.

De-duplication isn't done until the next patch.

> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  MAINTAINERS                            |  5 +-
>  drivers/vfio/pci/nvgrace-gpu/Makefile  |  2 +-
>  drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 61 +++++++++++++++++++++
>  drivers/vfio/pci/nvgrace-gpu/egm_dev.h | 17 ++++++
>  drivers/vfio/pci/nvgrace-gpu/main.c    | 76 +++++++++++++++++++++++++-
>  include/linux/nvgrace-egm.h            | 23 ++++++++
>  6 files changed, 181 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.c
>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm_dev.h
>  create mode 100644 include/linux/nvgrace-egm.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 765ad2daa218..5b3d86de9ec0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -27379,7 +27379,10 @@ VFIO NVIDIA GRACE GPU DRIVER
>  M:	Ankit Agrawal <ankita@nvidia.com>
>  L:	kvm@vger.kernel.org
>  S:	Supported
> -F:	drivers/vfio/pci/nvgrace-gpu/
> +F:	drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +F:	drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> +F:	drivers/vfio/pci/nvgrace-gpu/main.c

This was better before; you own the sub-directory, so we don't need to
list each file, and doing so adds maintenance burden.

> +F:	include/linux/nvgrace-egm.h
>  
>  VFIO PCI DEVICE SPECIFIC DRIVERS
>  R:	Jason Gunthorpe <jgg@nvidia.com>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
> index 3ca8c187897a..e72cc6739ef8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/Makefile
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -1,3 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> -nvgrace-gpu-vfio-pci-y := main.o
> +nvgrace-gpu-vfio-pci-y := main.o egm_dev.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> new file mode 100644
> index 000000000000..faf658723f7a
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved

2026

> + */
> +
> +#include <linux/vfio_pci_core.h>
> +#include "egm_dev.h"
> +
> +/*
> + * Determine if the EGM feature is enabled. If disabled, there
> + * will be no EGM properties populated in the ACPI tables and this
> + * fetch would fail.
> + */
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm)
> +{
> +	return device_property_read_u64(&pdev->dev, "nvidia,egm-pxm",
> +					pegmpxm);
> +}
> +
> +static void nvgrace_gpu_release_aux_device(struct device *device)
> +{
> +	struct auxiliary_device *aux_dev = container_of(device, struct auxiliary_device, dev);
> +	struct nvgrace_egm_dev *egm_dev = container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> +
> +	kvfree(egm_dev);

This was allocated with kzalloc(); it should use kfree(), not kvfree().

> +}
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> +			      u64 egmpxm)
> +{
> +	struct nvgrace_egm_dev *egm_dev;
> +	int ret;
> +
> +	egm_dev = kzalloc(sizeof(*egm_dev), GFP_KERNEL);
> +	if (!egm_dev)
> +		goto create_err;
> +
> +	egm_dev->egmpxm = egmpxm;
> +	egm_dev->aux_dev.id = egmpxm;
> +	egm_dev->aux_dev.name = name;
> +	egm_dev->aux_dev.dev.release = nvgrace_gpu_release_aux_device;
> +	egm_dev->aux_dev.dev.parent = &pdev->dev;
> +
> +	ret = auxiliary_device_init(&egm_dev->aux_dev);
> +	if (ret)
> +		goto free_dev;
> +
> +	ret = auxiliary_device_add(&egm_dev->aux_dev);
> +	if (ret) {
> +		auxiliary_device_uninit(&egm_dev->aux_dev);
> +		goto free_dev;

There's a double free here, from auxiliary_device_init():

 * It returns 0 on success.  On success, the device_initialize has been
 * performed.  After this point any error unwinding will need to include a call
 * to auxiliary_device_uninit().  In this post-initialize error scenario, a call
 * to the device's .release callback will be triggered, and all memory clean-up
 * is expected to be handled there.


> +	}
> +
> +	return egm_dev;
> +
> +free_dev:
> +	kvfree(egm_dev);
> +create_err:
> +	return NULL;
> +}
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.h b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> new file mode 100644
> index 000000000000..c00f5288f4e7
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved

2026

> + */
> +
> +#ifndef EGM_DEV_H
> +#define EGM_DEV_H
> +
> +#include <linux/nvgrace-egm.h>
> +
> +int nvgrace_gpu_has_egm_property(struct pci_dev *pdev, u64 *pegmpxm);
> +
> +struct nvgrace_egm_dev *
> +nvgrace_gpu_create_aux_device(struct pci_dev *pdev, const char *name,
> +			      u64 egmphys);

egmpxm

> +
> +#endif /* EGM_DEV_H */
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 7c4d51f5c701..23028e6e7192 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -10,6 +10,8 @@
>  #include <linux/pci-p2pdma.h>
>  #include <linux/pm_runtime.h>
>  #include <linux/memory-failure.h>
> +#include <linux/nvgrace-egm.h>
> +#include "egm_dev.h"
>  
>  /*
>   * The device memory usable to the workloads running in the VM is cached
> @@ -66,6 +68,68 @@ struct nvgrace_gpu_pci_core_device {
>  	bool reset_done;
>  };
>  
> +/*
> + * Track egm device lists. Note that there is one device per socket.
> + * All the GPUs belonging to the same sockets are associated with
> + * the EGM device for that socket.
> + */
> +static struct list_head egm_dev_list;

As Shameer notes, this list needs locking to avoid corruption from
concurrent operations.  I'd also question why we're tracking this list
in the main code of the nvgrace-gpu driver rather than in the egm_dev
aux driver portion of the code.  It would be trivial to do
de-duplication in the create function if the list were over there.

> +
> +static int nvgrace_gpu_create_egm_aux_device(struct pci_dev *pdev)
> +{
> +	struct nvgrace_egm_dev_entry *egm_entry;
> +	u64 egmpxm;
> +	int ret = 0;
> +
> +	/*
> +	 * EGM is an optional feature enabled in SBIOS. If disabled, there
> +	 * will be no EGM properties populated in the ACPI tables and this
> +	 * fetch would fail. Treat this failure as non-fatal and return
> +	 * early.
> +	 */
> +	if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> +		goto exit;

return 0;

> +
> +	egm_entry = kzalloc(sizeof(*egm_entry), GFP_KERNEL);
> +	if (!egm_entry)
> +		return -ENOMEM;
> +
> +	egm_entry->egm_dev =
> +		nvgrace_gpu_create_aux_device(pdev, NVGRACE_EGM_DEV_NAME,
> +					      egmpxm);
> +	if (!egm_entry->egm_dev) {
> +		kvfree(egm_entry);

kzalloc() -> kfree()

> +		ret = -EINVAL;

Why doesn't the previous function return ERR_PTR() to propagate the
errno through rather than clobber it?  We don't really need the goto
here for now either.

struct nvgrace_egm_dev *egm_dev;

egm_dev = nvgrace_gpu_create...

if (IS_ERR(egm_dev)) {
	kfree(egm_entry);
	return PTR_ERR(egm_dev);
}

egm_entry->egm_dev = egm_dev;

> +		goto exit;
> +	}
> +
> +	list_add_tail(&egm_entry->list, &egm_dev_list);
> +
> +exit:

s/exit://

> +	return ret;

return 0;

> +}
> +
> +static void nvgrace_gpu_destroy_egm_aux_device(struct pci_dev *pdev)
> +{
> +	struct nvgrace_egm_dev_entry *egm_entry, *temp_egm_entry;
> +	u64 egmpxm;
> +
> +	if (nvgrace_gpu_has_egm_property(pdev, &egmpxm))
> +		return;
> +
> +	list_for_each_entry_safe(egm_entry, temp_egm_entry, &egm_dev_list, list) {
> +		/*
> +		 * Free the EGM region corresponding to the input GPU
> +		 * device.
> +		 */
> +		if (egm_entry->egm_dev->egmpxm == egmpxm) {
> +			auxiliary_device_destroy(&egm_entry->egm_dev->aux_dev);
> +			list_del(&egm_entry->list);
> +			kvfree(egm_entry);

kfree()

Why do we continue walking the list after we've found it?  Is this
because we don't yet do the de-duplication?

> +		}
> +	}
> +}
> +
>  static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
>  {
>  	struct nvgrace_gpu_pci_core_device *nvdev =
> @@ -1212,6 +1276,11 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  						    memphys, memlength);
>  		if (ret)
>  			goto out_put_vdev;
> +
> +		ret = nvgrace_gpu_create_egm_aux_device(pdev);
> +		if (ret)
> +			goto out_put_vdev;
> +
>  		nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
>  	} else {
>  		nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
> @@ -1219,10 +1288,12 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  
>  	ret = vfio_pci_core_register_device(&nvdev->core_device);
>  	if (ret)
> -		goto out_put_vdev;
> +		goto out_reg;
>  
>  	return ret;
>  
> +out_reg:
> +	nvgrace_gpu_destroy_egm_aux_device(pdev);
>  out_put_vdev:
>  	vfio_put_device(&nvdev->core_device.vdev);
>  	return ret;
> @@ -1232,6 +1303,7 @@ static void nvgrace_gpu_remove(struct pci_dev *pdev)
>  {
>  	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>  
> +	nvgrace_gpu_destroy_egm_aux_device(pdev);

I'm curious how this will handle the lifecycle issues if the device is
unbound from the nvgrace-gpu driver while the aux egm device is still
in use...

>  	vfio_pci_core_unregister_device(core_device);
>  	vfio_put_device(&core_device->vdev);
>  }
> @@ -1289,6 +1361,8 @@ static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
>  
>  static int __init nvgrace_gpu_vfio_pci_init(void)
>  {
> +	INIT_LIST_HEAD(&egm_dev_list);
> +
>  	return pci_register_driver(&nvgrace_gpu_vfio_pci_driver);
>  }
>  module_init(nvgrace_gpu_vfio_pci_init);
> diff --git a/include/linux/nvgrace-egm.h b/include/linux/nvgrace-egm.h
> new file mode 100644
> index 000000000000..9575d4ad4338
> --- /dev/null
> +++ b/include/linux/nvgrace-egm.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved

2026

> + */
> +
> +#ifndef NVGRACE_EGM_H
> +#define NVGRACE_EGM_H
> +
> +#include <linux/auxiliary_bus.h>
> +
> +#define NVGRACE_EGM_DEV_NAME "egm"
> +
> +struct nvgrace_egm_dev {
> +	struct auxiliary_device aux_dev;
> +	u64 egmpxm;
> +};
> +
> +struct nvgrace_egm_dev_entry {
> +	struct list_head list;
> +	struct nvgrace_egm_dev *egm_dev;
> +};

Looks like only nvgrace_egm_dev eventually requires a public header.
The list entry certainly doesn't need to be here.  Thanks,

Alex


Thread overview: 42+ messages
2026-02-23 15:54 [PATCH RFC v2 00/15] Add virtualization support for EGM ankita
2026-02-23 15:55 ` [PATCH RFC v2 01/15] vfio/nvgrace-gpu: Expand module_pci_driver to allow custom module init ankita
2026-02-23 15:55 ` [PATCH RFC v2 02/15] vfio/nvgrace-gpu: Create auxiliary device for EGM ankita
2026-02-26 14:28   ` Shameer Kolothum Thodi
2026-03-04  0:13   ` Alex Williamson [this message]
2026-02-23 15:55 ` [PATCH RFC v2 03/15] vfio/nvgrace-gpu: track GPUs associated with the EGM regions ankita
2026-02-26 14:55   ` Shameer Kolothum Thodi
2026-03-04 17:14     ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 04/15] vfio/nvgrace-gpu: Introduce functions to fetch and save EGM info ankita
2026-02-26 15:12   ` Shameer Kolothum Thodi
2026-03-04 17:37   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 05/15] vfio/nvgrace-egm: Introduce module to manage EGM ankita
2026-03-04 18:09   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 06/15] vfio/nvgrace-egm: Introduce egm class and register char device numbers ankita
2026-03-04 18:56   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 07/15] vfio/nvgrace-egm: Register auxiliary driver ops ankita
2026-03-04 19:06   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 08/15] vfio/nvgrace-egm: Expose EGM region as char device ankita
2026-02-26 17:08   ` Shameer Kolothum Thodi
2026-03-04 20:16   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 09/15] vfio/nvgrace-egm: Add chardev ops for EGM management ankita
2026-03-04 22:04   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 10/15] vfio/nvgrace-egm: Clear Memory before handing out to VM ankita
2026-02-26 18:15   ` Shameer Kolothum Thodi
2026-02-26 18:56     ` Jason Gunthorpe
2026-02-26 19:29       ` Shameer Kolothum Thodi
2026-03-04 22:14   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 11/15] vfio/nvgrace-egm: Fetch EGM region retired pages list ankita
2026-03-04 22:37   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 12/15] vfio/nvgrace-egm: Introduce ioctl to share retired pages ankita
2026-03-04 23:00   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 13/15] vfio/nvgrace-egm: expose the egm size through sysfs ankita
2026-03-04 23:22   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM ankita
2026-03-04 23:37   ` Alex Williamson
2026-02-23 15:55 ` [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure ankita
2026-03-04 23:48   ` Alex Williamson
2026-03-05 17:33 ` [PATCH RFC v2 00/15] Add virtualization support for EGM Alex Williamson
2026-03-11  6:47   ` Ankit Agrawal
2026-03-11 20:37     ` Alex Williamson
2026-03-12 13:51       ` Ankit Agrawal
2026-03-12 14:59         ` Alex Williamson