From: Alex Williamson <alex@shazbot.org>
To: <ankita@nvidia.com>
Cc: <vsethi@nvidia.com>, <jgg@nvidia.com>, <mochs@nvidia.com>,
<jgg@ziepe.ca>, <skolothumtho@nvidia.com>, <cjia@nvidia.com>,
<zhiw@nvidia.com>, <kjaju@nvidia.com>, <yishaih@nvidia.com>,
<kevin.tian@intel.com>, <kvm@vger.kernel.org>,
<linux-kernel@vger.kernel.org>,
<alex@shazbot.org>
Subject: Re: [PATCH RFC v2 15/15] vfio/nvgrace-egm: register EGM PFNMAP range with memory_failure
Date: Wed, 4 Mar 2026 16:48:36 -0700
Message-ID: <20260304164836.11ece0f5@shazbot.org>
In-Reply-To: <20260223155514.152435-16-ankita@nvidia.com>

On Mon, 23 Feb 2026 15:55:14 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> EGM carveout memory is mapped directly into userspace (QEMU) and is not
> added to the kernel. It is not managed by the kernel page allocator and
> has no struct pages. The module can thus utilize the Linux memory manager's
> memory_failure mechanism for regions with no struct pages. The Linux MM
> code exposes register/unregister APIs allowing modules to register such
> memory regions for memory_failure handling.
>
> Register the EGM PFN range with the MM memory_failure infrastructure on
> open, and unregister it on the last close. Provide a PFN-to-VMA offset
> callback that validates the PFN is within the EGM region and the VMA,
> then converts it to a file offset and records the poisoned offset in the
> existing hashtable for reporting to userspace.
So the idea is that we kill the process owning the VMA and add the page
to the hash such that the next user process avoids it, and this is what
encourages userspace to consume the bad page list?
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm.c | 100 +++++++++++++++++++++++++++++
> 1 file changed, 100 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm.c b/drivers/vfio/pci/nvgrace-gpu/egm.c
> index 2e4024c25e8a..5b60db6294a8 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm.c
> @@ -6,6 +6,7 @@
> #include <linux/vfio_pci_core.h>
> #include <linux/nvgrace-egm.h>
> #include <linux/egm.h>
> +#include <linux/memory-failure.h>
>
> #define MAX_EGM_NODES 4
>
> @@ -23,6 +24,7 @@ struct chardev {
> struct cdev cdev;
> atomic_t open_count;
> DECLARE_HASHTABLE(htbl, 0x10);
> + struct pfn_address_space pfn_address_space;
> };
>
> static struct nvgrace_egm_dev *
> @@ -34,6 +36,94 @@ egm_chardev_to_nvgrace_egm_dev(struct chardev *egm_chardev)
> return container_of(aux_dev, struct nvgrace_egm_dev, aux_dev);
> }
>
> +static int pfn_memregion_offset(struct chardev *egm_chardev,
> + unsigned long pfn,
> + pgoff_t *pfn_offset_in_region)
> +{
> + unsigned long start_pfn, num_pages;
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> +
> + start_pfn = PHYS_PFN(egm_dev->egmphys);
> + num_pages = egm_dev->egmlength >> PAGE_SHIFT;
> +
> + if (pfn < start_pfn || pfn >= start_pfn + num_pages)
> + return -EFAULT;
> +
> + *pfn_offset_in_region = pfn - start_pfn;
> +
> + return 0;
> +}
> +
> +static int track_ecc_offset(struct chardev *egm_chardev,
> + unsigned long mem_offset)
> +{
> + struct h_node *cur_page, *ecc_page;
> +
> + hash_for_each_possible(egm_chardev->htbl, cur_page, node, mem_offset) {
> + if (cur_page->mem_offset == mem_offset)
> + return 0;
> + }
> +
> + ecc_page = kzalloc(sizeof(*ecc_page), GFP_NOFS);
> + if (!ecc_page)
> + return -ENOMEM;
> +
> + ecc_page->mem_offset = mem_offset;
> +
> + hash_add(egm_chardev->htbl, &ecc_page->node, ecc_page->mem_offset);
> +
> + return 0;
> +}
How do concurrent faults work? There's no locking around the hash
table lookup and insertion.
> +
> +static int nvgrace_egm_pfn_to_vma_pgoff(struct vm_area_struct *vma,
> + unsigned long pfn,
> + pgoff_t *pgoff)
> +{
> + struct chardev *egm_chardev = vma->vm_file->private_data;
> + pgoff_t vma_offset_in_region = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> + pgoff_t pfn_offset_in_region;
> + int ret;
> +
> + ret = pfn_memregion_offset(egm_chardev, pfn, &pfn_offset_in_region);
> + if (ret)
> + return ret;
> +
> + /* Ensure PFN is not before VMA's start within the region */
> + if (pfn_offset_in_region < vma_offset_in_region)
> + return -EFAULT;
> +
> + /* Calculate offset from VMA start */
> + *pgoff = vma->vm_pgoff +
> + (pfn_offset_in_region - vma_offset_in_region);
> +
> + /* Track and save the poisoned offset */
> + return track_ecc_offset(egm_chardev, *pgoff << PAGE_SHIFT);
> +}
> +
> +static int
> +nvgrace_egm_vfio_pci_register_pfn_range(struct inode *inode,
> + struct chardev *egm_chardev)
What does this have to do with vfio-pci? It's not even a device
address space. Thanks,
Alex
> +{
> + struct nvgrace_egm_dev *egm_dev =
> + egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> + unsigned long pfn, nr_pages;
> + int ret;
> +
> + pfn = PHYS_PFN(egm_dev->egmphys);
> + nr_pages = egm_dev->egmlength >> PAGE_SHIFT;
> +
> + egm_chardev->pfn_address_space.node.start = pfn;
> + egm_chardev->pfn_address_space.node.last = pfn + nr_pages - 1;
> + egm_chardev->pfn_address_space.mapping = inode->i_mapping;
> + egm_chardev->pfn_address_space.pfn_to_vma_pgoff = nvgrace_egm_pfn_to_vma_pgoff;
> +
> + ret = register_pfn_address_space(&egm_chardev->pfn_address_space);
> +
> + return ret;
> +}
> +
> static int nvgrace_egm_open(struct inode *inode, struct file *file)
> {
> struct chardev *egm_chardev =
> @@ -41,6 +131,7 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
> struct nvgrace_egm_dev *egm_dev =
> egm_chardev_to_nvgrace_egm_dev(egm_chardev);
> void *memaddr;
> + int ret;
>
> if (atomic_cmpxchg(&egm_chardev->open_count, 0, 1) != 0)
> return -EBUSY;
> @@ -77,6 +168,13 @@ static int nvgrace_egm_open(struct inode *inode, struct file *file)
>
> file->private_data = egm_chardev;
>
> + ret = nvgrace_egm_vfio_pci_register_pfn_range(inode, egm_chardev);
> + if (ret && ret != -EOPNOTSUPP) {
> + file->private_data = NULL;
> + atomic_dec(&egm_chardev->open_count);
> + return ret;
> + }
> +
> return 0;
> }
>
> @@ -85,6 +183,8 @@ static int nvgrace_egm_release(struct inode *inode, struct file *file)
> struct chardev *egm_chardev =
> container_of(inode->i_cdev, struct chardev, cdev);
>
> + unregister_pfn_address_space(&egm_chardev->pfn_address_space);
> +
> file->private_data = NULL;
>
> atomic_dec(&egm_chardev->open_count);