* [PATCH v6 1/6] vfio: export function to map the VMA
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 20:04 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
2025-11-25 17:30 ` [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
` (4 subsequent siblings)
5 siblings, 2 replies; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
Split the implementation that maps the VMA to the PTE/PMD/PUD
out into a separate function.
Export the function so that it can be used by the nvgrace-gpu module.
cc: Shameer Kolothum <skolothumtho@nvidia.com>
cc: Alex Williamson <alex@shazbot.org>
cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++------------
include/linux/vfio_pci_core.h | 3 ++
2 files changed, 34 insertions(+), 19 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..c445a53ee12e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1640,31 +1640,21 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
}
-static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
- unsigned int order)
+vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
+ struct vm_fault *vmf,
+ unsigned long pfn,
+ unsigned int order)
{
- struct vm_area_struct *vma = vmf->vma;
- struct vfio_pci_core_device *vdev = vma->vm_private_data;
- unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
- unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
- unsigned long pfn = vma_to_pfn(vma) + pgoff;
- vm_fault_t ret = VM_FAULT_SIGBUS;
+ vm_fault_t ret;
- if (order && (addr < vma->vm_start ||
- addr + (PAGE_SIZE << order) > vma->vm_end ||
- pfn & ((1 << order) - 1))) {
- ret = VM_FAULT_FALLBACK;
- goto out;
- }
-
- down_read(&vdev->memory_lock);
+ lockdep_assert_held_read(&vdev->memory_lock);
if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
- goto out_unlock;
+ return VM_FAULT_SIGBUS;
switch (order) {
case 0:
- ret = vmf_insert_pfn(vma, vmf->address, pfn);
+ ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
break;
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
case PMD_ORDER:
@@ -1680,7 +1670,29 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
ret = VM_FAULT_FALLBACK;
}
-out_unlock:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_vmf_insert_pfn);
+
+static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
+ unsigned int order)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_core_device *vdev = vma->vm_private_data;
+ unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
+ unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
+ unsigned long pfn = vma_to_pfn(vma) + pgoff;
+ vm_fault_t ret = VM_FAULT_SIGBUS;
+
+ if (order && (addr < vma->vm_start ||
+ addr + (PAGE_SIZE << order) > vma->vm_end ||
+ pfn & ((1 << order) - 1))) {
+ ret = VM_FAULT_FALLBACK;
+ goto out;
+ }
+
+ down_read(&vdev->memory_lock);
+ ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
up_read(&vdev->memory_lock);
out:
dev_dbg_ratelimited(&vdev->pdev->dev,
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index f541044e42a2..6f7c6c0d4278 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,9 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
size_t count, loff_t *ppos);
ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
size_t count, loff_t *ppos);
+vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
+ struct vm_fault *vmf, unsigned long pfn,
+ unsigned int order);
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
--
2.34.1
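[Editor's note] The fallback policy this patch carries over into vfio_pci_mmap_huge_fault() can be exercised in isolation. Below is a user-space sketch of the three conditions that force a PTE-level fallback; the types, constants, and the helper name `huge_fault_fits` are simplified stand-ins for illustration, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space sketch of the huge-fault alignment checks in
 * vfio_pci_mmap_huge_fault(): a mapping of the given order is only
 * attempted when the aligned faulting address lies fully inside the
 * VMA and the PFN is naturally aligned to the order.
 */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static bool huge_fault_fits(unsigned long fault_addr, unsigned long vm_start,
                            unsigned long vm_end, unsigned long base_pfn,
                            unsigned int order)
{
	/* Align the faulting address down to the mapping size. */
	unsigned long addr = fault_addr & ~((PAGE_SIZE << order) - 1);
	unsigned long pgoff = (addr - vm_start) >> PAGE_SHIFT;
	unsigned long pfn = base_pfn + pgoff;

	if (order && (addr < vm_start ||
		      addr + (PAGE_SIZE << order) > vm_end ||
		      pfn & ((1UL << order) - 1)))
		return false;	/* kernel returns VM_FAULT_FALLBACK here */

	return true;		/* safe to insert a PMD/PUD-sized mapping */
}
```

For example, a 2MB (order 9 on 4K pages) fault succeeds only when the 2MB-aligned address and its end both fall inside the VMA and the PFN is 512-page aligned; order-0 faults never fall back.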
^ permalink raw reply related	[flat|nested] 18+ messages in thread
* Re: [PATCH v6 1/6] vfio: export function to map the VMA
2025-11-25 17:30 ` [PATCH v6 1/6] vfio: export function to map the VMA ankita
@ 2025-11-25 20:04 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2025-11-25 20:04 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi,
mochs, Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas,
peterx, pstanner, apopple, kvm, linux-kernel, cjia, kwankhede,
targupta, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:08 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Split the implementation that maps the VMA to the PTE/PMD/PUD
> out into a separate function.
>
> Export the function so that it can be used by the nvgrace-gpu module.
>
This looks more like a refactor than a simple symbol export. Let's add:
No functional change is intended.
> cc: Shameer Kolothum <skolothumtho@nvidia.com>
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
Nit: I saw the "cc" tag is also used elsewhere in the git log, and I was
surprised that checkpatch.pl doesn't complain about it (I did test this).
In VFIO, people tend to use "Cc:", going by a search of the git log.
Let's keep using "Cc:" in VFIO.
> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++------------
> include/linux/vfio_pci_core.h | 3 ++
> 2 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..c445a53ee12e 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1640,31 +1640,21 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
> }
>
> -static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> - unsigned int order)
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
> + struct vm_fault *vmf,
> + unsigned long pfn,
> + unsigned int order)
> {
> - struct vm_area_struct *vma = vmf->vma;
> - struct vfio_pci_core_device *vdev = vma->vm_private_data;
> - unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> - unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
> - unsigned long pfn = vma_to_pfn(vma) + pgoff;
> - vm_fault_t ret = VM_FAULT_SIGBUS;
> + vm_fault_t ret;
>
> - if (order && (addr < vma->vm_start ||
> - addr + (PAGE_SIZE << order) > vma->vm_end ||
> - pfn & ((1 << order) - 1))) {
> - ret = VM_FAULT_FALLBACK;
> - goto out;
> - }
> -
> - down_read(&vdev->memory_lock);
> + lockdep_assert_held_read(&vdev->memory_lock);
>
> if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
> - goto out_unlock;
> + return VM_FAULT_SIGBUS;
>
> switch (order) {
> case 0:
> - ret = vmf_insert_pfn(vma, vmf->address, pfn);
> + ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> break;
> #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> case PMD_ORDER:
> @@ -1680,7 +1670,29 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> ret = VM_FAULT_FALLBACK;
> }
>
> -out_unlock:
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_pci_vmf_insert_pfn);
> +
> +static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> + unsigned int order)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct vfio_pci_core_device *vdev = vma->vm_private_data;
> + unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> + unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
> + unsigned long pfn = vma_to_pfn(vma) + pgoff;
> + vm_fault_t ret = VM_FAULT_SIGBUS;
> +
> + if (order && (addr < vma->vm_start ||
> + addr + (PAGE_SIZE << order) > vma->vm_end ||
> + pfn & ((1 << order) - 1))) {
> + ret = VM_FAULT_FALLBACK;
> + goto out;
> + }
> +
> + down_read(&vdev->memory_lock);
> + ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
> up_read(&vdev->memory_lock);
> out:
> dev_dbg_ratelimited(&vdev->pdev->dev,
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index f541044e42a2..6f7c6c0d4278 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -119,6 +119,9 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> size_t count, loff_t *ppos);
> ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> size_t count, loff_t *ppos);
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
> + struct vm_fault *vmf, unsigned long pfn,
> + unsigned int order);
> int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
> void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
> int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
* Re: [PATCH v6 1/6] vfio: export function to map the VMA
2025-11-25 17:30 ` [PATCH v6 1/6] vfio: export function to map the VMA ankita
2025-11-25 20:04 ` Zhi Wang
@ 2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2025-11-25 20:52 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:08 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Split the implementation that maps the VMA to the PTE/PMD/PUD
> out into a separate function.
>
> Export the function so that it can be used by the nvgrace-gpu module.
>
> cc: Shameer Kolothum <skolothumtho@nvidia.com>
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++------------
> include/linux/vfio_pci_core.h | 3 ++
> 2 files changed, 34 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..c445a53ee12e 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1640,31 +1640,21 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
> }
>
> -static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> - unsigned int order)
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
> + struct vm_fault *vmf,
> + unsigned long pfn,
> + unsigned int order)
> {
> - struct vm_area_struct *vma = vmf->vma;
> - struct vfio_pci_core_device *vdev = vma->vm_private_data;
> - unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> - unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
> - unsigned long pfn = vma_to_pfn(vma) + pgoff;
> - vm_fault_t ret = VM_FAULT_SIGBUS;
> + vm_fault_t ret;
>
> - if (order && (addr < vma->vm_start ||
> - addr + (PAGE_SIZE << order) > vma->vm_end ||
> - pfn & ((1 << order) - 1))) {
> - ret = VM_FAULT_FALLBACK;
> - goto out;
> - }
> -
> - down_read(&vdev->memory_lock);
> + lockdep_assert_held_read(&vdev->memory_lock);
>
> if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
> - goto out_unlock;
> + return VM_FAULT_SIGBUS;
>
> switch (order) {
> case 0:
> - ret = vmf_insert_pfn(vma, vmf->address, pfn);
> + ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> break;
> #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> case PMD_ORDER:
> @@ -1680,7 +1670,29 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> ret = VM_FAULT_FALLBACK;
> }
>
> -out_unlock:
> + return ret;
> +}
At this point we no longer need @ret; we can return directly in all
cases.
> +EXPORT_SYMBOL_GPL(vfio_pci_vmf_insert_pfn);
> +
> +static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> + unsigned int order)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct vfio_pci_core_device *vdev = vma->vm_private_data;
> + unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> + unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
> + unsigned long pfn = vma_to_pfn(vma) + pgoff;
> + vm_fault_t ret = VM_FAULT_SIGBUS;
The only remaining use of this initialization is now inside the new function.
> +
> + if (order && (addr < vma->vm_start ||
> + addr + (PAGE_SIZE << order) > vma->vm_end ||
> + pfn & ((1 << order) - 1))) {
> + ret = VM_FAULT_FALLBACK;
> + goto out;
> + }
Should we make a static inline in a vfio header for the above, to avoid
the duplicate implementation in the next patch? Also, we might as well
use an else branch rather than a goto now that the bulk of the code has
moved. Maybe also convert to a scoped_guard while at it. Thanks,
Alex
> +
> + down_read(&vdev->memory_lock);
> + ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
> up_read(&vdev->memory_lock);
> out:
> dev_dbg_ratelimited(&vdev->pdev->dev,
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index f541044e42a2..6f7c6c0d4278 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -119,6 +119,9 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> size_t count, loff_t *ppos);
> ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
> size_t count, loff_t *ppos);
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
> + struct vm_fault *vmf, unsigned long pfn,
> + unsigned int order);
> int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
> void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
> int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
* [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
2025-11-25 17:30 ` [PATCH v6 1/6] vfio: export function to map the VMA ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 19:58 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
2025-11-25 17:30 ` [PATCH v6 3/6] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
` (3 subsequent siblings)
5 siblings, 2 replies; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
NVIDIA's Grace-based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].
To make use of the huge pfnmap support, a fault/huge_fault ops
based mapping mechanism needs to be implemented. Currently the
nvgrace-gpu module relies on remap_pfn_range to do the mapping
during VM bootup. Replace it to instead rely on fault and use
vfio_pci_vmf_insert_pfn to set up the mapping.
Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated
by adding a huge_fault ops implementation. The implementation
establishes the mapping according to the requested order. Note that
if the PFN or the VMA address is not aligned to the order, the
mapping falls back to the PTE level.
Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
cc: Shameer Kolothum <skolothumtho@nvidia.com>
cc: Alex Williamson <alex@shazbot.org>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 84 +++++++++++++++++++++--------
1 file changed, 62 insertions(+), 22 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index e346392b72f6..8a982310b188 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -130,6 +130,62 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
vfio_pci_core_close_device(core_vdev);
}
+static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ u64 pgoff = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+ return ((addr - vma->vm_start) >> PAGE_SHIFT) + pgoff;
+}
+
+static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
+ unsigned int order)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
+ struct vfio_pci_core_device *vdev = &nvdev->core_device;
+ unsigned int index =
+ vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+ vm_fault_t ret = VM_FAULT_SIGBUS;
+ struct mem_region *memregion;
+ unsigned long pfn, addr;
+
+ memregion = nvgrace_gpu_memregion(index, nvdev);
+ if (!memregion)
+ return ret;
+
+ addr = vmf->address & ~((PAGE_SIZE << order) - 1);
+ pfn = PHYS_PFN(memregion->memphys) + addr_to_pgoff(vma, addr);
+
+ if (order && (addr < vma->vm_start ||
+ addr + (PAGE_SIZE << order) > vma->vm_end ||
+ pfn & ((1 << order) - 1)))
+ return VM_FAULT_FALLBACK;
+
+ scoped_guard(rwsem_read, &vdev->memory_lock)
+ ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+
+ dev_dbg_ratelimited(&vdev->pdev->dev,
+ "%s order = %d pfn 0x%lx: 0x%x\n",
+ __func__, order, pfn,
+ (unsigned int)ret);
+
+ return ret;
+}
+
+static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+ return nvgrace_gpu_vfio_pci_huge_fault(vmf, 0);
+}
+
+static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
+ .fault = nvgrace_gpu_vfio_pci_fault,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+ .huge_fault = nvgrace_gpu_vfio_pci_huge_fault,
+#endif
+};
+
static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
struct vm_area_struct *vma)
{
@@ -137,10 +193,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
core_device.vdev);
struct mem_region *memregion;
- unsigned long start_pfn;
u64 req_len, pgoff, end;
unsigned int index;
- int ret = 0;
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
@@ -157,17 +211,18 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
- check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
return -EOVERFLOW;
/*
- * Check that the mapping request does not go beyond available device
- * memory size
+ * Check that the mapping request does not go beyond the exposed
+ * device memory size.
*/
if (end > memregion->memlength)
return -EINVAL;
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+
/*
* The carved out region of the device memory needs the NORMAL_NC
* property. Communicate as such to the hypervisor.
@@ -184,23 +239,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
}
- /*
- * Perform a PFN map to the memory and back the device BAR by the
- * GPU memory.
- *
- * The available GPU memory size may not be power-of-2 aligned. The
- * remainder is only backed by vfio_device_ops read/write handlers.
- *
- * During device reset, the GPU is safely disconnected to the CPU
- * and access to the BAR will be immediately returned preventing
- * machine check.
- */
- ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
- req_len, vma->vm_page_prot);
- if (ret)
- return ret;
-
- vma->vm_pgoff = start_pfn;
+ vma->vm_ops = &nvgrace_gpu_vfio_pci_mmap_ops;
+ vma->vm_private_data = nvdev;
return 0;
}
--
2.34.1
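[Editor's note] The index and addr_to_pgoff() computations in this patch decode VFIO's fixed mmap-offset layout, which packs the region index into the high bits of the offset. A user-space sketch of that decode follows; it assumes VFIO_PCI_OFFSET_SHIFT is 40 (its usual value in vfio-pci) and uses simplified scalar parameters instead of the VMA struct.

```c
#include <assert.h>

/*
 * Sketch of the vm_pgoff decoding the fault handler performs: the
 * region index lives in the bits at and above VFIO_PCI_OFFSET_SHIFT,
 * and the low bits carry the page offset within the region.
 */
#define PAGE_SHIFT 12
#define VFIO_PCI_OFFSET_SHIFT 40	/* assumed, matches vfio-pci */

static unsigned int index_from_vma(unsigned long vm_pgoff)
{
	return vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
}

static unsigned long addr_to_pgoff(unsigned long vm_pgoff,
                                   unsigned long vm_start, unsigned long addr)
{
	/* Strip the region index, keeping the in-region page offset. */
	unsigned long pgoff = vm_pgoff &
		((1UL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

	return ((addr - vm_start) >> PAGE_SHIFT) + pgoff;
}
```

So a VMA mapping region 2 at page offset 5 yields index 2, and a fault 3 pages into that VMA resolves to page 8 of the region.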
* Re: [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap
2025-11-25 17:30 ` [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
@ 2025-11-25 19:58 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2025-11-25 19:58 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi,
mochs, Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas,
peterx, pstanner, apopple, kvm, linux-kernel, cjia, kwankhede,
targupta, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:09 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's Grace-based systems have large device memory. The device
> memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
> module could make use of the huge PFNMAP support added in mm [1].
>
> To make use of the huge pfnmap support, a fault/huge_fault ops
> based mapping mechanism needs to be implemented. Currently the
> nvgrace-gpu module relies on remap_pfn_range to do the mapping
> during VM bootup. Replace it to instead rely on fault and use
> vfio_pci_vmf_insert_pfn to set up the mapping.
>
> Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated
> by adding a huge_fault ops implementation. The implementation
> establishes the mapping according to the requested order. Note that
> if the PFN or the VMA address is not aligned to the order, the
> mapping falls back to the PTE level.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
>
> cc: Shameer Kolothum <skolothumtho@nvidia.com>
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> cc: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 84 +++++++++++++++++++++--------
> 1 file changed, 62 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index e346392b72f6..8a982310b188 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,62 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
> vfio_pci_core_close_device(core_vdev);
> }
>
> +static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + u64 pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> + return ((addr - vma->vm_start) >> PAGE_SHIFT) + pgoff;
> +}
> +
> +static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
> + unsigned int order)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
> + struct vfio_pci_core_device *vdev = &nvdev->core_device;
> + unsigned int index =
> + vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> + vm_fault_t ret = VM_FAULT_SIGBUS;
> + struct mem_region *memregion;
> + unsigned long pfn, addr;
> +
> + memregion = nvgrace_gpu_memregion(index, nvdev);
> + if (!memregion)
> + return ret;
> +
> + addr = vmf->address & ~((PAGE_SIZE << order) - 1);
ALIGN_DOWN(vmf->address, PAGE_SIZE << order).
> + pfn = PHYS_PFN(memregion->memphys) + addr_to_pgoff(vma, addr);
> +
> + if (order && (addr < vma->vm_start ||
> + addr + (PAGE_SIZE << order) > vma->vm_end ||
> + pfn & ((1 << order) - 1)))
!IS_ALIGNED(pfn, 1 << order).
The other parts look good to me.
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
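[Editor's note] For reference, the macros Zhi suggests are equivalent to the open-coded masks in the patch. A user-space re-derivation follows; the macro bodies are simplified versions of the kernel's (valid for power-of-two alignments only), reproduced here for illustration rather than taken from include/linux/align.h verbatim.

```c
#include <assert.h>

/*
 * Simplified re-derivations of ALIGN_DOWN()/IS_ALIGNED() for
 * power-of-two alignments: the same masks the patch open-codes as
 * `x & ~((PAGE_SIZE << order) - 1)` and `pfn & ((1 << order) - 1)`.
 */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

#define PAGE_SIZE 4096UL
#define PMD_ORDER 9	/* 2MB mappings with 4K pages */
```

With these, the patch's address alignment becomes `ALIGN_DOWN(vmf->address, PAGE_SIZE << order)` and the PFN check `!IS_ALIGNED(pfn, 1 << order)`, as suggested.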
* Re: [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap
2025-11-25 17:30 ` [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
2025-11-25 19:58 ` Zhi Wang
@ 2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2025-11-25 20:52 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:09 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's Grace-based systems have large device memory. The device
> memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
> module could make use of the huge PFNMAP support added in mm [1].
>
> To make use of the huge pfnmap support, a fault/huge_fault ops
> based mapping mechanism needs to be implemented. Currently the
> nvgrace-gpu module relies on remap_pfn_range to do the mapping
> during VM bootup. Replace it to instead rely on fault and use
> vfio_pci_vmf_insert_pfn to set up the mapping.
>
> Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated
> by adding a huge_fault ops implementation. The implementation
> establishes the mapping according to the requested order. Note that
> if the PFN or the VMA address is not aligned to the order, the
> mapping falls back to the PTE level.
>
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
>
> cc: Shameer Kolothum <skolothumtho@nvidia.com>
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> cc: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 84 +++++++++++++++++++++--------
> 1 file changed, 62 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index e346392b72f6..8a982310b188 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,62 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
> vfio_pci_core_close_device(core_vdev);
> }
>
> +static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + u64 pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> + return ((addr - vma->vm_start) >> PAGE_SHIFT) + pgoff;
> +}
> +
> +static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
> + unsigned int order)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
> + struct vfio_pci_core_device *vdev = &nvdev->core_device;
> + unsigned int index =
> + vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> + vm_fault_t ret = VM_FAULT_SIGBUS;
> + struct mem_region *memregion;
> + unsigned long pfn, addr;
> +
> + memregion = nvgrace_gpu_memregion(index, nvdev);
> + if (!memregion)
> + return ret;
> +
> + addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> + pfn = PHYS_PFN(memregion->memphys) + addr_to_pgoff(vma, addr);
> +
> + if (order && (addr < vma->vm_start ||
> + addr + (PAGE_SIZE << order) > vma->vm_end ||
> + pfn & ((1 << order) - 1)))
> + return VM_FAULT_FALLBACK;
The dev_dbg misses this fallback path this way. Thanks,
Alex
> +
> + scoped_guard(rwsem_read, &vdev->memory_lock)
> + ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
> +
> + dev_dbg_ratelimited(&vdev->pdev->dev,
> + "%s order = %d pfn 0x%lx: 0x%x\n",
> + __func__, order, pfn,
> + (unsigned int)ret);
> +
> + return ret;
> +}
> +
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> + return nvgrace_gpu_vfio_pci_huge_fault(vmf, 0);
> +}
> +
> +static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
> + .fault = nvgrace_gpu_vfio_pci_fault,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> + .huge_fault = nvgrace_gpu_vfio_pci_huge_fault,
> +#endif
> +};
> +
> static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
> struct vm_area_struct *vma)
> {
> @@ -137,10 +193,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
> container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
> core_device.vdev);
> struct mem_region *memregion;
> - unsigned long start_pfn;
> u64 req_len, pgoff, end;
> unsigned int index;
> - int ret = 0;
>
> index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>
> @@ -157,17 +211,18 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
> ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>
> if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> - check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
> check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> return -EOVERFLOW;
>
> /*
> - * Check that the mapping request does not go beyond available device
> - * memory size
> + * Check that the mapping request does not go beyond the exposed
> + * device memory size.
> */
> if (end > memregion->memlength)
> return -EINVAL;
>
> + vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> +
> /*
> * The carved out region of the device memory needs the NORMAL_NC
> * property. Communicate as such to the hypervisor.
> @@ -184,23 +239,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
> vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> }
>
> - /*
> - * Perform a PFN map to the memory and back the device BAR by the
> - * GPU memory.
> - *
> - * The available GPU memory size may not be power-of-2 aligned. The
> - * remainder is only backed by vfio_device_ops read/write handlers.
> - *
> - * During device reset, the GPU is safely disconnected to the CPU
> - * and access to the BAR will be immediately returned preventing
> - * machine check.
> - */
> - ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
> - req_len, vma->vm_page_prot);
> - if (ret)
> - return ret;
> -
> - vma->vm_pgoff = start_pfn;
> + vma->vm_ops = &nvgrace_gpu_vfio_pci_mmap_ops;
> + vma->vm_private_data = nvdev;
>
> return 0;
> }
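[Editor's note] One way to address Alex's point that the early `return VM_FAULT_FALLBACK` skips the dev_dbg is a single-exit restructure, so every outcome passes the log statement. A user-space sketch follows; the enum values, flag parameters, and the `dbg_count` counter standing in for dev_dbg_ratelimited() are all illustrative, not the kernel's.

```c
#include <assert.h>

/* Simplified stand-ins for vm_fault_t result codes. */
enum fault_ret { FAULT_OK, FAULT_SIGBUS, FAULT_FALLBACK };

static int dbg_count;	/* stands in for dev_dbg_ratelimited() firing */

/*
 * Single-exit restructure: route every return through one "out" label
 * where the logging happens, so the fallback path is logged too.
 */
static enum fault_ret huge_fault(int have_memregion, int aligned)
{
	enum fault_ret ret = FAULT_SIGBUS;

	if (!have_memregion)
		goto out;

	if (!aligned) {
		ret = FAULT_FALLBACK;
		goto out;
	}

	ret = FAULT_OK;	/* vfio_pci_vmf_insert_pfn() would run here */
out:
	dbg_count++;	/* the debug print now sees every outcome */
	return ret;
}
```

The same effect could also be had by logging just before the fallback return; the single-exit form merely keeps one print for all paths.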
* [PATCH v6 3/6] vfio: use vfio_pci_core_setup_barmap to map bar in mmap
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
2025-11-25 17:30 ` [PATCH v6 1/6] vfio: export function to map the VMA ankita
2025-11-25 17:30 ` [PATCH v6 2/6] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 20:04 ` Zhi Wang
2025-11-25 17:30 ` [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
` (2 subsequent siblings)
5 siblings, 1 reply; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
Remove code duplication in vfio_pci_core_mmap by calling
vfio_pci_core_setup_barmap to perform the BAR mapping.
cc: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/vfio_pci_core.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c445a53ee12e..3cc799eb75ea 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1761,18 +1761,9 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
* Even though we don't make use of the barmap for the mmap,
* we need to request the region and the barmap tracks that.
*/
- if (!vdev->barmap[index]) {
- ret = pci_request_selected_regions(pdev,
- 1 << index, "vfio-pci");
- if (ret)
- return ret;
-
- vdev->barmap[index] = pci_iomap(pdev, index, 0);
- if (!vdev->barmap[index]) {
- pci_release_selected_regions(pdev, 1 << index);
- return -ENOMEM;
- }
- }
+ ret = vfio_pci_core_setup_barmap(vdev, index);
+ if (ret)
+ return ret;
vma->vm_private_data = vdev;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v6 3/6] vfio: use vfio_pci_core_setup_barmap to map bar in mmap
2025-11-25 17:30 ` [PATCH v6 3/6] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
@ 2025-11-25 20:04 ` Zhi Wang
0 siblings, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2025-11-25 20:04 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi,
mochs, Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas,
peterx, pstanner, apopple, kvm, linux-kernel, cjia, kwankhede,
targupta, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:10 +0000
<ankita@nvidia.com> wrote:
LGTM.
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Remove code duplication in vfio_pci_core_mmap by calling
> vfio_pci_core_setup_barmap to perform the bar mapping.
>
> cc: Donald Dutile <ddutile@redhat.com>
> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c
> b/drivers/vfio/pci/vfio_pci_core.c index c445a53ee12e..3cc799eb75ea
> 100644 --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1761,18 +1761,9 @@ int vfio_pci_core_mmap(struct vfio_device
> *core_vdev, struct vm_area_struct *vma
> * Even though we don't make use of the barmap for the mmap,
> * we need to request the region and the barmap tracks that.
> */
> - if (!vdev->barmap[index]) {
> - ret = pci_request_selected_regions(pdev,
> - 1 << index,
> "vfio-pci");
> - if (ret)
> - return ret;
> -
> - vdev->barmap[index] = pci_iomap(pdev, index, 0);
> - if (!vdev->barmap[index]) {
> - pci_release_selected_regions(pdev, 1 <<
> index);
> - return -ENOMEM;
> - }
> - }
> + ret = vfio_pci_core_setup_barmap(vdev, index);
> + if (ret)
> + return ret;
>
> vma->vm_private_data = vdev;
> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
^ permalink raw reply [flat|nested] 18+ messages in thread
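Editorial note: vfio_pci_core_setup_barmap() is assumed here to perform the same request-then-iomap sequence the removed block open-coded: do nothing if the BAR is already mapped, request the region, iomap it, and release the region again if the map fails so the helper can be retried. A userspace C sketch of that contract (the struct and the request/map stubs are stand-ins for illustration, not the kernel API):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-in for a BAR: tracks whether the region is
 * requested and mapped, plus a knob to force a map failure.
 */
struct bar {
	void *map;
	bool region_requested;
	bool map_fails;   /* force the iomap step to fail */
	int requests;     /* times the region was requested */
	int releases;     /* times the region was released */
};

static int request_region(struct bar *bar)   /* pci_request_selected_regions() */
{
	bar->requests++;
	bar->region_requested = true;
	return 0;
}

static void release_region(struct bar *bar)  /* pci_release_selected_regions() */
{
	bar->releases++;
	bar->region_requested = false;
}

static void *io_map(struct bar *bar)         /* pci_iomap() */
{
	static char fake_mmio[64];

	return bar->map_fails ? NULL : fake_mmio;
}

/*
 * Mirrors the assumed contract of vfio_pci_core_setup_barmap():
 * idempotent if already mapped, and on a failed map the region
 * request is rolled back so a later call starts clean.
 */
static int setup_barmap(struct bar *bar)
{
	int ret;

	if (bar->map)
		return 0;

	ret = request_region(bar);
	if (ret)
		return ret;

	bar->map = io_map(bar);
	if (!bar->map) {
		release_region(bar);
		return -ENOMEM;
	}

	return 0;
}
```

Because the helper is idempotent, vfio_pci_core_mmap can call it unconditionally, which is what allows the open-coded `if (!vdev->barmap[index])` block to be dropped.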
* [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
` (2 preceding siblings ...)
2025-11-25 17:30 ` [PATCH v6 3/6] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 20:30 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
2025-11-25 17:30 ` [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
2025-11-25 17:30 ` [PATCH v6 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
5 siblings, 2 replies; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
Split the function that checks for the GPU device being ready
at probe time.
Move the code that waits for the GPU to become ready through BAR0
register reads into a separate function so that it can be reused.
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 8a982310b188..2b736cb82f38 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -130,6 +130,20 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
vfio_pci_core_close_device(core_vdev);
}
+static int nvgrace_gpu_wait_device_ready(void __iomem *io)
+{
+ unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
+
+ do {
+ if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
+ (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY))
+ return 0;
+ msleep(POLL_QUANTUM_MS);
+ } while (!time_after(jiffies, timeout));
+
+ return -ETIME;
+}
+
static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
unsigned long addr)
{
@@ -933,9 +947,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
* Ensure that the BAR0 region is enabled before accessing the
* registers.
*/
-static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
+static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
{
- unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
void __iomem *io;
int ret = -ETIME;
@@ -953,16 +966,8 @@ static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
goto iomap_exit;
}
- do {
- if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
- (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
- ret = 0;
- goto reg_check_exit;
- }
- msleep(POLL_QUANTUM_MS);
- } while (!time_after(jiffies, timeout));
+ ret = nvgrace_gpu_wait_device_ready(io);
-reg_check_exit:
pci_iounmap(pdev, io);
iomap_exit:
pci_release_selected_regions(pdev, 1 << 0);
@@ -979,7 +984,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
u64 memphys, memlength;
int ret;
- ret = nvgrace_gpu_wait_device_ready(pdev);
+ ret = nvgrace_gpu_probe_check_device_ready(pdev);
if (ret)
return ret;
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready
2025-11-25 17:30 ` [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
@ 2025-11-25 20:30 ` Zhi Wang
2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2025-11-25 20:30 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi,
mochs, Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas,
peterx, pstanner, apopple, kvm, linux-kernel, cjia, kwankhede,
targupta, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:11 +0000
<ankita@nvidia.com> wrote:
Looking good to me.
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Split the function that checks for the GPU device being ready
> at probe time.
>
> Move the code that waits for the GPU to become ready through BAR0
> register reads into a separate function so that it can be reused.
>
> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 29
> +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12
> deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c
> b/drivers/vfio/pci/nvgrace-gpu/main.c index
> 8a982310b188..2b736cb82f38 100644 ---
> a/drivers/vfio/pci/nvgrace-gpu/main.c +++
> b/drivers/vfio/pci/nvgrace-gpu/main.c @@ -130,6 +130,20 @@ static
> void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
> vfio_pci_core_close_device(core_vdev); }
>
> +static int nvgrace_gpu_wait_device_ready(void __iomem *io)
> +{
> + unsigned long timeout = jiffies +
> msecs_to_jiffies(POLL_TIMEOUT_MS); +
> + do {
> + if ((ioread32(io + C2C_LINK_BAR0_OFFSET) ==
> STATUS_READY) &&
> + (ioread32(io + HBM_TRAINING_BAR0_OFFSET) ==
> STATUS_READY))
> + return 0;
> + msleep(POLL_QUANTUM_MS);
> + } while (!time_after(jiffies, timeout));
> +
> + return -ETIME;
> +}
> +
> static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
> unsigned long addr)
> {
> @@ -933,9 +947,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct
> pci_dev *pdev)
> * Ensure that the BAR0 region is enabled before accessing the
> * registers.
> */
> -static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> +static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
> {
> - unsigned long timeout = jiffies +
> msecs_to_jiffies(POLL_TIMEOUT_MS); void __iomem *io;
> int ret = -ETIME;
>
> @@ -953,16 +966,8 @@ static int nvgrace_gpu_wait_device_ready(struct
> pci_dev *pdev) goto iomap_exit;
> }
>
> - do {
> - if ((ioread32(io + C2C_LINK_BAR0_OFFSET) ==
> STATUS_READY) &&
> - (ioread32(io + HBM_TRAINING_BAR0_OFFSET) ==
> STATUS_READY)) {
> - ret = 0;
> - goto reg_check_exit;
> - }
> - msleep(POLL_QUANTUM_MS);
> - } while (!time_after(jiffies, timeout));
> + ret = nvgrace_gpu_wait_device_ready(io);
>
> -reg_check_exit:
> pci_iounmap(pdev, io);
> iomap_exit:
> pci_release_selected_regions(pdev, 1 << 0);
> @@ -979,7 +984,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
> u64 memphys, memlength;
> int ret;
>
> - ret = nvgrace_gpu_wait_device_ready(pdev);
> + ret = nvgrace_gpu_probe_check_device_ready(pdev);
> if (ret)
> return ret;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready
2025-11-25 17:30 ` [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
2025-11-25 20:30 ` Zhi Wang
@ 2025-11-25 20:52 ` Alex Williamson
1 sibling, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2025-11-25 20:52 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:11 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Split the function that checks for the GPU device being ready
> at probe time.
>
> Move the code that waits for the GPU to become ready through BAR0
> register reads into a separate function so that it can be reused.
>
> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
As noted last round:
Fixes: ...
And note the fix in the commit log.
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 8a982310b188..2b736cb82f38 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,20 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
> vfio_pci_core_close_device(core_vdev);
> }
>
> +static int nvgrace_gpu_wait_device_ready(void __iomem *io)
> +{
> + unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> +
> + do {
> + if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> + (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY))
> + return 0;
> + msleep(POLL_QUANTUM_MS);
> + } while (!time_after(jiffies, timeout));
> +
> + return -ETIME;
> +}
> +
> static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
> unsigned long addr)
> {
> @@ -933,9 +947,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
> * Ensure that the BAR0 region is enabled before accessing the
> * registers.
> */
> -static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> +static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
> {
> - unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> void __iomem *io;
> int ret = -ETIME;
And this initialization is unnecessary. Thanks,
Alex
>
> @@ -953,16 +966,8 @@ static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> goto iomap_exit;
> }
>
> - do {
> - if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> - (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> - ret = 0;
> - goto reg_check_exit;
> - }
> - msleep(POLL_QUANTUM_MS);
> - } while (!time_after(jiffies, timeout));
> + ret = nvgrace_gpu_wait_device_ready(io);
>
> -reg_check_exit:
> pci_iounmap(pdev, io);
> iomap_exit:
> pci_release_selected_regions(pdev, 1 << 0);
> @@ -979,7 +984,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
> u64 memphys, memlength;
> int ret;
>
> - ret = nvgrace_gpu_wait_device_ready(pdev);
> + ret = nvgrace_gpu_probe_check_device_ready(pdev);
> if (ret)
> return ret;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
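Editorial note: the helper factored out in this patch follows the kernel's standard bounded-poll shape: compute a jiffies deadline, re-test the ready condition, sleep a quantum between tests, and return -ETIME once the deadline passes. A userspace C analogue of the same shape (the two status words stand in for the C2C link and HBM training registers, and the timeout values are illustrative, not the driver's constants):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

#define STATUS_READY    0xff  /* stand-in ready value */
#define POLL_TIMEOUT_MS 200   /* illustrative, not the driver's value */
#define POLL_QUANTUM_MS 10

static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

/*
 * Mirrors the structure of nvgrace_gpu_wait_device_ready(): poll
 * two status words until both read ready, sleeping a quantum
 * between reads, or give up once the deadline passes.
 */
static int wait_device_ready(const int *c2c_status, const int *hbm_status)
{
	long long timeout = now_ms() + POLL_TIMEOUT_MS;

	do {
		if (*c2c_status == STATUS_READY &&
		    *hbm_status == STATUS_READY)
			return 0;
		nanosleep(&(struct timespec){
			  .tv_nsec = POLL_QUANTUM_MS * 1000000L }, NULL);
	} while (now_ms() <= timeout);

	return -ETIME;
}
```

Note the condition is tested before the first sleep, so an already-ready device returns immediately; that property is what makes the helper cheap to call from hot paths once the split lands.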
* [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
` (3 preceding siblings ...)
2025-11-25 17:30 ` [PATCH v6 4/6] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 20:52 ` Alex Williamson
2025-11-25 17:30 ` [PATCH v6 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
5 siblings, 1 reply; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
Introduce a new flag, reset_done, to indicate that the GPU has just
been reset and the mappings to the GPU memory have been zapped.
Implement the reset_done handler to set this new flag. It will be
used in later patches to wait for the GPU memory to be ready before
any mapping or access.
cc: Jason Gunthorpe <jgg@ziepe.ca>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 2b736cb82f38..7d5544280ed2 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -58,6 +58,8 @@ struct nvgrace_gpu_pci_core_device {
/* Lock to control device memory kernel mapping */
struct mutex remap_lock;
bool has_mig_hw_bug;
+ /* GPU has just been reset */
+ bool reset_done;
};
static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
@@ -1047,12 +1049,27 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
+static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+ struct nvgrace_gpu_pci_core_device *nvdev =
+ container_of(core_device, struct nvgrace_gpu_pci_core_device,
+ core_device);
+
+ nvdev->reset_done = true;
+}
+
+static const struct pci_error_handlers nvgrace_gpu_vfio_pci_err_handlers = {
+ .reset_done = nvgrace_gpu_vfio_pci_reset_done,
+ .error_detected = vfio_pci_core_aer_err_detected,
+};
+
static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
.name = KBUILD_MODNAME,
.id_table = nvgrace_gpu_vfio_pci_table,
.probe = nvgrace_gpu_probe,
.remove = nvgrace_gpu_remove,
- .err_handler = &vfio_pci_core_err_handlers,
+ .err_handler = &nvgrace_gpu_vfio_pci_err_handlers,
.driver_managed_dma = true,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset
2025-11-25 17:30 ` [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
@ 2025-11-25 20:52 ` Alex Williamson
2025-11-26 3:26 ` Ankit Agrawal
0 siblings, 1 reply; 18+ messages in thread
From: Alex Williamson @ 2025-11-25 20:52 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:12 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Introduce a new flag reset_done to notify that the GPU has just
> been reset and the mapping to the GPU memory is zapped.
>
> Implement the reset_done handler to set this new variable. It
> will be used later in the patches to wait for the GPU memory
> to be ready before doing any mapping or access.
>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 19 ++++++++++++++++++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 2b736cb82f38..7d5544280ed2 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -58,6 +58,8 @@ struct nvgrace_gpu_pci_core_device {
> /* Lock to control device memory kernel mapping */
> struct mutex remap_lock;
> bool has_mig_hw_bug;
> + /* GPU has just been reset */
> + bool reset_done;
> };
>
> static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
> @@ -1047,12 +1049,27 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
>
> MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
>
/*
* Comment explaining why this can't use lockdep_assert_held_write but
* in vfio use cases relies on this for serialization against faults and
* read/write.
*/
Thanks,
Alex
> +static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
> +{
> + struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> + struct nvgrace_gpu_pci_core_device *nvdev =
> + container_of(core_device, struct nvgrace_gpu_pci_core_device,
> + core_device);
> +
> + nvdev->reset_done = true;
> +}
> +
> +static const struct pci_error_handlers nvgrace_gpu_vfio_pci_err_handlers = {
> + .reset_done = nvgrace_gpu_vfio_pci_reset_done,
> + .error_detected = vfio_pci_core_aer_err_detected,
> +};
> +
> static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
> .name = KBUILD_MODNAME,
> .id_table = nvgrace_gpu_vfio_pci_table,
> .probe = nvgrace_gpu_probe,
> .remove = nvgrace_gpu_remove,
> - .err_handler = &vfio_pci_core_err_handlers,
> + .err_handler = &nvgrace_gpu_vfio_pci_err_handlers,
> .driver_managed_dma = true,
> };
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset
2025-11-25 20:52 ` Alex Williamson
@ 2025-11-26 3:26 ` Ankit Agrawal
2025-11-26 4:54 ` Alex Williamson
0 siblings, 1 reply; 18+ messages in thread
From: Ankit Agrawal @ 2025-11-26 3:26 UTC (permalink / raw)
To: Alex Williamson
Cc: jgg@ziepe.ca, Yishai Hadas, Shameer Kolothum,
kevin.tian@intel.com, Aniket Agashe, Vikram Sethi, Matt Ochs,
Yunxiang.Li@amd.com, yi.l.liu@intel.com,
zhangdongdong@eswincomputing.com, Avihai Horon,
bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
Alistair Popple, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
Krishnakant Jaju
>> MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
>>
> /*
> * Comment explaining why this can't use lockdep_assert_held_write but
> * in vfio use cases relies on this for serialization against faults and
> * read/write.
> */
>
In this patch or the next where we actually do the serialization with
memory_lock?
> Thanks,
> Alex
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset
2025-11-26 3:26 ` Ankit Agrawal
@ 2025-11-26 4:54 ` Alex Williamson
0 siblings, 0 replies; 18+ messages in thread
From: Alex Williamson @ 2025-11-26 4:54 UTC (permalink / raw)
To: Ankit Agrawal
Cc: jgg@ziepe.ca, Yishai Hadas, Shameer Kolothum,
kevin.tian@intel.com, Aniket Agashe, Vikram Sethi, Matt Ochs,
Yunxiang.Li@amd.com, yi.l.liu@intel.com,
zhangdongdong@eswincomputing.com, Avihai Horon,
bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
Alistair Popple, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
Krishnakant Jaju
On Wed, 26 Nov 2025 03:26:54 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:
> >> MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
> >>
> > /*
> > * Comment explaining why this can't use lockdep_assert_held_write but
> > * in vfio use cases relies on this for serialization against faults and
> > * read/write.
> > */
> >
>
> In this patch or the next where we actually do the serialization with
> memory_lock?
When the interaction is added would make more sense. Thanks,
Alex
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v6 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready
2025-11-25 17:30 [PATCH v6 0/6] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
` (4 preceding siblings ...)
2025-11-25 17:30 ` [PATCH v6 5/6] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
@ 2025-11-25 17:30 ` ankita
2025-11-25 20:28 ` Zhi Wang
5 siblings, 1 reply; 18+ messages in thread
From: ankita @ 2025-11-25 17:30 UTC (permalink / raw)
To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
vsethi, mochs
Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
zhiw, danw, dnigam, kjaju
From: Ankit Agrawal <ankita@nvidia.com>
Until the GPU is ready after a reset, speculative CPU prefetches
to GPU memory can cause harmless corrected RAS events to be logged
on Grace systems. It is thus preferred that the mapping not be
re-established until the GPU is ready post reset.
GPU readiness can be checked through BAR0 registers, similar to
the check done at device probe time.
It can take several seconds for the GPU to become ready, so it is
desirable that this time overlap as much of the VM startup as
possible to reduce the impact on VM boot time. GPU readiness is
therefore checked on the first fault/huge_fault request or
read/write access, which amortizes the readiness wait.
The first fault and read/write access check the GPU state when the
reset_done flag is set, which denotes that the GPU has just been
reset. The memory_lock is taken across the map/access to avoid
races with GPU reset.
cc: Shameer Kolothum <skolothumtho@nvidia.com>
cc: Alex Williamson <alex@shazbot.org>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 66 ++++++++++++++++++++++++++---
1 file changed, 59 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 7d5544280ed2..f9cea19093fa 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -104,6 +104,17 @@ static int nvgrace_gpu_open_device(struct vfio_device *core_vdev)
mutex_init(&nvdev->remap_lock);
}
+ /*
+ * GPU readiness is checked by reading the BAR0 registers.
+ *
+ * ioremap BAR0 to ensure that the BAR0 mapping is present before
+ * register reads on first fault before establishing any GPU
+ * memory mapping.
+ */
+ ret = vfio_pci_core_setup_barmap(vdev, 0);
+ if (ret)
+ return ret;
+
vfio_pci_core_finish_enable(vdev);
return 0;
@@ -146,6 +157,31 @@ static int nvgrace_gpu_wait_device_ready(void __iomem *io)
return -ETIME;
}
+/*
+ * If the GPU memory is accessed by the CPU while the GPU is not ready
+ * after reset, it can cause harmless corrected RAS events to be logged.
+ * Make sure the GPU is ready before establishing the mappings.
+ */
+static int
+nvgrace_gpu_check_device_ready(struct nvgrace_gpu_pci_core_device *nvdev)
+{
+ struct vfio_pci_core_device *vdev = &nvdev->core_device;
+ int ret;
+
+ lockdep_assert_held_read(&vdev->memory_lock);
+
+ if (!nvdev->reset_done)
+ return 0;
+
+ ret = nvgrace_gpu_wait_device_ready(vdev->barmap[0]);
+ if (ret)
+ return ret;
+
+ nvdev->reset_done = false;
+
+ return 0;
+}
+
static unsigned long addr_to_pgoff(struct vm_area_struct *vma,
unsigned long addr)
{
@@ -179,8 +215,12 @@ static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
pfn & ((1 << order) - 1)))
return VM_FAULT_FALLBACK;
- scoped_guard(rwsem_read, &vdev->memory_lock)
+ scoped_guard(rwsem_read, &vdev->memory_lock) {
+ if (nvgrace_gpu_check_device_ready(nvdev))
+ return ret;
+
ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+ }
dev_dbg_ratelimited(&vdev->pdev->dev,
"%s order = %d pfn 0x%lx: 0x%x\n",
@@ -592,9 +632,15 @@ nvgrace_gpu_read_mem(struct nvgrace_gpu_pci_core_device *nvdev,
else
mem_count = min(count, memregion->memlength - (size_t)offset);
- ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
- if (ret)
- return ret;
+ scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
+ ret = nvgrace_gpu_check_device_ready(nvdev);
+ if (ret)
+ return ret;
+
+ ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
+ if (ret)
+ return ret;
+ }
/*
* Only the device memory present on the hardware is mapped, which may
@@ -712,9 +758,15 @@ nvgrace_gpu_write_mem(struct nvgrace_gpu_pci_core_device *nvdev,
*/
mem_count = min(count, memregion->memlength - (size_t)offset);
- ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
- if (ret)
- return ret;
+ scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
+ ret = nvgrace_gpu_check_device_ready(nvdev);
+ if (ret)
+ return ret;
+
+ ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
+ if (ret)
+ return ret;
+ }
exitfn:
*ppos += count;
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v6 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready
2025-11-25 17:30 ` [PATCH v6 6/6] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
@ 2025-11-25 20:28 ` Zhi Wang
0 siblings, 0 replies; 18+ messages in thread
From: Zhi Wang @ 2025-11-25 20:28 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi,
mochs, Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas,
peterx, pstanner, apopple, kvm, linux-kernel, cjia, kwankhede,
targupta, danw, dnigam, kjaju
On Tue, 25 Nov 2025 17:30:13 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> Speculative prefetches from CPU to GPU memory until the GPU is
> ready after reset can cause harmless corrected RAS events to
> be logged on Grace systems. It is thus preferred that the
> mapping not be re-established until the GPU is ready post reset.
>
> The GPU readiness can be checked through BAR0 registers similar
> to the checking at the time of device probe.
>
> It can take several seconds for the GPU to be ready. So it is
> desirable that the time overlaps as much of the VM startup as
> possible to reduce impact on the VM bootup time. The GPU
> readiness state is thus checked on the first fault/huge_fault
> request or read/write access which amortizes the GPU readiness
> time.
>
snip
> @@ -179,8 +215,12 @@ static vm_fault_t
> nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf, pfn & ((1 <<
> order) - 1))) return VM_FAULT_FALLBACK;
>
> - scoped_guard(rwsem_read, &vdev->memory_lock)
> + scoped_guard(rwsem_read, &vdev->memory_lock) {
> + if (nvgrace_gpu_check_device_ready(nvdev))
> + return ret;
> +
I would suggest open-coding the error return if we don't have a
"bail out without touching ret" path similar to
vfio_pci_mmap_huge_fault(), since this looks unnecessarily
confusing.
Please also fix the same in PATCH 2.
> ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
> + }
>
> dev_dbg_ratelimited(&vdev->pdev->dev,
> "%s order = %d pfn 0x%lx: 0x%x\n",
> @@ -592,9 +632,15 @@ nvgrace_gpu_read_mem(struct
^ permalink raw reply [flat|nested] 18+ messages in thread