public inbox for kvm@vger.kernel.org
* [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset
@ 2025-11-24 11:59 ankita
  2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's Grace-based systems have large GPU device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module can make use of the huge PFNMAP support added in mm [1].

To achieve this, the nvgrace-gpu module is updated to implement the
huge_fault op. The implementation establishes the mapping according
to the requested order. Note that if the PFN or the VMA address is
not aligned to the order, the mapping falls back to the PTE level.

Secondly, the mapping must not be re-established until the GPU is
ready post reset. If mappings are present during that time,
speculative CPU reads of the GPU memory could cause harmless
corrected RAS events to be logged on the Grace systems.

It can take several seconds for the GPU to become ready, so it is
desirable that this wait overlap as much of the VM startup as possible
to reduce the impact on VM boot time. The GPU readiness state is thus
checked on the first fault/huge_fault request, which amortizes the
readiness wait. Readiness is determined through BAR0 registers, as is
done at device probe.

Patch 1 updates the mapping mechanism to use faults.

Patch 2 splits out the code that maps at the various page table levels.

Patch 3 implements support for huge PFNMAP.

Patch 4 cleans up vfio_pci_core_mmap.

Patch 5 splits out the code that checks device readiness.

Patch 6 implements the reset_done handler.

Patch 7 ensures that the GPU is ready before re-establishing the mapping
after reset.

Applied over 6.18-rc6.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]

Changelog:
v5:
- Updated gpu_mem_mapped with reset_done flag for clearer semantics. (6/7)
  (Thanks Alex Williamson)
- Renamed vfio_pci_map_pfn to vfio_pci_vmf_insert_pfn. (2/7)
  (Thanks Alex Williamson)
- Updated to hold memory_lock across the vmf_insert_pfn and the
  read/write access of the device. (7/7) (Thanks Alex Williamson)
- Used scoped_guard to simplify critical region. (1/7, 7/7)
[v4]
- Implemented reset_done handler to set gpu_mem_mapped flag. Cleaned up
  FLR detection path (Thanks Alex Williamson)
- Moved the premap check of the device readiness to a new function.
  Added locking to avoid races. (Thanks Alex Williamson)
- vfio_pci_core_mmap cleanup.
- Added ioremap to BAR0 during open.
Link: https://lore.kernel.org/all/20251121141141.3175-1-ankita@nvidia.com/ [v3]
- Moved the code for BAR mapping to a separate function.
- Added BAR0 mapping during open. Ensures BAR0 is mapped when registers
  are checked. (Thanks Alex Williamson, Jason Gunthorpe for suggestion)
- Added check for GPU readiness on nvgrace_gpu_map_device_mem. (Thanks
  Alex Williamson for the suggestion.)
Link: https://lore.kernel.org/all/20251118074422.58081-1-ankita@nvidia.com/ [v2]
- Fixed kernel build warning
- Subject text changes
- Rebased to 6.18-rc6.
Link: https://lore.kernel.org/all/20251117124159.3560-1-ankita@nvidia.com/ [v1]

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Ankit Agrawal (7):
  vfio/nvgrace-gpu: Use faults to map device memory
  vfio: export function to map the VMA
  vfio/nvgrace-gpu: Add support for huge pfnmap
  vfio: use vfio_pci_core_setup_barmap to map bar in mmap
  vfio/nvgrace-gpu: split the code to wait for GPU ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  vfio/nvgrace-gpu: wait for the GPU mem to be ready

 drivers/vfio/pci/nvgrace-gpu/main.c | 216 ++++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_core.c    |  61 ++++----
 include/linux/vfio_pci_core.h       |   2 +
 3 files changed, 207 insertions(+), 72 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 17:09   ` Shameer Kolothum
  2025-11-24 17:16   ` Jason Gunthorpe
  2025-11-24 11:59 ` [PATCH v5 2/7] vfio: export function to map the VMA ankita
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

To make use of the huge PFNMAP support and to support the zap/remap
sequence, a fault/huge_fault ops based mapping mechanism needs to
be implemented.

Currently the nvgrace-gpu module relies on remap_pfn_range to do
the mapping during VM bootup. Replace it to instead rely on fault
and use vmf_insert_pfn to set up the mapping.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 50 +++++++++++++++++------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index e346392b72f6..f74f3d8e1ebe 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -130,6 +130,32 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
+static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
+	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+	struct mem_region *memregion;
+	unsigned long pgoff, pfn;
+
+	memregion = nvgrace_gpu_memregion(index, nvdev);
+	if (!memregion)
+		return ret;
+
+	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	pfn = PHYS_PFN(memregion->memphys) + pgoff;
+
+	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
+		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
+
+	return ret;
+}
+
+static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
+	.fault = nvgrace_gpu_vfio_pci_fault,
+};
+
 static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 			    struct vm_area_struct *vma)
 {
@@ -137,10 +163,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 		container_of(core_vdev, struct nvgrace_gpu_pci_core_device,
 			     core_device.vdev);
 	struct mem_region *memregion;
-	unsigned long start_pfn;
 	u64 req_len, pgoff, end;
 	unsigned int index;
-	int ret = 0;
 
 	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 
@@ -157,7 +181,6 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
 
 	if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
-	    check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
 	    check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
 		return -EOVERFLOW;
 
@@ -168,6 +191,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 	if (end > memregion->memlength)
 		return -EINVAL;
 
+	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+
 	/*
 	 * The carved out region of the device memory needs the NORMAL_NC
 	 * property. Communicate as such to the hypervisor.
@@ -184,23 +209,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
 	}
 
-	/*
-	 * Perform a PFN map to the memory and back the device BAR by the
-	 * GPU memory.
-	 *
-	 * The available GPU memory size may not be power-of-2 aligned. The
-	 * remainder is only backed by vfio_device_ops read/write handlers.
-	 *
-	 * During device reset, the GPU is safely disconnected to the CPU
-	 * and access to the BAR will be immediately returned preventing
-	 * machine check.
-	 */
-	ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
-			      req_len, vma->vm_page_prot);
-	if (ret)
-		return ret;
-
-	vma->vm_pgoff = start_pfn;
+	vma->vm_ops = &nvgrace_gpu_vfio_pci_mmap_ops;
+	vma->vm_private_data = nvdev;
 
 	return 0;
 }
-- 
2.34.1



* [PATCH v5 2/7] vfio: export function to map the VMA
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
  2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 17:22   ` Shameer Kolothum
  2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

Split out the implementation that maps the VMA at the PTE/PMD/PUD
level into a separate function.

Export the function to be used by the nvgrace-gpu module.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 46 ++++++++++++++++++++------------
 include/linux/vfio_pci_core.h    |  2 ++
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..ede410e0ae1c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1640,6 +1640,34 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
+vm_fault_t vfio_pci_vmf_insert_pfn(struct vm_fault *vmf,
+				   unsigned long pfn,
+				   unsigned int order)
+{
+	vm_fault_t ret;
+
+	switch (order) {
+	case 0:
+		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
+		break;
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+	case PMD_ORDER:
+		ret = vmf_insert_pfn_pmd(vmf, pfn, false);
+		break;
+#endif
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+	case PUD_ORDER:
+		ret = vmf_insert_pfn_pud(vmf, pfn, false);
+		break;
+#endif
+	default:
+		ret = VM_FAULT_FALLBACK;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_vmf_insert_pfn);
+
 static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
@@ -1662,23 +1690,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
 		goto out_unlock;
 
-	switch (order) {
-	case 0:
-		ret = vmf_insert_pfn(vma, vmf->address, pfn);
-		break;
-#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
-	case PMD_ORDER:
-		ret = vmf_insert_pfn_pmd(vmf, pfn, false);
-		break;
-#endif
-#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
-	case PUD_ORDER:
-		ret = vmf_insert_pfn_pud(vmf, pfn, false);
-		break;
-#endif
-	default:
-		ret = VM_FAULT_FALLBACK;
-	}
+	ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
 
 out_unlock:
 	up_read(&vdev->memory_lock);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index f541044e42a2..970b3775505e 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,8 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
 		size_t count, loff_t *ppos);
+vm_fault_t vfio_pci_vmf_insert_pfn(struct vm_fault *vmf, unsigned long pfn,
+				   unsigned int order);
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
 void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
 int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
-- 
2.34.1



* [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
  2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
  2025-11-24 11:59 ` [PATCH v5 2/7] vfio: export function to map the VMA ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 15:32   ` Alex Williamson
                     ` (2 more replies)
  2025-11-24 11:59 ` [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's Grace-based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module can make use of the huge PFNMAP support added in mm [1].

To achieve this, the nvgrace-gpu module is updated to implement the
huge_fault op. The implementation establishes the mapping according
to the requested order. Note that if the PFN or the VMA address is
not aligned to the order, the mapping falls back to the PTE level.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]

cc: Alex Williamson <alex@shazbot.org>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Vikram Sethi <vsethi@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 43 +++++++++++++++++++++++------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index f74f3d8e1ebe..c84c01954c9e 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -130,32 +130,58 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
-static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
+static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
+						  unsigned int order)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
 	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 	struct mem_region *memregion;
-	unsigned long pgoff, pfn;
+	unsigned long pgoff, pfn, addr;
 
 	memregion = nvgrace_gpu_memregion(index, nvdev);
 	if (!memregion)
 		return ret;
 
-	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
+	addr = vmf->address & ~((PAGE_SIZE << order) - 1);
+	pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
 	pfn = PHYS_PFN(memregion->memphys) + pgoff;
 
+	if (order && (addr < vma->vm_start ||
+		      addr + (PAGE_SIZE << order) > vma->vm_end ||
+		      pfn & ((1 << order) - 1)))
+		return VM_FAULT_FALLBACK;
+
 	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
-		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
+		ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
 
 	return ret;
 }
 
+static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
+{
+	return nvgrace_gpu_vfio_pci_huge_fault(vmf, 0);
+}
+
 static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
 	.fault = nvgrace_gpu_vfio_pci_fault,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.huge_fault = nvgrace_gpu_vfio_pci_huge_fault,
+#endif
 };
 
+static size_t nvgrace_gpu_aligned_devmem_size(size_t memlength)
+{
+#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
+	return ALIGN(memlength, PMD_SIZE);
+#endif
+#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
+	return ALIGN(memlength, PUD_SIZE);
+#endif
+	return memlength;
+}
+
 static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 			    struct vm_area_struct *vma)
 {
@@ -185,10 +211,10 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
 		return -EOVERFLOW;
 
 	/*
-	 * Check that the mapping request does not go beyond available device
-	 * memory size
+	 * Check that the mapping request does not go beyond the exposed
+	 * device memory size.
 	 */
-	if (end > memregion->memlength)
+	if (end > nvgrace_gpu_aligned_devmem_size(memregion->memlength))
 		return -EINVAL;
 
 	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
@@ -258,7 +284,8 @@ nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
 
 	sparse->nr_areas = 1;
 	sparse->areas[0].offset = 0;
-	sparse->areas[0].size = memregion->memlength;
+	sparse->areas[0].size =
+		nvgrace_gpu_aligned_devmem_size(memregion->memlength);
 	sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
 	sparse->header.version = 1;
 
-- 
2.34.1



* [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
                   ` (2 preceding siblings ...)
  2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 18:10   ` Shameer Kolothum
  2025-11-24 11:59 ` [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

Remove code duplication in vfio_pci_core_mmap by calling
vfio_pci_core_setup_barmap to perform the BAR mapping.

cc: Donald Dutile <ddutile@redhat.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ede410e0ae1c..3d21f6a8c279 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1761,18 +1761,9 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	 * Even though we don't make use of the barmap for the mmap,
 	 * we need to request the region and the barmap tracks that.
 	 */
-	if (!vdev->barmap[index]) {
-		ret = pci_request_selected_regions(pdev,
-						   1 << index, "vfio-pci");
-		if (ret)
-			return ret;
-
-		vdev->barmap[index] = pci_iomap(pdev, index, 0);
-		if (!vdev->barmap[index]) {
-			pci_release_selected_regions(pdev, 1 << index);
-			return -ENOMEM;
-		}
-	}
+	ret = vfio_pci_core_setup_barmap(vdev, index);
+	if (ret)
+		return ret;
 
 	vma->vm_private_data = vdev;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
-- 
2.34.1



* [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
                   ` (3 preceding siblings ...)
  2025-11-24 11:59 ` [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 15:32   ` Alex Williamson
  2025-11-24 18:22   ` Shameer Kolothum
  2025-11-24 11:59 ` [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
  2025-11-24 11:59 ` [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
  6 siblings, 2 replies; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

Split the function that checks for the GPU device being ready at
probe.

Move the code that waits for the GPU to become ready through BAR0
register reads into a separate function so it can be reused.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 33 ++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index c84c01954c9e..3e45b8bd1a89 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -130,6 +130,24 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
 	vfio_pci_core_close_device(core_vdev);
 }
 
+static int nvgrace_gpu_wait_device_ready(void __iomem *io)
+{
+	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
+	int ret = -ETIME;
+
+	do {
+		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
+		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
+			ret = 0;
+			goto ready_check_exit;
+		}
+		msleep(POLL_QUANTUM_MS);
+	} while (!time_after(jiffies, timeout));
+
+ready_check_exit:
+	return ret;
+}
+
 static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
 						  unsigned int order)
 {
@@ -930,9 +948,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
  * Ensure that the BAR0 region is enabled before accessing the
  * registers.
  */
-static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
+static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
 {
-	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
 	void __iomem *io;
 	int ret = -ETIME;
 
@@ -950,16 +967,8 @@ static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
 		goto iomap_exit;
 	}
 
-	do {
-		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
-		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
-			ret = 0;
-			goto reg_check_exit;
-		}
-		msleep(POLL_QUANTUM_MS);
-	} while (!time_after(jiffies, timeout));
+	ret = nvgrace_gpu_wait_device_ready(io);
 
-reg_check_exit:
 	pci_iounmap(pdev, io);
 iomap_exit:
 	pci_release_selected_regions(pdev, 1 << 0);
@@ -976,7 +985,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
 	u64 memphys, memlength;
 	int ret;
 
-	ret = nvgrace_gpu_wait_device_ready(pdev);
+	ret = nvgrace_gpu_probe_check_device_ready(pdev);
 	if (ret)
 		return ret;
 
-- 
2.34.1



* [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
                   ` (4 preceding siblings ...)
  2025-11-24 11:59 ` [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 18:16   ` Jason Gunthorpe
  2025-11-24 11:59 ` [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
  6 siblings, 1 reply; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

Introduce a new flag, reset_done, to indicate that the GPU has just
been reset and the mappings to the GPU memory have been zapped.

Implement the reset_done handler to set this new flag. Later
patches use it to wait for the GPU memory to be ready before doing
any mapping or access.

Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 3e45b8bd1a89..bef9f25bf8f3 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -58,6 +58,8 @@ struct nvgrace_gpu_pci_core_device {
 	/* Lock to control device memory kernel mapping */
 	struct mutex remap_lock;
 	bool has_mig_hw_bug;
+	/* GPU has just been reset */
+	bool reset_done;
 };
 
 static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
@@ -1048,12 +1050,29 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
 
 MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
 
+static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+	struct nvgrace_gpu_pci_core_device *nvdev =
+		container_of(core_device, struct nvgrace_gpu_pci_core_device,
+			     core_device);
+
+	lockdep_assert_held_write(&core_device->memory_lock);
+
+	nvdev->reset_done = true;
+}
+
+static const struct pci_error_handlers nvgrace_gpu_vfio_pci_err_handlers = {
+	.reset_done = nvgrace_gpu_vfio_pci_reset_done,
+	.error_detected = vfio_pci_core_aer_err_detected,
+};
+
 static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
 	.name = KBUILD_MODNAME,
 	.id_table = nvgrace_gpu_vfio_pci_table,
 	.probe = nvgrace_gpu_probe,
 	.remove = nvgrace_gpu_remove,
-	.err_handler = &vfio_pci_core_err_handlers,
+	.err_handler = &nvgrace_gpu_vfio_pci_err_handlers,
 	.driver_managed_dma = true,
 };
 
-- 
2.34.1



* [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready
  2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
                   ` (5 preceding siblings ...)
  2025-11-24 11:59 ` [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
@ 2025-11-24 11:59 ` ankita
  2025-11-24 18:41   ` Shameer Kolothum
  6 siblings, 1 reply; 24+ messages in thread
From: ankita @ 2025-11-24 11:59 UTC (permalink / raw)
  To: ankita, jgg, yishaih, skolothumtho, kevin.tian, alex, aniketa,
	vsethi, mochs
  Cc: Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

From: Ankit Agrawal <ankita@nvidia.com>

Speculative CPU prefetches to GPU memory before the GPU is ready
after reset can cause harmless corrected RAS events to be logged
on Grace systems. The mapping should therefore not be
re-established until the GPU is ready post reset.

GPU readiness can be checked through BAR0 registers, similar to
the check done at device probe.

It can take several seconds for the GPU to become ready, so it is
desirable that this wait overlap as much of the VM startup as
possible to reduce the impact on VM boot time. The GPU readiness
state is thus checked on the first fault/huge_fault request or
read/write access, which amortizes the readiness wait.

The first fault and read/write access check the GPU state when the
reset_done flag - which denotes that the GPU has just been reset -
is set. The memory_lock is held across the map/access to avoid
races with GPU reset.

cc: Alex Williamson <alex@shazbot.org>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 79 ++++++++++++++++++++++++++---
 1 file changed, 72 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index bef9f25bf8f3..fbc19fe688ca 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -104,6 +104,17 @@ static int nvgrace_gpu_open_device(struct vfio_device *core_vdev)
 		mutex_init(&nvdev->remap_lock);
 	}
 
+	/*
+	 * GPU readiness is checked by reading the BAR0 registers.
+	 *
+	 * ioremap BAR0 to ensure that the BAR0 mapping is present before
+	 * register reads on first fault before establishing any GPU
+	 * memory mapping.
+	 */
+	ret = vfio_pci_core_setup_barmap(vdev, 0);
+	if (ret)
+		return ret;
+
 	vfio_pci_core_finish_enable(vdev);
 
 	return 0;
@@ -150,6 +161,26 @@ static int nvgrace_gpu_wait_device_ready(void __iomem *io)
 	return ret;
 }
 
+static int
+nvgrace_gpu_check_device_ready(struct nvgrace_gpu_pci_core_device *nvdev)
+{
+	struct vfio_pci_core_device *vdev = &nvdev->core_device;
+	int ret;
+
+	lockdep_assert_held_read(&vdev->memory_lock);
+
+	if (!nvdev->reset_done)
+		return 0;
+
+	ret = nvgrace_gpu_wait_device_ready(vdev->barmap[0]);
+	if (ret)
+		return ret;
+
+	nvdev->reset_done = false;
+
+	return 0;
+}
+
 static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
 						  unsigned int order)
 {
@@ -173,8 +204,18 @@ static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
 		      pfn & ((1 << order) - 1)))
 		return VM_FAULT_FALLBACK;
 
-	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
+	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
+		/*
+		 * If the GPU memory is accessed by the CPU while the GPU is
+		 * not ready after reset, it can cause harmless corrected RAS
+		 * events to be logged. Make sure the GPU is ready before
+		 * establishing the mappings.
+		 */
+		if (nvgrace_gpu_check_device_ready(nvdev))
+			return ret;
+
 		ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
+	}
 
 	return ret;
 }
@@ -593,9 +634,21 @@ nvgrace_gpu_read_mem(struct nvgrace_gpu_pci_core_device *nvdev,
 	else
 		mem_count = min(count, memregion->memlength - (size_t)offset);
 
-	ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
-	if (ret)
-		return ret;
+	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
+		/*
+		 * If the GPU memory is accessed by the CPU while the GPU is
+		 * not ready after reset, it can cause harmless corrected RAS
+		 * events to be logged. Make sure the GPU is ready before
+		 * establishing the mappings.
+		 */
+		ret = nvgrace_gpu_check_device_ready(nvdev);
+		if (ret)
+			return ret;
+
+		ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
+		if (ret)
+			return ret;
+	}
 
 	/*
 	 * Only the device memory present on the hardware is mapped, which may
@@ -713,9 +766,21 @@ nvgrace_gpu_write_mem(struct nvgrace_gpu_pci_core_device *nvdev,
 	 */
 	mem_count = min(count, memregion->memlength - (size_t)offset);
 
-	ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
-	if (ret)
-		return ret;
+	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
+		/*
+		 * If the GPU memory is accessed by the CPU while the GPU is
+		 * not ready after reset, it can cause harmless corrected RAS
+		 * events to be logged. Make sure the GPU is ready before
+		 * establishing the mappings.
+		 */
+		ret = nvgrace_gpu_check_device_ready(nvdev);
+		if (ret)
+			return ret;
+
+		ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
+		if (ret)
+			return ret;
+	}
 
 exitfn:
 	*ppos += count;
-- 
2.34.1



* Re: [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
  2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
@ 2025-11-24 15:32   ` Alex Williamson
  2025-11-24 15:40     ` Ankit Agrawal
  2025-11-24 18:08   ` Shameer Kolothum
  2025-11-24 18:15   ` Jason Gunthorpe
  2 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2025-11-24 15:32 UTC (permalink / raw)
  To: ankita
  Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
	Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

On Mon, 24 Nov 2025 11:59:22 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's Grace based systems have large device memory. The device
> memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
> module could make use of the huge PFNMAP support added in mm [1].
> 
> To achieve this, the nvgrace-gpu module is updated to implement the
> huge_fault ops. The implementation establishes the mapping according to
> the requested order. Note that if the PFN or the VMA address is
> unaligned to the order, the mapping falls back to the PTE level.
> 
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
> 
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> cc: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 43 +++++++++++++++++++++++------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index f74f3d8e1ebe..c84c01954c9e 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,32 +130,58 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
>  
> -static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
> +						  unsigned int order)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
>  	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
>  	vm_fault_t ret = VM_FAULT_SIGBUS;
>  	struct mem_region *memregion;
> -	unsigned long pgoff, pfn;
> +	unsigned long pgoff, pfn, addr;
>  
>  	memregion = nvgrace_gpu_memregion(index, nvdev);
>  	if (!memregion)
>  		return ret;
>  
> -	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> +	pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>  	pfn = PHYS_PFN(memregion->memphys) + pgoff;
>  
> +	if (order && (addr < vma->vm_start ||
> +		      addr + (PAGE_SIZE << order) > vma->vm_end ||
> +		      pfn & ((1 << order) - 1)))
> +		return VM_FAULT_FALLBACK;
> +
>  	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
> -		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> +		ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
>  
>  	return ret;
>  }


It may be worth considering replicating the dev_dbg from
vfio_pci_mmap_huge_fault(), it's been very useful in validating that
we're getting the huge PFNMAPs we expect.  Thanks,

Alex

>  
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	return nvgrace_gpu_vfio_pci_huge_fault(vmf, 0);
> +}
> +
>  static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops = {
>  	.fault = nvgrace_gpu_vfio_pci_fault,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +	.huge_fault = nvgrace_gpu_vfio_pci_huge_fault,
> +#endif
>  };
>  
> +static size_t nvgrace_gpu_aligned_devmem_size(size_t memlength)
> +{
> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> +	return ALIGN(memlength, PMD_SIZE);
> +#endif
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +	return ALIGN(memlength, PUD_SIZE);
> +#endif
> +	return memlength;
> +}
> +
>  static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
>  			    struct vm_area_struct *vma)
>  {
> @@ -185,10 +211,10 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
>  		return -EOVERFLOW;
>  
>  	/*
> -	 * Check that the mapping request does not go beyond available device
> -	 * memory size
> +	 * Check that the mapping request does not go beyond the exposed
> +	 * device memory size.
>  	 */
> -	if (end > memregion->memlength)
> +	if (end > nvgrace_gpu_aligned_devmem_size(memregion->memlength))
>  		return -EINVAL;
>  
>  	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> @@ -258,7 +284,8 @@ nvgrace_gpu_ioctl_get_region_info(struct vfio_device *core_vdev,
>  
>  	sparse->nr_areas = 1;
>  	sparse->areas[0].offset = 0;
> -	sparse->areas[0].size = memregion->memlength;
> +	sparse->areas[0].size =
> +		nvgrace_gpu_aligned_devmem_size(memregion->memlength);
>  	sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
>  	sparse->header.version = 1;
>  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
  2025-11-24 11:59 ` [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
@ 2025-11-24 15:32   ` Alex Williamson
  2025-11-24 15:39     ` Ankit Agrawal
  2025-11-24 18:22   ` Shameer Kolothum
  1 sibling, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2025-11-24 15:32 UTC (permalink / raw)
  To: ankita
  Cc: jgg, yishaih, skolothumtho, kevin.tian, aniketa, vsethi, mochs,
	Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

On Mon, 24 Nov 2025 11:59:24 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Split out the function that checks for the GPU device being ready at
> probe time.
> 
> Move the code to wait for the GPU to be ready through BAR0 register
> reads to a separate function. This would help reuse the code.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 33 ++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index c84c01954c9e..3e45b8bd1a89 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,24 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
>  
> +static int nvgrace_gpu_wait_device_ready(void __iomem *io)
> +{
> +	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> +	int ret = -ETIME;
> +
> +	do {
> +		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> +		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> +			ret = 0;
> +			goto ready_check_exit;
> +		}
> +		msleep(POLL_QUANTUM_MS);
> +	} while (!time_after(jiffies, timeout));
> +
> +ready_check_exit:
> +	return ret;
> +}
> +
>  static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
>  						  unsigned int order)
>  {
> @@ -930,9 +948,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
>   * Ensure that the BAR0 region is enabled before accessing the
>   * registers.
>   */
> -static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> +static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
>  {
> -	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
>  	void __iomem *io;
>  	int ret = -ETIME;
>  
> @@ -950,16 +967,8 @@ static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
>  		goto iomap_exit;
>  	}
>  
> -	do {
> -		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> -		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> -			ret = 0;
> -			goto reg_check_exit;
> -		}
> -		msleep(POLL_QUANTUM_MS);
> -	} while (!time_after(jiffies, timeout));
> +	ret = nvgrace_gpu_wait_device_ready(io);

I think you're inadvertently fixing a bug here too.  The ret
initialization to -ETIME is immediately clobbered by
pci_enable_device(), so exceeding the timeout would never generate an
error.  Now it will:

Fixes: d85f69d520e6 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")

Also we should remove the ret initialization.  Otherwise the series
LGTM.  Thanks,

Alex

>  
> -reg_check_exit:
>  	pci_iounmap(pdev, io);
>  iomap_exit:
>  	pci_release_selected_regions(pdev, 1 << 0);
> @@ -976,7 +985,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  	u64 memphys, memlength;
>  	int ret;
>  
> -	ret = nvgrace_gpu_wait_device_ready(pdev);
> +	ret = nvgrace_gpu_probe_check_device_ready(pdev);
>  	if (ret)
>  		return ret;
>  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
  2025-11-24 15:32   ` Alex Williamson
@ 2025-11-24 15:39     ` Ankit Agrawal
  0 siblings, 0 replies; 24+ messages in thread
From: Ankit Agrawal @ 2025-11-24 15:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg@ziepe.ca, Yishai Hadas, Shameer Kolothum,
	kevin.tian@intel.com, Aniket Agashe, Vikram Sethi, Matt Ochs,
	Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju

>> -     do {
>> -             if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
>> -                 (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
>> -                     ret = 0;
>> -                     goto reg_check_exit;
>> -             }
>> -             msleep(POLL_QUANTUM_MS);
>> -     } while (!time_after(jiffies, timeout));
>> +     ret = nvgrace_gpu_wait_device_ready(io);
>
> I think you're inadvertently fixing a bug here too.  The ret
> initialization to -ETIME is immediately clobbered by
> pci_enable_device(), so exceeding the timeout would never generate an
> error.  Now it will:
>
> Fixes: d85f69d520e6 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")

Oh yeah. Will add this line in the commit message.

> Also we should remove the ret initialization.  Otherwise the series
> LGTM.  Thanks,
>
> Alex

Thank you very much for the review, Alex!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
  2025-11-24 15:32   ` Alex Williamson
@ 2025-11-24 15:40     ` Ankit Agrawal
  0 siblings, 0 replies; 24+ messages in thread
From: Ankit Agrawal @ 2025-11-24 15:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg@ziepe.ca, Yishai Hadas, Shameer Kolothum,
	kevin.tian@intel.com, Aniket Agashe, Vikram Sethi, Matt Ochs,
	Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju

>>       scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
>> -             ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
>> +             ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
>>
>>       return ret;
>>  }
>
> It may be worth considering replicating the dev_dbg from
> vfio_pci_mmap_huge_fault(), it's been very useful in validating that
> we're getting the huge PFNMAPs we expect.  Thanks,

Sure, will add the log message. Thanks!

>
> Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory
  2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
@ 2025-11-24 17:09   ` Shameer Kolothum
  2025-11-24 17:16     ` Jason Gunthorpe
  2025-11-24 17:16   ` Jason Gunthorpe
  1 sibling, 1 reply; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 17:09 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> To: Ankit Agrawal <ankita@nvidia.com>; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>;
> kevin.tian@intel.com; alex@shazbot.org; Aniket Agashe
> <aniketa@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>
> Cc: Yunxiang.Li@amd.com; yi.l.liu@intel.com;
> zhangdongdong@eswincomputing.com; Avihai Horon <avihaih@nvidia.com>;
> bhelgaas@google.com; peterx@redhat.com; pstanner@redhat.com; Alistair
> Popple <apopple@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Dan Williams <danw@nvidia.com>; Dheeraj
> Nigam <dnigam@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> Subject: [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> To make use of the huge pfnmap support and to support the zap/remap
> sequence, a fault/huge_fault ops based mapping mechanism needs to be
> implemented.
> 
> Currently the nvgrace-gpu module relies on remap_pfn_range to do the
> mapping during VM bootup. Replace it to instead rely on fault handling
> and use vmf_insert_pfn to set up the mapping.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 50 +++++++++++++++++------------
>  1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index e346392b72f6..f74f3d8e1ebe 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,32 @@ static void nvgrace_gpu_close_device(struct
> vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
> 
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
> +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT -
> PAGE_SHIFT);
> +	vm_fault_t ret = VM_FAULT_SIGBUS;
> +	struct mem_region *memregion;
> +	unsigned long pgoff, pfn;
> +
> +	memregion = nvgrace_gpu_memregion(index, nvdev);
> +	if (!memregion)
> +		return ret;
> +
> +	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	pfn = PHYS_PFN(memregion->memphys) + pgoff;

The core fault code seems to calculate the BAR offset in vma_to_pfn()
which is missing here.

pgoff = vma->vm_pgoff &
                ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

Is the assumption here that user space will always map at BAR offset 0?

> +
> +	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
> +		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> +
> +	return ret;

Could do return vmf_insert_pfn(vmf->vma, vmf->address, pfn); if
you don't need it later.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory
  2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
  2025-11-24 17:09   ` Shameer Kolothum
@ 2025-11-24 17:16   ` Jason Gunthorpe
  1 sibling, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2025-11-24 17:16 UTC (permalink / raw)
  To: ankita
  Cc: yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi, mochs,
	Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

On Mon, Nov 24, 2025 at 11:59:20AM +0000, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> To make use of the huge pfnmap support and to support the zap/remap
> sequence, a fault/huge_fault ops based mapping mechanism needs to be
> implemented.
> 
> Currently the nvgrace-gpu module relies on remap_pfn_range to do the
> mapping during VM bootup. Replace it to instead rely on fault handling
> and use vmf_insert_pfn to set up the mapping.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 50 +++++++++++++++++------------
>  1 file changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index e346392b72f6..f74f3d8e1ebe 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,32 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
>  
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
> +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

This should not be a signed value. I think the right type is unsigned long.

> +	vm_fault_t ret = VM_FAULT_SIGBUS;
> +	struct mem_region *memregion;
> +	unsigned long pgoff, pfn;
> +
> +	memregion = nvgrace_gpu_memregion(index, nvdev);
> +	if (!memregion)
> +		return ret;
> +
> +	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	pfn = PHYS_PFN(memregion->memphys) + pgoff;
> +
> +	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
> +		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);

This needs to check for this:

        if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
                goto out_unlock;

So I think your series is out of order.

Move patch 2 before this one, add the above lines to
vfio_pci_vmf_insert_pfn() as well as a lockdep to check the
memory_lock

Then just call vfio_pci_vmf_insert_pfn() here and consider squashing
patch 3.

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory
  2025-11-24 17:09   ` Shameer Kolothum
@ 2025-11-24 17:16     ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2025-11-24 17:16 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: Ankit Agrawal, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs,
	Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju

On Mon, Nov 24, 2025 at 05:09:30PM +0000, Shameer Kolothum wrote:
> > +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
> > +	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT -
> > PAGE_SHIFT);
> > +	vm_fault_t ret = VM_FAULT_SIGBUS;
> > +	struct mem_region *memregion;
> > +	unsigned long pgoff, pfn;
> > +
> > +	memregion = nvgrace_gpu_memregion(index, nvdev);
> > +	if (!memregion)
> > +		return ret;
> > +
> > +	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> > +	pfn = PHYS_PFN(memregion->memphys) + pgoff;
> 
> The core fault code seems to calculate the BAR offset in vma_to_pfn()
> which is missing here.
> 
> pgoff = vma->vm_pgoff &
>                 ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);

Yes, that should be included for this reason:

> Is the assumption here is user space will always map at BAR offset 0?

It should not be assumed.

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v5 2/7] vfio: export function to map the VMA
  2025-11-24 11:59 ` [PATCH v5 2/7] vfio: export function to map the VMA ankita
@ 2025-11-24 17:22   ` Shameer Kolothum
  0 siblings, 0 replies; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 17:22 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> To: Ankit Agrawal <ankita@nvidia.com>; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>;
> kevin.tian@intel.com; alex@shazbot.org; Aniket Agashe
> <aniketa@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>
> Cc: Yunxiang.Li@amd.com; yi.l.liu@intel.com;
> zhangdongdong@eswincomputing.com; Avihai Horon <avihaih@nvidia.com>;
> bhelgaas@google.com; peterx@redhat.com; pstanner@redhat.com; Alistair
> Popple <apopple@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Dan Williams <danw@nvidia.com>; Dheeraj
> Nigam <dnigam@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> Subject: [PATCH v5 2/7] vfio: export function to map the VMA
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Factor out the implementation that maps the VMA at the PTE/PMD/PUD
> level into a separate function.
> 
> Export the function to be used by nvgrace-gpu module.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 46 ++++++++++++++++++++------------
>  include/linux/vfio_pci_core.h    |  2 ++
>  2 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..ede410e0ae1c 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1640,6 +1640,34 @@ static unsigned long vma_to_pfn(struct
> vm_area_struct *vma)
>  	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) +
> pgoff;
>  }
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vm_fault *vmf,
> +				   unsigned long pfn,
> +				   unsigned int order)

Does this need to be called with vdev->memory_lock held?
Please add a comment if so.

> +{
> +	vm_fault_t ret;
> +
> +	switch (order) {
> +	case 0:
> +		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> +		break;
> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> +	case PMD_ORDER:
> +		ret = vmf_insert_pfn_pmd(vmf, pfn, false);
> +		break;
> +#endif
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +	case PUD_ORDER:
> +		ret = vmf_insert_pfn_pud(vmf, pfn, false);
> +		break;
> +#endif
> +	default:
> +		ret = VM_FAULT_FALLBACK;
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_pci_vmf_insert_pfn);
> +
>  static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
>  					   unsigned int order)
>  {
> @@ -1662,23 +1690,7 @@ static vm_fault_t
> vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
>  	if (vdev->pm_runtime_engaged ||
> !__vfio_pci_memory_enabled(vdev))
>  		goto out_unlock;

Is the above check not required for the common helper?

> 
> -	switch (order) {
> -	case 0:
> -		ret = vmf_insert_pfn(vma, vmf->address, pfn);
> -		break;
> -#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> -	case PMD_ORDER:
> -		ret = vmf_insert_pfn_pmd(vmf, pfn, false);
> -		break;
> -#endif
> -#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> -	case PUD_ORDER:
> -		ret = vmf_insert_pfn_pud(vmf, pfn, false);
> -		break;
> -#endif
> -	default:
> -		ret = VM_FAULT_FALLBACK;
> -	}
> +	ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
> 
>  out_unlock:
>  	up_read(&vdev->memory_lock);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index f541044e42a2..970b3775505e 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -119,6 +119,8 @@ ssize_t vfio_pci_core_read(struct vfio_device
> *core_vdev, char __user *buf,
>  		size_t count, loff_t *ppos);
>  ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user
> *buf,
>  		size_t count, loff_t *ppos);
> +vm_fault_t vfio_pci_vmf_insert_pfn(struct vm_fault *vmf, unsigned long pfn,
> +				   unsigned int order);
>  int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct
> *vma);
>  void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int
> count);
>  int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);

Otherwise LGTM:
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
  2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
  2025-11-24 15:32   ` Alex Williamson
@ 2025-11-24 18:08   ` Shameer Kolothum
  2025-11-24 18:15   ` Jason Gunthorpe
  2 siblings, 0 replies; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 18:08 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> To: Ankit Agrawal <ankita@nvidia.com>; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>;
> kevin.tian@intel.com; alex@shazbot.org; Aniket Agashe
> <aniketa@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>
> Cc: Yunxiang.Li@amd.com; yi.l.liu@intel.com;
> zhangdongdong@eswincomputing.com; Avihai Horon <avihaih@nvidia.com>;
> bhelgaas@google.com; peterx@redhat.com; pstanner@redhat.com; Alistair
> Popple <apopple@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Dan Williams <danw@nvidia.com>; Dheeraj
> Nigam <dnigam@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> Subject: [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's Grace based systems have large device memory. The device
> memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
> module could make use of the huge PFNMAP support added in mm [1].
> 
> To achieve this, the nvgrace-gpu module is updated to implement the
> huge_fault ops. The implementation establishes the mapping according to
> the requested order. Note that if the PFN or the VMA address is
> unaligned to the order, the mapping falls back to the PTE level.
> 
> Link: https://lore.kernel.org/all/20240826204353.2228736-1-
> peterx@redhat.com/ [1]
> 
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> cc: Vikram Sethi <vsethi@nvidia.com>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 43 +++++++++++++++++++++++------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-
> gpu/main.c
> index f74f3d8e1ebe..c84c01954c9e 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,32 +130,58 @@ static void nvgrace_gpu_close_device(struct
> vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
> 
> -static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
> +						  unsigned int order)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct nvgrace_gpu_pci_core_device *nvdev = vma->vm_private_data;
>  	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT -
> PAGE_SHIFT);
>  	vm_fault_t ret = VM_FAULT_SIGBUS;
>  	struct mem_region *memregion;
> -	unsigned long pgoff, pfn;
> +	unsigned long pgoff, pfn, addr;
> 
>  	memregion = nvgrace_gpu_memregion(index, nvdev);
>  	if (!memregion)
>  		return ret;
> 
> -	pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	addr = vmf->address & ~((PAGE_SIZE << order) - 1);
> +	pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
>  	pfn = PHYS_PFN(memregion->memphys) + pgoff;
> 
> +	if (order && (addr < vma->vm_start ||
> +		      addr + (PAGE_SIZE << order) > vma->vm_end ||
> +		      pfn & ((1 << order) - 1)))
> +		return VM_FAULT_FALLBACK;
> +
>  	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
> -		ret = vmf_insert_pfn(vmf->vma, vmf->address, pfn);
> +		ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
> 
>  	return ret;
>  }
> 
> +static vm_fault_t nvgrace_gpu_vfio_pci_fault(struct vm_fault *vmf)
> +{
> +	return nvgrace_gpu_vfio_pci_huge_fault(vmf, 0);
> +}
> +
>  static const struct vm_operations_struct nvgrace_gpu_vfio_pci_mmap_ops =
> {
>  	.fault = nvgrace_gpu_vfio_pci_fault,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +	.huge_fault = nvgrace_gpu_vfio_pci_huge_fault,
> +#endif
>  };
> 
> +static size_t nvgrace_gpu_aligned_devmem_size(size_t memlength)
> +{
> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> +	return ALIGN(memlength, PMD_SIZE);
> +#endif
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +	return ALIGN(memlength, PUD_SIZE);
> +#endif

I think all this should be ALIGN_DOWN to be safe.

Thanks,
Shameer

> +	return memlength;
> +}
> +
>  static int nvgrace_gpu_mmap(struct vfio_device *core_vdev,
>  			    struct vm_area_struct *vma)
>  {
> @@ -185,10 +211,10 @@ static int nvgrace_gpu_mmap(struct vfio_device
> *core_vdev,
>  		return -EOVERFLOW;
> 
>  	/*
> -	 * Check that the mapping request does not go beyond available
> device
> -	 * memory size
> +	 * Check that the mapping request does not go beyond the exposed
> +	 * device memory size.
>  	 */
> -	if (end > memregion->memlength)
> +	if (end > nvgrace_gpu_aligned_devmem_size(memregion-
> >memlength))
>  		return -EINVAL;
> 
>  	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND |
> VM_DONTDUMP);
> @@ -258,7 +284,8 @@ nvgrace_gpu_ioctl_get_region_info(struct vfio_device
> *core_vdev,
> 
>  	sparse->nr_areas = 1;
>  	sparse->areas[0].offset = 0;
> -	sparse->areas[0].size = memregion->memlength;
> +	sparse->areas[0].size =
> +		nvgrace_gpu_aligned_devmem_size(memregion-
> >memlength);
>  	sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
>  	sparse->header.version = 1;
> 
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap
  2025-11-24 11:59 ` [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
@ 2025-11-24 18:10   ` Shameer Kolothum
  0 siblings, 0 replies; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 18:10 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> To: Ankit Agrawal <ankita@nvidia.com>; jgg@ziepe.ca; Yishai Hadas
> <yishaih@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>;
> kevin.tian@intel.com; alex@shazbot.org; Aniket Agashe
> <aniketa@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>
> Cc: Yunxiang.Li@amd.com; yi.l.liu@intel.com;
> zhangdongdong@eswincomputing.com; Avihai Horon <avihaih@nvidia.com>;
> bhelgaas@google.com; peterx@redhat.com; pstanner@redhat.com; Alistair
> Popple <apopple@nvidia.com>; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Zhi Wang <zhiw@nvidia.com>; Dan Williams <danw@nvidia.com>; Dheeraj
> Nigam <dnigam@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> Subject: [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Remove code duplication in vfio_pci_core_mmap by calling
> vfio_pci_core_setup_barmap to perform the bar mapping.
> 
> cc: Donald Dutile <ddutile@redhat.com>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

LGTM:
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>



* Re: [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap
  2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
  2025-11-24 15:32   ` Alex Williamson
  2025-11-24 18:08   ` Shameer Kolothum
@ 2025-11-24 18:15   ` Jason Gunthorpe
  2 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2025-11-24 18:15 UTC (permalink / raw)
  To: ankita
  Cc: yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi, mochs,
	Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

On Mon, Nov 24, 2025 at 11:59:22AM +0000, ankita@nvidia.com wrote:
> +static size_t nvgrace_gpu_aligned_devmem_size(size_t memlength)
> +{
> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> +	return ALIGN(memlength, PMD_SIZE);
> +#endif
> +#ifdef CONFIG_ARCH_SUPPORTS_PUD_PFNMAP
> +	return ALIGN(memlength, PUD_SIZE);
> +#endif
> +	return memlength;
> +}

This needs a comment explaining why it is OK to round the size up. Seems
really surprising (and wrong!). The commit message should explain it too.

Jason


* Re: [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset
  2025-11-24 11:59 ` [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
@ 2025-11-24 18:16   ` Jason Gunthorpe
  2025-11-25  2:52     ` Ankit Agrawal
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2025-11-24 18:16 UTC (permalink / raw)
  To: ankita
  Cc: yishaih, skolothumtho, kevin.tian, alex, aniketa, vsethi, mochs,
	Yunxiang.Li, yi.l.liu, zhangdongdong, avihaih, bhelgaas, peterx,
	pstanner, apopple, kvm, linux-kernel, cjia, kwankhede, targupta,
	zhiw, danw, dnigam, kjaju

On Mon, Nov 24, 2025 at 11:59:25AM +0000, ankita@nvidia.com wrote:
> @@ -1048,12 +1050,29 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
>  
>  MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
>  
> +static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +	struct nvgrace_gpu_pci_core_device *nvdev =
> +		container_of(core_device, struct nvgrace_gpu_pci_core_device,
> +			     core_device);
> +
> +	lockdep_assert_held_write(&core_device->memory_lock);
> +
> +	nvdev->reset_done = true;
> +}

Seems surprising/wrong, at least it needs a comment. What on earth is
ensuring that lockdep assertion within a PCI callback??

>  static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
>  	.name = KBUILD_MODNAME,
>  	.id_table = nvgrace_gpu_vfio_pci_table,
>  	.probe = nvgrace_gpu_probe,
>  	.remove = nvgrace_gpu_remove,
> -	.err_handler = &vfio_pci_core_err_handlers,
> +	.err_handler = &nvgrace_gpu_vfio_pci_err_handlers,

Jason


* RE: [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
  2025-11-24 11:59 ` [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
  2025-11-24 15:32   ` Alex Williamson
@ 2025-11-24 18:22   ` Shameer Kolothum
  2025-11-25  2:53     ` Ankit Agrawal
  1 sibling, 1 reply; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 18:22 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> Subject: [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Split the function that checks for the GPU device being ready at
> probe time.
> 
> Move the code to wait for the GPU to be ready through BAR0 register
> reads to a separate function. This would help reuse the code.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>

With a nit below:

>  drivers/vfio/pci/nvgrace-gpu/main.c | 33 ++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index c84c01954c9e..3e45b8bd1a89 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -130,6 +130,24 @@ static void nvgrace_gpu_close_device(struct vfio_device *core_vdev)
>  	vfio_pci_core_close_device(core_vdev);
>  }
> 
> +static int nvgrace_gpu_wait_device_ready(void __iomem *io)
> +{
> +	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> +	int ret = -ETIME;
> +
> +	do {
> +		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> +		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> +			ret = 0;
> +			goto ready_check_exit;

You could return directly here and avoid that goto.

Thanks,
Shameer
> +		}
> +		msleep(POLL_QUANTUM_MS);
> +	} while (!time_after(jiffies, timeout));
> +
> +ready_check_exit:
> +	return ret;
> +}
> +
>  static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
>  						  unsigned int order)
>  {
> @@ -930,9 +948,8 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
>   * Ensure that the BAR0 region is enabled before accessing the
>   * registers.
>   */
> -static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> +static int nvgrace_gpu_probe_check_device_ready(struct pci_dev *pdev)
>  {
> -	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
>  	void __iomem *io;
>  	int ret = -ETIME;
> 
> @@ -950,16 +967,8 @@ static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
>  		goto iomap_exit;
>  	}
> 
> -	do {
> -		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> -		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> -			ret = 0;
> -			goto reg_check_exit;
> -		}
> -		msleep(POLL_QUANTUM_MS);
> -	} while (!time_after(jiffies, timeout));
> +	ret = nvgrace_gpu_wait_device_ready(io);
> 
> -reg_check_exit:
>  	pci_iounmap(pdev, io);
>  iomap_exit:
>  	pci_release_selected_regions(pdev, 1 << 0);
> @@ -976,7 +985,7 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  	u64 memphys, memlength;
>  	int ret;
> 
> -	ret = nvgrace_gpu_wait_device_ready(pdev);
> +	ret = nvgrace_gpu_probe_check_device_ready(pdev);
>  	if (ret)
>  		return ret;
> 
> --
> 2.34.1



* RE: [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready
  2025-11-24 11:59 ` [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
@ 2025-11-24 18:41   ` Shameer Kolothum
  0 siblings, 0 replies; 24+ messages in thread
From: Shameer Kolothum @ 2025-11-24 18:41 UTC (permalink / raw)
  To: Ankit Agrawal, jgg@ziepe.ca, Yishai Hadas, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju



> -----Original Message-----
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: 24 November 2025 11:59
> Subject: [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Speculative prefetches from the CPU to GPU memory while the GPU is
> not yet ready after reset can cause harmless corrected RAS events
> to be logged on Grace systems. It is thus preferred that the
> mapping not be re-established until the GPU is ready post reset.
> 
> The GPU readiness can be checked through BAR0 registers similar
> to the checking at the time of device probe.
> 
> It can take several seconds for the GPU to be ready. So it is
> desirable that the time overlaps as much of the VM startup as
> possible to reduce impact on the VM bootup time. The GPU
> readiness state is thus checked on the first fault/huge_fault
> request or read/write access which amortizes the GPU readiness
> time.
> 
> The first fault and read/write access check the GPU state when the
> reset_done flag - which denotes that the GPU has just been reset -
> is set. The memory_lock is taken across map/access to avoid
> races with GPU reset.
> 
> cc: Alex Williamson <alex@shazbot.org>
> cc: Jason Gunthorpe <jgg@ziepe.ca>
> cc: Vikram Sethi <vsethi@nvidia.com>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 79 ++++++++++++++++++++++++++---
>  1 file changed, 72 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index bef9f25bf8f3..fbc19fe688ca 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -104,6 +104,17 @@ static int nvgrace_gpu_open_device(struct vfio_device *core_vdev)
>  		mutex_init(&nvdev->remap_lock);
>  	}
> 
> +	/*
> +	 * GPU readiness is checked by reading the BAR0 registers.
> +	 *
> +	 * ioremap BAR0 to ensure that the BAR0 mapping is present before
> +	 * register reads on first fault before establishing any GPU
> +	 * memory mapping.
> +	 */
> +	ret = vfio_pci_core_setup_barmap(vdev, 0);
> +	if (ret)
> +		return ret;
> +
>  	vfio_pci_core_finish_enable(vdev);
> 
>  	return 0;
> @@ -150,6 +161,26 @@ static int nvgrace_gpu_wait_device_ready(void __iomem *io)
>  	return ret;
>  }
> 
> +static int
> +nvgrace_gpu_check_device_ready(struct nvgrace_gpu_pci_core_device *nvdev)
> +{
> +	struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +	int ret;
> +
> +	lockdep_assert_held_read(&vdev->memory_lock);
> +
> +	if (!nvdev->reset_done)
> +		return 0;
> +
> +	ret = nvgrace_gpu_wait_device_ready(vdev->barmap[0]);
> +	if (ret)
> +		return ret;
> +
> +	nvdev->reset_done = false;
> +
> +	return 0;
> +}
> +
>  static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
>  						  unsigned int order)
>  {
> @@ -173,8 +204,18 @@ static vm_fault_t nvgrace_gpu_vfio_pci_huge_fault(struct vm_fault *vmf,
>  		      pfn & ((1 << order) - 1)))
>  		return VM_FAULT_FALLBACK;
> 
> -	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock)
> +	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
> +		/*
> +		 * If the GPU memory is accessed by the CPU while the GPU is
> +		 * not ready after reset, it can cause harmless corrected RAS
> +		 * events to be logged. Make sure the GPU is ready before
> +		 * establishing the mappings.
> +		 */
> +		if (nvgrace_gpu_check_device_ready(nvdev))
> +			return ret;
> +
>  		ret = vfio_pci_vmf_insert_pfn(vmf, pfn, order);
> +	}
> 
>  	return ret;
>  }
> @@ -593,9 +634,21 @@ nvgrace_gpu_read_mem(struct nvgrace_gpu_pci_core_device *nvdev,
>  	else
> 		mem_count = min(count, memregion->memlength - (size_t)offset);
> 
> -	ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
> -	if (ret)
> -		return ret;
> +	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
> +		/*
> +		 * If the GPU memory is accessed by the CPU while the GPU is
> +		 * not ready after reset, it can cause harmless corrected RAS
> +		 * events to be logged. Make sure the GPU is ready before
> +		 * establishing the mappings.
> +		 */
> +		ret = nvgrace_gpu_check_device_ready(nvdev);
> +		if (ret)
> +			return ret;
> +
> +		ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
> +		if (ret)
> +			return ret;
> +	}
> 
>  	/*
> 	 * Only the device memory present on the hardware is mapped, which may
> @@ -713,9 +766,21 @@ nvgrace_gpu_write_mem(struct nvgrace_gpu_pci_core_device *nvdev,
>  	 */
>  	mem_count = min(count, memregion->memlength - (size_t)offset);
> 
> -	ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
> -	if (ret)
> -		return ret;
> +	scoped_guard(rwsem_read, &nvdev->core_device.memory_lock) {
> +		/*
> +		 * If the GPU memory is accessed by the CPU while the GPU is
> +		 * not ready after reset, it can cause harmless corrected RAS
> +		 * events to be logged. Make sure the GPU is ready before
> +		 * establishing the mappings.
> +		 */

The comment above is now repeated 3 times. It would be good to consolidate
it into a single comment above nvgrace_gpu_check_device_ready().

Thanks,
Shameer

> +		ret = nvgrace_gpu_check_device_ready(nvdev);
> +		if (ret)
> +			return ret;
> +
> +		ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
> +		if (ret)
> +			return ret;
> +	}
> 
>  exitfn:
>  	*ppos += count;
> --
> 2.34.1



* Re: [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset
  2025-11-24 18:16   ` Jason Gunthorpe
@ 2025-11-25  2:52     ` Ankit Agrawal
  0 siblings, 0 replies; 24+ messages in thread
From: Ankit Agrawal @ 2025-11-25  2:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yishai Hadas, Shameer Kolothum, kevin.tian@intel.com,
	alex@shazbot.org, Aniket Agashe, Vikram Sethi, Matt Ochs,
	Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju

>> +static void nvgrace_gpu_vfio_pci_reset_done(struct pci_dev *pdev)
>> +{
>> +     struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
>> +     struct nvgrace_gpu_pci_core_device *nvdev =
>> +             container_of(core_device, struct nvgrace_gpu_pci_core_device,
>> +                          core_device);
>> +
>> +     lockdep_assert_held_write(&core_device->memory_lock);
>> +
>> +     nvdev->reset_done = true;
>> +}
>
> Seems surprising/wrong at least needs a comment. What on earth is
> ensuring that lockdep within a PCI callback??

Yeah, I noticed that even on vfio_pci_core_enable in the open, the lock
isn't taken. I'll remove it.


* Re: [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready
  2025-11-24 18:22   ` Shameer Kolothum
@ 2025-11-25  2:53     ` Ankit Agrawal
  0 siblings, 0 replies; 24+ messages in thread
From: Ankit Agrawal @ 2025-11-25  2:53 UTC (permalink / raw)
  To: Shameer Kolothum, jgg@ziepe.ca, Yishai Hadas,
	kevin.tian@intel.com, alex@shazbot.org, Aniket Agashe,
	Vikram Sethi, Matt Ochs
  Cc: Yunxiang.Li@amd.com, yi.l.liu@intel.com,
	zhangdongdong@eswincomputing.com, Avihai Horon,
	bhelgaas@google.com, peterx@redhat.com, pstanner@redhat.com,
	Alistair Popple, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Zhi Wang, Dan Williams, Dheeraj Nigam,
	Krishnakant Jaju

Thanks Shameer for the reviewed-by.

>> +     do {
>> +             if ((ioread32(io + C2C_LINK_BAR0_OFFSET) ==
>> STATUS_READY) &&
>> +                 (ioread32(io + HBM_TRAINING_BAR0_OFFSET) ==
>> STATUS_READY)) {
>> +                     ret = 0;
>> +                     goto ready_check_exit;
>
> You could return directly here and avoid that goto.

Yeah, I have a bad habit of overusing goto. I'll update it to return.


end of thread, other threads: [~2025-11-25  2:53 UTC | newest]

Thread overview: 24+ messages
2025-11-24 11:59 [PATCH v5 0/7] vfio/nvgrace-gpu: Support huge PFNMAP and wait for GPU ready post reset ankita
2025-11-24 11:59 ` [PATCH v5 1/7] vfio/nvgrace-gpu: Use faults to map device memory ankita
2025-11-24 17:09   ` Shameer Kolothum
2025-11-24 17:16     ` Jason Gunthorpe
2025-11-24 17:16   ` Jason Gunthorpe
2025-11-24 11:59 ` [PATCH v5 2/7] vfio: export function to map the VMA ankita
2025-11-24 17:22   ` Shameer Kolothum
2025-11-24 11:59 ` [PATCH v5 3/7] vfio/nvgrace-gpu: Add support for huge pfnmap ankita
2025-11-24 15:32   ` Alex Williamson
2025-11-24 15:40     ` Ankit Agrawal
2025-11-24 18:08   ` Shameer Kolothum
2025-11-24 18:15   ` Jason Gunthorpe
2025-11-24 11:59 ` [PATCH v5 4/7] vfio: use vfio_pci_core_setup_barmap to map bar in mmap ankita
2025-11-24 18:10   ` Shameer Kolothum
2025-11-24 11:59 ` [PATCH v5 5/7] vfio/nvgrace-gpu: split the code to wait for GPU ready ankita
2025-11-24 15:32   ` Alex Williamson
2025-11-24 15:39     ` Ankit Agrawal
2025-11-24 18:22   ` Shameer Kolothum
2025-11-25  2:53     ` Ankit Agrawal
2025-11-24 11:59 ` [PATCH v5 6/7] vfio/nvgrace-gpu: Inform devmem unmapped after reset ankita
2025-11-24 18:16   ` Jason Gunthorpe
2025-11-25  2:52     ` Ankit Agrawal
2025-11-24 11:59 ` [PATCH v5 7/7] vfio/nvgrace-gpu: wait for the GPU mem to be ready ankita
2025-11-24 18:41   ` Shameer Kolothum
