* [PATCH v4 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
@ 2025-01-17 23:37 ankita
2025-01-17 23:37 ` [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: ankita @ 2025-01-17 23:37 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip. It provides the CPU
and GPU cache coherent access to each other's memory over an internal
proprietary chip-to-chip (C2C) cache coherent interconnect.
The in-tree nvgrace-gpu driver manages the GH devices. The intention
is to extend that support to the new Grace Blackwell boards.
There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
feature [1] that necessitated carving out a 1G region from the device
memory and mapping it uncached. The 1G region is exposed as a fake
BAR (comprising regions 2 and 3) to work around the issue.
The GB systems differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].
This patch series accommodates those GB changes by exposing the real
physical device BAR1 (regions 2 and 3) to the VM instead of the fake
one. This takes care of both differences.
The presence of the fix for the HW defect is communicated by the
firmware through a DVSEC PCI config register. The module reads
this to take a different codepath on GB vs GH.
To improve system bootup time, HBM training has been moved out of UEFI
on GB systems. The driver polls a register indicating the training
state and also checks whether the C2C link is ready. The probe fails
if either check fails.
Applied over next-20241220 and the required KVM patch (under review
on the mailing list) to map the GPU device memory as cacheable [3].
Tested on the Grace Blackwell platform by booting up a VM, loading the
NVIDIA module [4] and running nvidia-smi in the VM.
To run CUDA workloads, there is a dependency on the IOMMUFD and the
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories that
include all the requisite kernel [5] and QEMU [6] patches in case
one wants to try them.
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Link: https://lore.kernel.org/all/20241118131958.4609-2-ankita@nvidia.com/ [3]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [4]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [5]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [6]
v3 -> v4
* Added code to enable and restore device memory regions before reading
BAR0 registers as per Alex Williamson's suggestion.
v2 -> v3
* Incorporated Alex Williamson's suggestion to simplify patch 2/3.
* Updated the code in 3/3 to use time_after() and other miscellaneous
suggestions from Alex Williamson.
v1 -> v2
* Rebased to next-20241220.
v3:
Link: https://lore.kernel.org/all/20250117152334.2786-1-ankita@nvidia.com/
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Ankit Agrawal (3):
vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
resmem
vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
vfio/nvgrace-gpu: Check the HBM training and C2C link status
drivers/vfio/pci/nvgrace-gpu/main.c | 160 +++++++++++++++++++++++-----
drivers/vfio/pci/vfio_pci_core.c | 2 +
2 files changed, 138 insertions(+), 24 deletions(-)
--
2.34.1
* [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
2025-01-17 23:37 [PATCH v4 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
@ 2025-01-17 23:37 ` ankita
2025-01-20 7:09 ` Tian, Kevin
2025-01-17 23:37 ` [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
2025-01-17 23:37 ` [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
2 siblings, 1 reply; 16+ messages in thread
From: ankita @ 2025-01-17 23:37 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip. It provides the CPU
and GPU cache coherent access to each other's memory over an internal
proprietary chip-to-chip cache coherent interconnect.
There is a HW defect on GH systems affecting the Multi-Instance
GPU (MIG) feature [1] that necessitated carving out a 1G region with
an uncached mapping from the device memory. The 1G region is exposed
as a fake BAR (comprising regions 2 and 3) to work around the issue.
This is fixed on the GB systems.
The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.
Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 30 +++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index a467085038f0..85eacafaffdf 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -23,6 +23,11 @@
/* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
#define MEMBLK_SIZE SZ_512M
+#define DVSEC_BITMAP_OFFSET 0xA
+#define MIG_SUPPORTED_WITH_CACHED_RESMEM BIT(0)
+
+#define GPU_CAP_DVSEC_REGISTER 3
+
/*
* The state of the two device memory region - resmem and usemem - is
* saved as struct mem_region.
@@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
struct mem_region resmem;
/* Lock to control device memory kernel mapping */
struct mutex remap_lock;
+ bool has_mig_hw_bug_fix;
};
static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
@@ -812,6 +818,26 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
return ret;
}
+static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
+{
+ int pcie_dvsec;
+ u16 dvsec_ctrl16;
+
+ pcie_dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
+ GPU_CAP_DVSEC_REGISTER);
+
+ if (pcie_dvsec) {
+ pci_read_config_word(pdev,
+ pcie_dvsec + DVSEC_BITMAP_OFFSET,
+ &dvsec_ctrl16);
+
+ if (dvsec_ctrl16 & MIG_SUPPORTED_WITH_CACHED_RESMEM)
+ return true;
+ }
+
+ return false;
+}
+
static int nvgrace_gpu_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
@@ -832,6 +858,8 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
dev_set_drvdata(&pdev->dev, &nvdev->core_device);
if (ops == &nvgrace_gpu_pci_ops) {
+ nvdev->has_mig_hw_bug_fix = nvgrace_gpu_has_mig_hw_bug_fix(pdev);
+
/*
* Device memory properties are identified in the host ACPI
* table. Set the nvgrace_gpu_pci_core_device structure.
@@ -868,6 +896,8 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
/* GH200 SKU */
{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2348) },
+ /* GB200 SKU */
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2941) },
{}
};
--
2.34.1
* [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
2025-01-17 23:37 [PATCH v4 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
2025-01-17 23:37 ` [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
@ 2025-01-17 23:37 ` ankita
2025-01-20 7:29 ` Tian, Kevin
2025-01-17 23:37 ` [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
2 siblings, 1 reply; 16+ messages in thread
From: ankita @ 2025-01-17 23:37 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
There is a HW defect on Grace Hopper (GH) affecting the
Multi-Instance GPU (MIG) feature [1] that necessitated carving out a
1G region from the device memory and mapping it as uncached. The 1G
region is exposed as a fake BAR (comprising regions 2 and 3) to work
around the issue.
The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].
This patch accommodates those GB changes by exposing the 64b physical
device BAR1 (regions 2 and 3) to the VM instead of the fake one. This
takes care of both differences.
Moreover, on GB the entire device memory is exposed to the VM as
cacheable since no carveout is required.
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 66 ++++++++++++++++++-----------
1 file changed, 42 insertions(+), 24 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 85eacafaffdf..e6fe5bc8940f 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -17,9 +17,6 @@
#define RESMEM_REGION_INDEX VFIO_PCI_BAR2_REGION_INDEX
#define USEMEM_REGION_INDEX VFIO_PCI_BAR4_REGION_INDEX
-/* Memory size expected as non cached and reserved by the VM driver */
-#define RESMEM_SIZE SZ_1G
-
/* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
#define MEMBLK_SIZE SZ_512M
@@ -72,7 +69,7 @@ nvgrace_gpu_memregion(int index,
if (index == USEMEM_REGION_INDEX)
return &nvdev->usemem;
- if (index == RESMEM_REGION_INDEX)
+ if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
return &nvdev->resmem;
return NULL;
@@ -757,21 +754,31 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
u64 memphys, u64 memlength)
{
int ret = 0;
+ u64 resmem_size = 0;
/*
- * The VM GPU device driver needs a non-cacheable region to support
- * the MIG feature. Since the device memory is mapped as NORMAL cached,
- * carve out a region from the end with a different NORMAL_NC
- * property (called as reserved memory and represented as resmem). This
- * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
- * exposing the rest (termed as usable memory and represented using usemem)
- * as cacheable 64b BAR (region 4 and 5).
+ * On Grace Hopper systems, the VM GPU device driver needs a non-cacheable
+ * region to support the MIG feature owing to a hardware bug. Since the
+ * device memory is mapped as NORMAL cached, carve out a region from the end
+ * with a different NORMAL_NC property (called as reserved memory and
+ * represented as resmem). This region then is exposed as a 64b BAR
+ * (region 2 and 3) to the VM, while exposing the rest (termed as usable
+ * memory and represented using usemem) as cacheable 64b BAR (region 4 and 5).
*
* devmem (memlength)
* |-------------------------------------------------|
* | |
* usemem.memphys resmem.memphys
+ *
+ * This hardware bug is fixed on the Grace Blackwell platforms and the
+ * presence of fix can be determined through nvdev->has_mig_hw_bug_fix.
+ * Thus on systems with the hardware fix, there is no need to partition
+ * the GPU device memory and the entire memory is usable and mapped as
+ * NORMAL cached.
*/
+ if (!nvdev->has_mig_hw_bug_fix)
+ resmem_size = SZ_1G;
+
nvdev->usemem.memphys = memphys;
/*
@@ -780,23 +787,31 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
* memory (usemem) is added to the kernel for usage by the VM
* workloads. Make the usable memory size memblock aligned.
*/
- if (check_sub_overflow(memlength, RESMEM_SIZE,
+ if (check_sub_overflow(memlength, resmem_size,
&nvdev->usemem.memlength)) {
ret = -EOVERFLOW;
goto done;
}
- /*
- * The USEMEM part of the device memory has to be MEMBLK_SIZE
- * aligned. This is a hardwired ABI value between the GPU FW and
- * VFIO driver. The VM device driver is also aware of it and make
- * use of the value for its calculation to determine USEMEM size.
- */
- nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
- MEMBLK_SIZE);
- if (nvdev->usemem.memlength == 0) {
- ret = -EINVAL;
- goto done;
+ if (!nvdev->has_mig_hw_bug_fix) {
+ /*
+ * If the device memory is split to workaround the MIG bug,
+ * the USEMEM part of the device memory has to be MEMBLK_SIZE
+ * aligned. This is a hardwired ABI value between the GPU FW and
+ * VFIO driver. The VM device driver is also aware of it and make
+ * use of the value for its calculation to determine USEMEM size.
+ * Note that the device memory may not be 512M aligned.
+ *
+ * If the hardware has the fix for MIG, there is no requirement
+ * for splitting the device memory to create RESMEM. The entire
+ * device memory is usable and will be USEMEM.
+ */
+ nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
+ MEMBLK_SIZE);
+ if (nvdev->usemem.memlength == 0) {
+ ret = -EINVAL;
+ goto done;
+ }
}
if ((check_add_overflow(nvdev->usemem.memphys,
@@ -813,7 +828,10 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
* the BAR size for them.
*/
nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
- nvdev->resmem.bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
+
+ if (nvdev->resmem.memlength)
+ nvdev->resmem.bar_size =
+ roundup_pow_of_two(nvdev->resmem.memlength);
done:
return ret;
}
--
2.34.1
* [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-17 23:37 [PATCH v4 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
2025-01-17 23:37 ` [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
2025-01-17 23:37 ` [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
@ 2025-01-17 23:37 ` ankita
2025-01-18 1:52 ` Alex Williamson
2 siblings, 1 reply; 16+ messages in thread
From: ankita @ 2025-01-17 23:37 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.
The onus of checking whether the HBM training has completed thus falls
on the module.
The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.
Based on testing, 30s is determined to be sufficient to ensure
initialization completion on all the Grace based systems. Thus poll
these registers for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.
While the wait is not required on Grace Hopper systems, it is
beneficial to perform the check to ensure the device is in an
expected state. Hence the check is kept generalized to both generations.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 64 +++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 2 +
2 files changed, 66 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index e6fe5bc8940f..d3529d2cc3b0 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -5,6 +5,10 @@
#include <linux/sizes.h>
#include <linux/vfio_pci_core.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+
+#include "../vfio_pci_priv.h"
/*
* The device memory usable to the workloads running in the VM is cached
@@ -25,6 +29,13 @@
#define GPU_CAP_DVSEC_REGISTER 3
+#define C2C_LINK_BAR0_OFFSET 0x1498
+#define HBM_TRAINING_BAR0_OFFSET 0x200BC
+#define STATUS_READY 0xFF
+
+#define POLL_QUANTUM_MS 1000
+#define POLL_TIMEOUT_MS (30 * 1000)
+
/*
* The state of the two device memory region - resmem and usemem - is
* saved as struct mem_region.
@@ -856,6 +867,55 @@ static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
return false;
}
+/*
+ * To reduce the system bootup time, the HBM training has
+ * been moved out of the UEFI on the Grace-Blackwell systems.
+ *
+ * The onus of checking whether the HBM training has completed
+ * thus falls on the module. The HBM training status can be
+ * determined from a BAR0 register.
+ *
+ * Similarly, another BAR0 register exposes the status of the
+ * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
+ *
+ * Poll these register and check for 30s. If the HBM training is
+ * not complete or if the C2C link is not ready, fail the probe.
+ *
+ * While the wait is not required on Grace Hopper systems, it
+ * is beneficial to make the check to ensure the device is in an
+ * expected state.
+ */
+static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev,
+ struct vfio_pci_core_device *vdev)
+{
+ unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
+ void __iomem *io;
+ int ret = -ETIME;
+ u16 cmd;
+
+ cmd = vfio_pci_memory_lock_and_enable(vdev);
+ io = pci_iomap(pdev, 0, 0);
+ if (!io) {
+ ret = -ENOMEM;
+ goto iomap_exit;
+ }
+
+ do {
+ if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
+ (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
+ ret = 0;
+ goto reg_check_exit;
+ }
+ msleep(POLL_QUANTUM_MS);
+ } while (!time_after(jiffies, timeout));
+
+reg_check_exit:
+ pci_iounmap(pdev, io);
+iomap_exit:
+ vfio_pci_memory_unlock_and_restore(vdev, cmd);
+ return ret;
+}
+
static int nvgrace_gpu_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
@@ -875,6 +935,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
dev_set_drvdata(&pdev->dev, &nvdev->core_device);
+ ret = nvgrace_gpu_wait_device_ready(pdev, &nvdev->core_device);
+ if (ret)
+ return ret;
+
if (ops == &nvgrace_gpu_pci_ops) {
nvdev->has_mig_hw_bug_fix = nvgrace_gpu_has_mig_hw_bug_fix(pdev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 90240c8d51aa..68f123d17c4b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1634,12 +1634,14 @@ u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
return cmd;
}
+EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
{
pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
up_write(&vdev->memory_lock);
}
+EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
static unsigned long vma_to_pfn(struct vm_area_struct *vma)
{
--
2.34.1
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-17 23:37 ` [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
@ 2025-01-18 1:52 ` Alex Williamson
2025-01-20 2:24 ` Ankit Agrawal
0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2025-01-18 1:52 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, zhiw,
aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
On Fri, 17 Jan 2025 23:37:04 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> In contrast to Grace Hopper systems, the HBM training has been moved
> out of the UEFI on the Grace Blackwell systems. This reduces the system
> bootup time significantly.
>
> The onus of checking whether the HBM training has completed thus falls
> on the module.
>
> The HBM training status can be determined from a BAR0 register.
> Similarly, another BAR0 register exposes the status of the CPU-GPU
> chip-to-chip (C2C) cache coherent interconnect.
>
> Based on testing, 30s is determined to be sufficient to ensure
> initialization completion on all the Grace based systems. Thus poll
> these register and check for 30s. If the HBM training is not complete
> or if the C2C link is not ready, fail the probe.
>
> While the time is not required on Grace Hopper systems, it is
> beneficial to make the check to ensure the device is in an
> expected state. Hence keeping it generalized to both the generations.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 64 +++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_core.c | 2 +
> 2 files changed, 66 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index e6fe5bc8940f..d3529d2cc3b0 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -5,6 +5,10 @@
>
> #include <linux/sizes.h>
> #include <linux/vfio_pci_core.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +
> +#include "../vfio_pci_priv.h"
>
> /*
> * The device memory usable to the workloads running in the VM is cached
> @@ -25,6 +29,13 @@
>
> #define GPU_CAP_DVSEC_REGISTER 3
>
> +#define C2C_LINK_BAR0_OFFSET 0x1498
> +#define HBM_TRAINING_BAR0_OFFSET 0x200BC
> +#define STATUS_READY 0xFF
> +
> +#define POLL_QUANTUM_MS 1000
> +#define POLL_TIMEOUT_MS (30 * 1000)
> +
> /*
> * The state of the two device memory region - resmem and usemem - is
> * saved as struct mem_region.
> @@ -856,6 +867,55 @@ static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
> return false;
> }
>
> +/*
> + * To reduce the system bootup time, the HBM training has
> + * been moved out of the UEFI on the Grace-Blackwell systems.
> + *
> + * The onus of checking whether the HBM training has completed
> + * thus falls on the module. The HBM training status can be
> + * determined from a BAR0 register.
> + *
> + * Similarly, another BAR0 register exposes the status of the
> + * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
> + *
> + * Poll these register and check for 30s. If the HBM training is
> + * not complete or if the C2C link is not ready, fail the probe.
> + *
> + * While the wait is not required on Grace Hopper systems, it
> + * is beneficial to make the check to ensure the device is in an
> + * expected state.
> + */
> +static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev,
> + struct vfio_pci_core_device *vdev)
> +{
> + unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> + void __iomem *io;
> + int ret = -ETIME;
> + u16 cmd;
> +
> + cmd = vfio_pci_memory_lock_and_enable(vdev);
> + io = pci_iomap(pdev, 0, 0);
> + if (!io) {
> + ret = -ENOMEM;
> + goto iomap_exit;
> + }
> +
> + do {
> + if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> + (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> + ret = 0;
> + goto reg_check_exit;
> + }
> + msleep(POLL_QUANTUM_MS);
> + } while (!time_after(jiffies, timeout));
> +
> +reg_check_exit:
> + pci_iounmap(pdev, io);
> +iomap_exit:
> + vfio_pci_memory_unlock_and_restore(vdev, cmd);
> + return ret;
> +}
> +
> static int nvgrace_gpu_probe(struct pci_dev *pdev,
> const struct pci_device_id *id)
> {
> @@ -875,6 +935,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>
> dev_set_drvdata(&pdev->dev, &nvdev->core_device);
>
> + ret = nvgrace_gpu_wait_device_ready(pdev, &nvdev->core_device);
> + if (ret)
> + return ret;
> +
> if (ops == &nvgrace_gpu_pci_ops) {
> nvdev->has_mig_hw_bug_fix = nvgrace_gpu_has_mig_hw_bug_fix(pdev);
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 90240c8d51aa..68f123d17c4b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1634,12 +1634,14 @@ u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
>
> return cmd;
> }
> +EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
>
> void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
> {
> pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
> up_write(&vdev->memory_lock);
> }
> +EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
>
> static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> {
The access is happening before the device is exposed to the user; the
above helpers are for handling conditions where there may be races with
user access, so this is totally unnecessary.
Does this delay even need to happen in the probe function, or could it
happen in the open_device callback? That would still be before user
access, but if we expect it to generally work, it would allow the
training to happen in the background up until the user tries to open
the device. Thanks,
Alex
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-18 1:52 ` Alex Williamson
@ 2025-01-20 2:24 ` Ankit Agrawal
2025-01-20 3:12 ` Alex Williamson
0 siblings, 1 reply; 16+ messages in thread
From: Ankit Agrawal @ 2025-01-20 2:24 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
>> +EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
>>
>> void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
>> {
>> pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
>> up_write(&vdev->memory_lock);
>> }
>> +EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
>>
>> static unsigned long vma_to_pfn(struct vm_area_struct *vma)
>> {
>
> The access is happening before the device is exposed to the user, the
> above are for handling conditions while there may be races with user
> access, this is totally unnecessary.
Right. What I could do to reuse the code is to split out the parts not
related to locking/unlocking into new functions and export those.
The current vfio_pci_memory_lock_and_enable() would take the lock
and call the new function; same for vfio_pci_memory_unlock_and_restore().
The nvgrace module could then call the new functions directly. Does
that sound reasonable?
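A rough sketch of the split being proposed, for illustration only; the
helper name is hypothetical and the restore side would be factored out
the same way:

/* Enable memory decode without touching memory_lock. */
u16 vfio_pci_memory_enable(struct vfio_pci_core_device *vdev)
{
        u16 cmd;

        pci_read_config_word(vdev->pdev, PCI_COMMAND, &cmd);
        if (!(cmd & PCI_COMMAND_MEMORY))
                pci_write_config_word(vdev->pdev, PCI_COMMAND,
                                      cmd | PCI_COMMAND_MEMORY);
        return cmd;
}
EXPORT_SYMBOL_GPL(vfio_pci_memory_enable);

u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
{
        down_write(&vdev->memory_lock);
        return vfio_pci_memory_enable(vdev);
}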
> Does this delay even need to happen in the probe function, or could it
> happen in the open_device callback? That would still be before user
> access, but if we expect it to generally work, it would allow the
> training to happen in the background up until the user tries to open
> the device. Thanks,
>
> Alex
The thought process is that since this is purely the bare metal coming
to a proper state during boot, the nvgrace module should probably wait
for the startup to complete during probe() instead of delaying until
open() time.
- Ankit Agrawal
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-20 2:24 ` Ankit Agrawal
@ 2025-01-20 3:12 ` Alex Williamson
2025-01-20 3:22 ` Alex Williamson
0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2025-01-20 3:12 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Mon, 20 Jan 2025 02:24:14 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:
> >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
> >>
> >> void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
> >> {
> >> pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
> >> up_write(&vdev->memory_lock);
> >> }
> >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
> >>
> >> static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> >> {
> >
> > The access is happening before the device is exposed to the user, the
> > above are for handling conditions while there may be races with user
> > access, this is totally unnecessary.
>
> Right. What I could do to reuse the code is to take out the part
> related to locking/unlocking as new functions and export that.
> The current vfio_pci_memory_lock_and_enable() would take the lock
> and call the new function. Same for vfio_pci_memory_unlock_and_restore().
> The nvgrace module could also call that new function. Does that sound
> reasonable?
No, this is standard PCI driver stuff, everything you need is already
there. Probably pci_enable_device() and some variant of
pci_request_regions().
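For illustration, a minimal sketch of that direction, reusing the
offsets and poll constants from patch 3/3; the region selection and
error handling are simplified and only meant as a starting point:

static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
{
        unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
        void __iomem *io;
        int ret;

        ret = pci_enable_device(pdev);
        if (ret)
                return ret;

        /* Claim and map only BAR0 for the duration of the check. */
        ret = pci_request_selected_regions(pdev, 1 << 0, KBUILD_MODNAME);
        if (ret)
                goto disable;

        io = pci_iomap(pdev, 0, 0);
        if (!io) {
                ret = -ENOMEM;
                goto release;
        }

        ret = -ETIME;
        do {
                if (ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY &&
                    ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY) {
                        ret = 0;
                        break;
                }
                msleep(POLL_QUANTUM_MS);
        } while (!time_after(jiffies, timeout));

        pci_iounmap(pdev, io);
release:
        pci_release_selected_regions(pdev, 1 << 0);
disable:
        pci_disable_device(pdev);
        return ret;
}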
> > Does this delay even need to happen in the probe function, or could it
> > happen in the open_device callback? That would still be before user
> > access, but if we expect it to generally work, it would allow the
> > training to happen in the background up until the user tries to open
> > the device. Thanks,
> >
> > Alex
>
> The thought process is that since it is purely bare metal coming to proper
> state while boot, the nvgrace module should probably wait for the startup
> to complete during probe() instead of delaying until open() time.
If the driver is statically loaded, that might mean you're willing to
stall boot for up to 30s. In practice is this ever actually going to
fail? Thanks,
Alex
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-20 3:12 ` Alex Williamson
@ 2025-01-20 3:22 ` Alex Williamson
2025-01-20 3:35 ` Ankit Agrawal
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Alex Williamson @ 2025-01-20 3:22 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Sun, 19 Jan 2025 20:12:32 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:
> On Mon, 20 Jan 2025 02:24:14 +0000
> Ankit Agrawal <ankita@nvidia.com> wrote:
>
> > >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
> > >>
> > >> void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd)
> > >> {
> > >> pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
> > >> up_write(&vdev->memory_lock);
> > >> }
> > >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
> > >>
> > >> static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> > >> {
> > >
> > > The access is happening before the device is exposed to the user, the
> > > above are for handling conditions while there may be races with user
> > > access, this is totally unnecessary.
> >
> > Right. What I could do to reuse the code is to take out the part
> > related to locking/unlocking as new functions and export that.
> > The current vfio_pci_memory_lock_and_enable() would take the lock
> > and call the new function. Same for vfio_pci_memory_unlock_and_restore().
> > The nvgrace module could also call that new function. Does that sound
> > reasonable?
>
> No, this is standard PCI driver stuff, everything you need is already
> there. Probably pci_enable_device() and some variant of
> pci_request_regions().
>
> > > Does this delay even need to happen in the probe function, or could it
> > > happen in the open_device callback? That would still be before user
> > > access, but if we expect it to generally work, it would allow the
> > > training to happen in the background up until the user tries to open
> > > the device. Thanks,
> > >
> > > Alex
> >
> > The thought process is that since it is purely bare metal coming to proper
> > state while boot, the nvgrace module should probably wait for the startup
> > to complete during probe() instead of delaying until open() time.
>
> If the driver is statically loaded, that might mean you're willing to
> stall boot for up to 30s. In practice is this ever actually going to
> fail? Thanks,
On second thought, I guess a vfio-pci variant driver can't
automatically bind to a device, whether statically built or not, so
maybe this isn't a concern. I'm not sure if there are other concerns
with busy waiting for up to 30s at driver probe. Thanks,
Alex
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-20 3:22 ` Alex Williamson
@ 2025-01-20 3:35 ` Ankit Agrawal
2025-01-20 7:04 ` Tian, Kevin
2025-01-20 13:12 ` Jason Gunthorpe
2 siblings, 0 replies; 16+ messages in thread
From: Ankit Agrawal @ 2025-01-20 3:35 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> No, this is standard PCI driver stuff, everything you need is already
> there. Probably pci_enable_device() and some variant of
> pci_request_regions().
Ok thanks, I'll take a look at that.
>> > > Does this delay even need to happen in the probe function, or could it
>> > > happen in the open_device callback? That would still be before user
>> > > access, but if we expect it to generally work, it would allow the
>> > > training to happen in the background up until the user tries to open
>> > > the device. Thanks,
>> > >
>> > > Alex
>> >
>> > The thought process is that since it is purely bare metal coming to proper
>> > state while boot, the nvgrace module should probably wait for the startup
>> > to complete during probe() instead of delaying until open() time.
>>
>> If the driver is statically loaded, that might mean you're willing to
>> stall boot for up to 30s. In practice is this ever actually going to
>> fail? Thanks,
No, I have not seen this time out in my testing. 30s is considered
sufficient to be sure that the hardware is not in a bad state.
> On second thought, I guess a vfio-pci variant driver can't
> automatically bind to a device, whether statically built or not, so
> maybe this isn't a concern. I'm not sure if there are other concerns
> with busy waiting for up to 30s at driver probe. Thanks,
>
> Alex
* RE: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-20 3:22 ` Alex Williamson
2025-01-20 3:35 ` Ankit Agrawal
@ 2025-01-20 7:04 ` Tian, Kevin
2025-01-20 13:12 ` Jason Gunthorpe
2 siblings, 0 replies; 16+ messages in thread
From: Tian, Kevin @ 2025-01-20 7:04 UTC (permalink / raw)
To: Alex Williamson, Ankit Agrawal
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, Zhi Wang, Aniket Agashe,
Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Sethi, Vikram,
Currid, Andy, Alistair Popple, John Hubbard, Dan Williams,
Anuj Aggarwal (SW-GPU), Matt Ochs, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, January 20, 2025 11:23 AM
>
> On Sun, 19 Jan 2025 20:12:32 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
> > On Mon, 20 Jan 2025 02:24:14 +0000
> > Ankit Agrawal <ankita@nvidia.com> wrote:
> >
> > > >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_lock_and_enable);
> > > >>
> > > >> void vfio_pci_memory_unlock_and_restore(struct
> vfio_pci_core_device *vdev, u16 cmd)
> > > >> {
> > > >> pci_write_config_word(vdev->pdev, PCI_COMMAND, cmd);
> > > >> up_write(&vdev->memory_lock);
> > > >> }
> > > >> +EXPORT_SYMBOL_GPL(vfio_pci_memory_unlock_and_restore);
> > > >>
> > > >> static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> > > >> {
> > > >
> > > > The access is happening before the device is exposed to the user, the
> > > > above are for handling conditions while there may be races with user
> > > > access, this is totally unnecessary.
> > >
> > > Right. What I could do to reuse the code is to take out the part
> > > related to locking/unlocking as new functions and export that.
> > > The current vfio_pci_memory_lock_and_enable() would take the lock
> > > and call the new function. Same for
> vfio_pci_memory_unlock_and_restore().
> > > The nvgrace module could also call that new function. Does that sound
> > > reasonable?
> >
> > No, this is standard PCI driver stuff, everything you need is already
> > there. Probably pci_enable_device() and some variant of
> > pci_request_regions().
> >
> > > > Does this delay even need to happen in the probe function, or could it
> > > > happen in the open_device callback? That would still be before user
> > > > access, but if we expect it to generally work, it would allow the
> > > > training to happen in the background up until the user tries to open
> > > > the device. Thanks,
> > > >
> > > > Alex
> > >
> > > The thought process is that since it is purely bare metal coming to proper
> > > state while boot, the nvgrace module should probably wait for the
> startup
> > > to complete during probe() instead of delaying until open() time.
> >
> > If the driver is statically loaded, that might mean you're willing to
> > stall boot for up to 30s. In practice is this ever actually going to
> > fail? Thanks,
>
> On second thought, I guess a vfio-pci variant driver can't
> automatically bind to a device, whether statically built or not, so
> maybe this isn't a concern. I'm not sure if there are other concerns
> with busy waiting for up to 30s at driver probe. Thanks,
>
Can this wait be left to userspace, i.e. the variant driver just does a
one-off check and fails the probe if the device is not ready? NVIDIA
can describe the requirement that the administrator may need to
wait for 30s and retry the driver probe if the 1st attempt fails...
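For illustration, a minimal sketch of such a one-off check, reusing the
register offsets from patch 3/3 and assuming BAR0 has already been
mapped by the caller; the error code is just an example:

static int nvgrace_gpu_check_device_ready(void __iomem *io)
{
        if (ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY &&
            ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)
                return 0;

        /* Not ready yet; fail the probe and let the admin retry later. */
        return -EAGAIN;
}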
* RE: [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
2025-01-17 23:37 ` [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
@ 2025-01-20 7:09 ` Tian, Kevin
2025-01-20 17:01 ` Ankit Agrawal
0 siblings, 1 reply; 16+ messages in thread
From: Tian, Kevin @ 2025-01-20 7:09 UTC (permalink / raw)
To: ankita@nvidia.com, jgg@nvidia.com, alex.williamson@redhat.com,
yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com,
zhiw@nvidia.com
Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
targupta@nvidia.com, Sethi, Vikram, Currid, Andy,
apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
anuaggarwal@nvidia.com, mochs@nvidia.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> From: ankita@nvidia.com <ankita@nvidia.com>
> Sent: Saturday, January 18, 2025 7:37 AM
>
> @@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
> struct mem_region resmem;
> /* Lock to control device memory kernel mapping */
> struct mutex remap_lock;
> + bool has_mig_hw_bug_fix;
Is 'has_mig_hw_bug' clearer, given GB+ hardware should all inherit
the fix anyway?
>
> if (ops == &nvgrace_gpu_pci_ops) {
> + nvdev->has_mig_hw_bug_fix =
> nvgrace_gpu_has_mig_hw_bug_fix(pdev);
> +
Move it into nvgrace_gpu_init_nvdev_struct() which has plenty
of information to help understand that line.
* RE: [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
2025-01-17 23:37 ` [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
@ 2025-01-20 7:29 ` Tian, Kevin
2025-01-20 17:13 ` Ankit Agrawal
0 siblings, 1 reply; 16+ messages in thread
From: Tian, Kevin @ 2025-01-20 7:29 UTC (permalink / raw)
To: ankita@nvidia.com, jgg@nvidia.com, alex.williamson@redhat.com,
yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com,
zhiw@nvidia.com
Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
targupta@nvidia.com, Sethi, Vikram, Currid, Andy,
apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
anuaggarwal@nvidia.com, mochs@nvidia.com, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
> From: ankita@nvidia.com <ankita@nvidia.com>
> Sent: Saturday, January 18, 2025 7:37 AM
> @@ -780,23 +787,31 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev
> *pdev,
> * memory (usemem) is added to the kernel for usage by the VM
> * workloads. Make the usable memory size memblock aligned.
> */
> - if (check_sub_overflow(memlength, RESMEM_SIZE,
> + if (check_sub_overflow(memlength, resmem_size,
> &nvdev->usemem.memlength)) {
> ret = -EOVERFLOW;
> goto done;
> }
>
> - /*
> - * The USEMEM part of the device memory has to be MEMBLK_SIZE
> - * aligned. This is a hardwired ABI value between the GPU FW and
> - * VFIO driver. The VM device driver is also aware of it and make
> - * use of the value for its calculation to determine USEMEM size.
> - */
> - nvdev->usemem.memlength = round_down(nvdev-
> >usemem.memlength,
> - MEMBLK_SIZE);
> - if (nvdev->usemem.memlength == 0) {
> - ret = -EINVAL;
> - goto done;
> + if (!nvdev->has_mig_hw_bug_fix) {
> + /*
> + * If the device memory is split to workaround the MIG bug,
> + * the USEMEM part of the device memory has to be
> MEMBLK_SIZE
> + * aligned. This is a hardwired ABI value between the GPU FW
> and
> + * VFIO driver. The VM device driver is also aware of it and
> make
> + * use of the value for its calculation to determine USEMEM
> size.
> + * Note that the device memory may not be 512M aligned.
> + *
> + * If the hardware has the fix for MIG, there is no
> requirement
> + * for splitting the device memory to create RESMEM. The
> entire
> + * device memory is usable and will be USEMEM.
Just to double confirm: with the fix, is it no longer required to have
the usemem 512M aligned, or does hardware guarantee that usemem is
always 512M aligned?
And it's clearer to return early when the fix is there so the majority of
the existing code can be left intact instead of causing unnecessary
indent here.
* Re: [PATCH v4 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2025-01-20 3:22 ` Alex Williamson
2025-01-20 3:35 ` Ankit Agrawal
2025-01-20 7:04 ` Tian, Kevin
@ 2025-01-20 13:12 ` Jason Gunthorpe
2 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 13:12 UTC (permalink / raw)
To: Alex Williamson
Cc: Ankit Agrawal, Yishai Hadas, shameerali.kolothum.thodi@huawei.com,
kevin.tian@intel.com, Zhi Wang, Aniket Agashe, Neo Jia,
Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid,
Alistair Popple, John Hubbard, Dan Williams,
Anuj Aggarwal (SW-GPU), Matt Ochs, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
On Sun, Jan 19, 2025 at 08:22:52PM -0700, Alex Williamson wrote:
> On second thought, I guess a vfio-pci variant driver can't
> automatically bind to a device, whether statically built or not, so
> maybe this isn't a concern. I'm not sure if there are other concerns
> with busy waiting for up to 30s at driver probe. Thanks,
It is not entirely abnormal. mlx5 has a timeout while it pushes
its first command, and on the failure path it is long, > 30s IIRC.
Drivers that take a couple of seconds to complete probe are not
unusual.
Jason
* Re: [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
2025-01-20 7:09 ` Tian, Kevin
@ 2025-01-20 17:01 ` Ankit Agrawal
2025-01-21 1:20 ` Tian, Kevin
0 siblings, 1 reply; 16+ messages in thread
From: Ankit Agrawal @ 2025-01-20 17:01 UTC (permalink / raw)
To: Tian, Kevin, Jason Gunthorpe, alex.williamson@redhat.com,
Yishai Hadas, shameerali.kolothum.thodi@huawei.com, Zhi Wang
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Thanks Kevin for the comments.
>>
>> @@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
>> struct mem_region resmem;
>> /* Lock to control device memory kernel mapping */
>> struct mutex remap_lock;
>> + bool has_mig_hw_bug_fix;
>
> Is 'has_mig_hw_bug" clearer given GB+ hardware should all inherit
> the fix anyway?
IIUC, are you suggesting an inverted implementation? i.e. use
has_mig_hw_bug as the struct member with the semantics
!has_mig_hw_bug_fix?
>>
>> if (ops == &nvgrace_gpu_pci_ops) {
>> + nvdev->has_mig_hw_bug_fix =
>> nvgrace_gpu_has_mig_hw_bug_fix(pdev);
>> +
>
> Move it into nvgrace_gpu_init_nvdev_struct() which has plenty
> of information to help understand that line.
Ack, will update in the next version.
* Re: [PATCH v4 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
2025-01-20 7:29 ` Tian, Kevin
@ 2025-01-20 17:13 ` Ankit Agrawal
0 siblings, 0 replies; 16+ messages in thread
From: Ankit Agrawal @ 2025-01-20 17:13 UTC (permalink / raw)
To: Tian, Kevin, Jason Gunthorpe, alex.williamson@redhat.com,
Yishai Hadas, shameerali.kolothum.thodi@huawei.com, Zhi Wang
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
>> + if (!nvdev->has_mig_hw_bug_fix) {
>> + /*
>> + * If the device memory is split to workaround the MIG bug,
>> + * the USEMEM part of the device memory has to be
>> MEMBLK_SIZE
>> + * aligned. This is a hardwired ABI value between the GPU FW
>> and
>> + * VFIO driver. The VM device driver is also aware of it and
>> make
>> + * use of the value for its calculation to determine USEMEM
>> size.
>> + * Note that the device memory may not be 512M aligned.
>> + *
>> + * If the hardware has the fix for MIG, there is no
>> requirement
>> + * for splitting the device memory to create RESMEM. The
>> entire
>> + * device memory is usable and will be USEMEM.
>
> Just double confirm. With the fix it's not required to have the usemem
> 512M aligned, or does hardware guarantee that usemem is always
> 512M aligned?
The first one: on devices without the MIG bug, the device memory
passed to the VM need not be 512M aligned. The devices may still have
non-512M-aligned memory.
> And it's clearer to return early when the fix is there so the majority of
> the existing code can be left intact instead of causing unnecessary
> indent here.
I think that can be done. We calculate nvdev->usemem.bar_size further
down in the function, but I suppose that can be moved up before
returning early.
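For illustration, a rough sketch of that early-return shape, reusing
the field names from this series; only the start of
nvgrace_gpu_init_nvdev_struct() is shown and the existing carveout
path is elided:

static int
nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
                              struct nvgrace_gpu_pci_core_device *nvdev,
                              u64 memphys, u64 memlength)
{
        nvdev->usemem.memphys = memphys;

        if (nvdev->has_mig_hw_bug_fix) {
                /* No resmem carveout: all device memory is usemem. */
                nvdev->usemem.memlength = memlength;
                nvdev->usemem.bar_size =
                        roundup_pow_of_two(nvdev->usemem.memlength);
                return 0;
        }

        /* ... existing RESMEM carveout and MEMBLK_SIZE alignment ... */
        return 0;
}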
* RE: [PATCH v4 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
2025-01-20 17:01 ` Ankit Agrawal
@ 2025-01-21 1:20 ` Tian, Kevin
0 siblings, 0 replies; 16+ messages in thread
From: Tian, Kevin @ 2025-01-21 1:20 UTC (permalink / raw)
To: Ankit Agrawal, Jason Gunthorpe, alex.williamson@redhat.com,
Yishai Hadas, shameerali.kolothum.thodi@huawei.com, Zhi Wang
Cc: Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
Sethi, Vikram, Currid, Andy, Alistair Popple, John Hubbard,
Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
> From: Ankit Agrawal <ankita@nvidia.com>
> Sent: Tuesday, January 21, 2025 1:02 AM
> >>
> >> @@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
> >> struct mem_region resmem;
> >> /* Lock to control device memory kernel mapping */
> >> struct mutex remap_lock;
> >> + bool has_mig_hw_bug_fix;
> >
> > Is 'has_mig_hw_bug" clearer given GB+ hardware should all inherit
> > the fix anyway?
>
> IIUC, are you suggesting an inverted implementation? i.e. use
> has_mig_hw_bug as the struct member with the semantics
> !has_mig_hw_bug_fix?
yes