public inbox for kvm@vger.kernel.org
* [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards
@ 2025-01-24 18:30 ankita
  2025-01-24 18:30 ` [PATCH v6 1/4] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: ankita @ 2025-01-24 18:30 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
	kevin.tian, zhiw
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip design, providing the
CPU and GPU cache-coherent access to each other's memory over an
internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
The in-tree nvgrace-gpu driver manages the GH devices. The intention
is to extend that support to the new Grace Blackwell boards.

There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
feature [1] that necessitated a 1G region being carved out from the
device memory and mapped uncached. The 1G region is exposed as a
fake BAR (comprising regions 2 and 3) to work around the issue.

The GB systems differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
GPUDirect RDMA feature [2].

This patch series accommodates those GB changes by showing the real
physical device BAR1 (regions 2 and 3) to the VM instead of the fake
one. This takes care of both differences.

The presence of the fix for the HW defect is communicated by the
firmware through a DVSEC PCI config register. The module reads
this to take a different codepath on GB vs GH.
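The GB-vs-GH decision boils down to locating the NVIDIA vendor DVSEC
with ID 3 and testing one bit of a 16-bit word at a fixed offset. A
minimal, hardware-free C model of that decision (the constants match
patch 1/4; the standalone helper and its bool arguments are
illustrative stand-ins for the driver's PCI config accesses, not the
driver's API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants as defined in patch 1/4 */
#define DVSEC_BITMAP_OFFSET              0xA
#define MIG_SUPPORTED_WITH_CACHED_RESMEM (1u << 0)

/*
 * Decision made after pci_find_dvsec_capability() locates the NVIDIA
 * DVSEC with ID 3 and pci_read_config_word() fetches the 16-bit word
 * at DVSEC_BITMAP_OFFSET: if the capability is absent, or the bit is
 * clear, assume the Grace Hopper MIG bug is present.
 */
bool has_mig_hw_bug(bool dvsec_found, uint16_t dvsec_ctrl16)
{
	if (dvsec_found && (dvsec_ctrl16 & MIG_SUPPORTED_WITH_CACHED_RESMEM))
		return false;	/* fix present: Grace Blackwell codepath */
	return true;		/* assume the bug: Grace Hopper codepath */
}
```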

To improve system bootup time, HBM training is moved out of UEFI
on GB systems. Poll the register indicating the training state and
also check whether the C2C link is ready. Fail the probe if either
check fails.
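The bounded poll described above can be modeled in plain C without
hardware (STATUS_READY matches patch 3/4; the callback shape and the
poll-count budget are illustrative stand-ins for the driver's
ioread32()/msleep()/time_after() loop over BAR0):

```c
#include <stdint.h>

#define STATUS_READY 0xFF

/*
 * Model of the bounded poll in patch 3/4: query a status source up to
 * max_polls times, succeeding only if it reports STATUS_READY within
 * the budget. read_status() stands in for the two BAR0 ioread32()
 * reads; the driver sleeps POLL_QUANTUM_MS (1s) between polls and
 * gives up after POLL_TIMEOUT_MS (30s).
 */
int wait_device_ready(uint32_t (*read_status)(void *ctx), void *ctx,
		      int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (read_status(ctx) == STATUS_READY)
			return 0;
		/* msleep(POLL_QUANTUM_MS) in the driver */
	}
	return -1;	/* the driver errors out and fails the probe */
}

/* Example status source: reports ready from the Nth read onward. */
uint32_t ready_after(void *ctx)
{
	int *reads_left = ctx;

	return --(*reads_left) <= 0 ? STATUS_READY : 0;
}
```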

Applied over next-20241220 and the required KVM patch (under review
on the mailing list) to map the GPU device memory as cacheable [3].
Tested on the Grace Blackwell platform by booting a VM, loading the
NVIDIA module [4] and running nvidia-smi in the VM.

To run CUDA workloads, there is a dependency on the IOMMUFD and
Nested Page Table patches being worked on separately by Nicolin Chen
(nicolinc@nvidia.com). NVIDIA has provided git repositories which
include all the requisite kernel [5] and QEMU [6] patches in case
one wants to try.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Link: https://lore.kernel.org/all/20241118131958.4609-2-ankita@nvidia.com/ [3]
Link: https://github.com/NVIDIA/open-gpu-kernel-modules [4]
Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [5]
Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [6]

v5 -> v6
* Updated the code based on Alex Williamson's suggestion to move the
  device id enablement to a new patch and using KBUILD_MODNAME
  in place of "vfio-pci"

v4 -> v5
* Added code to enable the BAR0 region as per Alex Williamson's suggestion.
* Updated code based on Kevin Tian's suggestion to replace the variable
  with the semantic representing the presence of MIG bug. Also reorg the
  code to return early for blackwell without any resmem processing.
* Code comments updates.

v3 -> v4
* Added code to enable and restore device memory regions before reading
  BAR0 registers as per Alex Williamson's suggestion.

v2 -> v3
* Incorporated Alex Williamson's suggestion to simplify patch 2/3.
* Updated the code in 3/3 to use time_after() and other miscellaneous
  suggestions from Alex Williamson.

v1 -> v2
* Rebased to next-20241220.

v5:
Link: https://lore.kernel.org/all/20250123174854.3338-1-ankita@nvidia.com/

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Ankit Agrawal (4):
  vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
    resmem
  vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
  vfio/nvgrace-gpu: Check the HBM training and C2C link status
  vfio/nvgrace-gpu: Add GB200 SKU to the devid table

 drivers/vfio/pci/nvgrace-gpu/main.c | 169 ++++++++++++++++++++++++----
 1 file changed, 147 insertions(+), 22 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v6 1/4] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
@ 2025-01-24 18:30 ` ankita
  2025-01-24 18:31 ` [PATCH v6 2/4] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: ankita @ 2025-01-24 18:30 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
	kevin.tian, zhiw
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip design, providing the
CPU and GPU cache-coherent access to each other's memory over an
internal proprietary chip-to-chip cache-coherent interconnect.

There is a HW defect on GH systems affecting the Multi-Instance
GPU (MIG) feature [1] that necessitated a 1G region with an uncached
mapping being carved out from the device memory. The 1G region is
exposed as a fake BAR (comprising regions 2 and 3) to work around
the issue. This is fixed on the GB systems.

The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.

Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]

CC: Jason Gunthorpe <jgg@nvidia.com>
CC: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index a467085038f0..b76368958d1c 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -23,6 +23,11 @@
 /* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
 #define MEMBLK_SIZE SZ_512M
 
+#define DVSEC_BITMAP_OFFSET 0xA
+#define MIG_SUPPORTED_WITH_CACHED_RESMEM BIT(0)
+
+#define GPU_CAP_DVSEC_REGISTER 3
+
 /*
  * The state of the two device memory region - resmem and usemem - is
  * saved as struct mem_region.
@@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
 	struct mem_region resmem;
 	/* Lock to control device memory kernel mapping */
 	struct mutex remap_lock;
+	bool has_mig_hw_bug;
 };
 
 static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
@@ -812,6 +818,26 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 	return ret;
 }
 
+static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
+{
+	int pcie_dvsec;
+	u16 dvsec_ctrl16;
+
+	pcie_dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
+					       GPU_CAP_DVSEC_REGISTER);
+
+	if (pcie_dvsec) {
+		pci_read_config_word(pdev,
+				     pcie_dvsec + DVSEC_BITMAP_OFFSET,
+				     &dvsec_ctrl16);
+
+		if (dvsec_ctrl16 & MIG_SUPPORTED_WITH_CACHED_RESMEM)
+			return false;
+	}
+
+	return true;
+}
+
 static int nvgrace_gpu_probe(struct pci_dev *pdev,
 			     const struct pci_device_id *id)
 {
@@ -832,6 +858,8 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
 	dev_set_drvdata(&pdev->dev, &nvdev->core_device);
 
 	if (ops == &nvgrace_gpu_pci_ops) {
+		nvdev->has_mig_hw_bug = nvgrace_gpu_has_mig_hw_bug(pdev);
+
 		/*
 		 * Device memory properties are identified in the host ACPI
 		 * table. Set the nvgrace_gpu_pci_core_device structure.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v6 2/4] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
  2025-01-24 18:30 ` [PATCH v6 1/4] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
@ 2025-01-24 18:31 ` ankita
  2025-01-24 18:31 ` [PATCH v6 3/4] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: ankita @ 2025-01-24 18:31 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
	kevin.tian, zhiw
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

There is a HW defect on Grace Hopper (GH) affecting the
Multi-Instance GPU (MIG) feature [1] that necessitated a 1G region
being carved out from the device memory and mapped as uncached.
The 1G region is exposed as a fake BAR (comprising regions 2 and 3)
to work around the issue.

The Grace Blackwell (GB) systems differ from GH systems in the
following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
GPUDirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical
device BAR1 (regions 2 and 3) to the VM instead of the fake one. This
takes care of both differences.

Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.
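The carveout arithmetic can be checked with ordinary integer math
(sizes and the MEMBLK_SIZE alignment are taken from the series; the
helpers mirror the kernel's round_down() and roundup_pow_of_two(),
and the function shape is illustrative, not the driver's):

```c
#include <stdint.h>

#define SZ_512M     (512ULL << 20)
#define SZ_1G       (1ULL << 30)
#define MEMBLK_SIZE SZ_512M	/* hardwired GPU FW <-> VFIO ABI value */

/* Equivalent of the kernel's round_down() for power-of-two alignment. */
uint64_t round_down_p2(uint64_t v, uint64_t align)
{
	return v & ~(align - 1);
}

/* Equivalent of roundup_pow_of_two() for v > 0. */
uint64_t roundup_p2(uint64_t v)
{
	uint64_t p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Usable (usemem) length as computed in nvgrace_gpu_init_nvdev_struct():
 * with the MIG bug, SZ_1G of resmem is carved from the end and usemem
 * is aligned down to MEMBLK_SIZE; without the bug, the entire device
 * memory is usemem.
 */
uint64_t usemem_length(uint64_t memlength, int has_mig_hw_bug)
{
	if (!has_mig_hw_bug)
		return memlength;
	return round_down_p2(memlength - SZ_1G, MEMBLK_SIZE);
}
```

Each region's BAR is then sized to roundup_p2() of its length, so a
95G usemem is exposed through a 128G BAR.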

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 67 +++++++++++++++++++----------
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index b76368958d1c..778bfd0655de 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -17,9 +17,6 @@
 #define RESMEM_REGION_INDEX VFIO_PCI_BAR2_REGION_INDEX
 #define USEMEM_REGION_INDEX VFIO_PCI_BAR4_REGION_INDEX
 
-/* Memory size expected as non cached and reserved by the VM driver */
-#define RESMEM_SIZE SZ_1G
-
 /* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
 #define MEMBLK_SIZE SZ_512M
 
@@ -72,7 +69,7 @@ nvgrace_gpu_memregion(int index,
 	if (index == USEMEM_REGION_INDEX)
 		return &nvdev->usemem;
 
-	if (index == RESMEM_REGION_INDEX)
+	if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
 		return &nvdev->resmem;
 
 	return NULL;
@@ -757,40 +754,67 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 			      u64 memphys, u64 memlength)
 {
 	int ret = 0;
+	u64 resmem_size = 0;
 
 	/*
-	 * The VM GPU device driver needs a non-cacheable region to support
-	 * the MIG feature. Since the device memory is mapped as NORMAL cached,
-	 * carve out a region from the end with a different NORMAL_NC
-	 * property (called as reserved memory and represented as resmem). This
-	 * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
-	 * exposing the rest (termed as usable memory and represented using usemem)
-	 * as cacheable 64b BAR (region 4 and 5).
+	 * On Grace Hopper systems, the VM GPU device driver needs a non-cacheable
+	 * region to support the MIG feature owing to a hardware bug. Since the
+	 * device memory is mapped as NORMAL cached, carve out a region from the end
+	 * with a different NORMAL_NC property (called as reserved memory and
+	 * represented as resmem). This region then is exposed as a 64b BAR
+	 * (region 2 and 3) to the VM, while exposing the rest (termed as usable
+	 * memory and represented using usemem) as cacheable 64b BAR (region 4 and 5).
 	 *
 	 *               devmem (memlength)
 	 * |-------------------------------------------------|
 	 * |                                           |
 	 * usemem.memphys                              resmem.memphys
+	 *
+	 * This hardware bug is fixed on the Grace Blackwell platforms and the
+	 * presence of the bug can be determined through nvdev->has_mig_hw_bug.
+	 * Thus on systems with the hardware fix, there is no need to partition
+	 * the GPU device memory and the entire memory is usable and mapped as
+	 * NORMAL cached (i.e. resmem size is 0).
 	 */
+	if (nvdev->has_mig_hw_bug)
+		resmem_size = SZ_1G;
+
 	nvdev->usemem.memphys = memphys;
 
 	/*
 	 * The device memory exposed to the VM is added to the kernel by the
-	 * VM driver module in chunks of memory block size. Only the usable
-	 * memory (usemem) is added to the kernel for usage by the VM
-	 * workloads. Make the usable memory size memblock aligned.
+	 * VM driver module in chunks of memory block size. Note that only the
+	 * usable memory (usemem) is added to the kernel for usage by the VM
+	 * workloads.
 	 */
-	if (check_sub_overflow(memlength, RESMEM_SIZE,
+	if (check_sub_overflow(memlength, resmem_size,
 			       &nvdev->usemem.memlength)) {
 		ret = -EOVERFLOW;
 		goto done;
 	}
 
 	/*
-	 * The USEMEM part of the device memory has to be MEMBLK_SIZE
-	 * aligned. This is a hardwired ABI value between the GPU FW and
-	 * VFIO driver. The VM device driver is also aware of it and make
-	 * use of the value for its calculation to determine USEMEM size.
+	 * The usemem region is exposed as a 64B Bar composed of region 4 and 5.
+	 * Calculate and save the BAR size for the region.
+	 */
+	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
+
+	/*
+	 * If the hardware has the fix for MIG, there is no requirement
+	 * for splitting the device memory to create RESMEM. The entire
+	 * device memory is usable and will be USEMEM. Return here for
+	 * such case.
+	 */
+	if (!nvdev->has_mig_hw_bug)
+		goto done;
+
+	/*
+	 * When the device memory is split to workaround the MIG bug on
+	 * Grace Hopper, the USEMEM part of the device memory has to be
+	 * MEMBLK_SIZE aligned. This is a hardwired ABI value between the
+	 * GPU FW and VFIO driver. The VM device driver is also aware of it
+	 * and make use of the value for its calculation to determine USEMEM
+	 * size. Note that the device memory may not be 512M aligned.
 	 */
 	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
 					     MEMBLK_SIZE);
@@ -809,10 +833,9 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 	}
 
 	/*
-	 * The memory regions are exposed as BARs. Calculate and save
-	 * the BAR size for them.
+	 * The resmem region is exposed as a 64b BAR composed of region 2 and 3
+	 * for Grace Hopper. Calculate and save the BAR size for the region.
 	 */
-	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
 	nvdev->resmem.bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
 done:
 	return ret;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v6 3/4] vfio/nvgrace-gpu: Check the HBM training and C2C link status
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
  2025-01-24 18:30 ` [PATCH v6 1/4] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
  2025-01-24 18:31 ` [PATCH v6 2/4] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
@ 2025-01-24 18:31 ` ankita
  2025-01-24 18:31 ` [PATCH v6 4/4] vfio/nvgrace-gpu: Add GB200 SKU to the devid table ankita
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: ankita @ 2025-01-24 18:31 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
	kevin.tian, zhiw
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

In contrast to Grace Hopper systems, the HBM training has been moved
out of UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.

The onus of checking whether the HBM training has completed thus falls
on the module.

The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s was determined to be sufficient to ensure
initialization completion on all the Grace-based systems. Thus poll
these registers for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.

While the wait is not required on Grace Hopper systems, it is
beneficial to make the check to ensure the device is in an
expected state. Hence the check is kept generalized to both
generations.

Ensure that the BAR0 is enabled before accessing the registers.

CC: Alex Williamson <alex.williamson@redhat.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 72 +++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 778bfd0655de..655a624134cc 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -5,6 +5,8 @@
 
 #include <linux/sizes.h>
 #include <linux/vfio_pci_core.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
 
 /*
  * The device memory usable to the workloads running in the VM is cached
@@ -25,6 +27,13 @@
 
 #define GPU_CAP_DVSEC_REGISTER 3
 
+#define C2C_LINK_BAR0_OFFSET 0x1498
+#define HBM_TRAINING_BAR0_OFFSET 0x200BC
+#define STATUS_READY 0xFF
+
+#define POLL_QUANTUM_MS 1000
+#define POLL_TIMEOUT_MS (30 * 1000)
+
 /*
  * The state of the two device memory region - resmem and usemem - is
  * saved as struct mem_region.
@@ -861,6 +870,65 @@ static bool nvgrace_gpu_has_mig_hw_bug(struct pci_dev *pdev)
 	return true;
 }
 
+/*
+ * To reduce the system bootup time, the HBM training has
+ * been moved out of the UEFI on the Grace-Blackwell systems.
+ *
+ * The onus of checking whether the HBM training has completed
+ * thus falls on the module. The HBM training status can be
+ * determined from a BAR0 register.
+ *
+ * Similarly, another BAR0 register exposes the status of the
+ * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
+ *
+ * Poll these register and check for 30s. If the HBM training is
+ * not complete or if the C2C link is not ready, fail the probe.
+ *
+ * While the wait is not required on Grace Hopper systems, it
+ * is beneficial to make the check to ensure the device is in an
+ * expected state.
+ *
+ * Ensure that the BAR0 region is enabled before accessing the
+ * registers.
+ */
+static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
+{
+	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
+	void __iomem *io;
+	int ret = -ETIME;
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		return ret;
+
+	ret = pci_request_selected_regions(pdev, 1 << 0, KBUILD_MODNAME);
+	if (ret)
+		goto request_region_exit;
+
+	io = pci_iomap(pdev, 0, 0);
+	if (!io) {
+		ret = -ENOMEM;
+		goto iomap_exit;
+	}
+
+	do {
+		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
+		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
+			ret = 0;
+			goto reg_check_exit;
+		}
+		msleep(POLL_QUANTUM_MS);
+	} while (!time_after(jiffies, timeout));
+
+reg_check_exit:
+	pci_iounmap(pdev, io);
+iomap_exit:
+	pci_release_selected_regions(pdev, 1 << 0);
+request_region_exit:
+	pci_disable_device(pdev);
+	return ret;
+}
+
 static int nvgrace_gpu_probe(struct pci_dev *pdev,
 			     const struct pci_device_id *id)
 {
@@ -869,6 +937,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
 	u64 memphys, memlength;
 	int ret;
 
+	ret = nvgrace_gpu_wait_device_ready(pdev);
+	if (ret)
+		return ret;
+
 	ret = nvgrace_gpu_fetch_memory_property(pdev, &memphys, &memlength);
 	if (!ret)
 		ops = &nvgrace_gpu_pci_ops;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v6 4/4] vfio/nvgrace-gpu: Add GB200 SKU to the devid table
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
                   ` (2 preceding siblings ...)
  2025-01-24 18:31 ` [PATCH v6 3/4] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
@ 2025-01-24 18:31 ` ankita
  2025-01-24 22:05 ` [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
  2025-01-28  2:03 ` Matt Ochs
  5 siblings, 0 replies; 9+ messages in thread
From: ankita @ 2025-01-24 18:31 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
	kevin.tian, zhiw
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA is productizing the new Grace Blackwell superchip
SKU bearing device ID 0x2941.

Add the SKU devid to nvgrace_gpu_vfio_pci_table.

CC: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 drivers/vfio/pci/nvgrace-gpu/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index 655a624134cc..e5ac39c4cc6b 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -991,6 +991,8 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
 	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
 	/* GH200 SKU */
 	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2348) },
+	/* GB200 SKU */
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2941) },
 	{}
 };
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
                   ` (3 preceding siblings ...)
  2025-01-24 18:31 ` [PATCH v6 4/4] vfio/nvgrace-gpu: Add GB200 SKU to the devid table ankita
@ 2025-01-24 22:05 ` Alex Williamson
  2025-01-29  2:18   ` Ankit Agrawal
  2025-01-28  2:03 ` Matt Ochs
  5 siblings, 1 reply; 9+ messages in thread
From: Alex Williamson @ 2025-01-24 22:05 UTC (permalink / raw)
  To: ankita
  Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, zhiw,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, kjaju, udhoke, dnigam, nandinid, anuaggarwal,
	mochs, kvm, linux-kernel

On Fri, 24 Jan 2025 18:30:58 +0000
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
> continuation of the Grace Hopper (GH) Superchip design, providing the
> CPU and GPU cache-coherent access to each other's memory over an
> internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
> The in-tree nvgrace-gpu driver manages the GH devices. The intention
> is to extend that support to the new Grace Blackwell boards.
> 
> There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
> feature [1] that necessitated a 1G region being carved out from the
> device memory and mapped uncached. The 1G region is exposed as a
> fake BAR (comprising regions 2 and 3) to work around the issue.
> 
> The GB systems differ from GH systems in the following aspects:
> 1. The aforementioned HW defect is fixed on GB systems.
> 2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
> GPUDirect RDMA feature [2].
> 
> This patch series accommodates those GB changes by showing the real
> physical device BAR1 (regions 2 and 3) to the VM instead of the fake
> one. This takes care of both differences.
> 
> The presence of the fix for the HW defect is communicated by the
> firmware through a DVSEC PCI config register. The module reads
> this to take a different codepath on GB vs GH.
> 
> To improve system bootup time, HBM training is moved out of UEFI
> on GB systems. Poll the register indicating the training state and
> also check whether the C2C link is ready. Fail the probe if either
> check fails.
> 
> Applied over next-20241220 and the required KVM patch (under review
> on the mailing list) to map the GPU device memory as cacheable [3].
> Tested on the Grace Blackwell platform by booting a VM, loading the
> NVIDIA module [4] and running nvidia-smi in the VM.
> 
> To run CUDA workloads, there is a dependency on the IOMMUFD and
> Nested Page Table patches being worked on separately by Nicolin Chen
> (nicolinc@nvidia.com). NVIDIA has provided git repositories which
> include all the requisite kernel [5] and QEMU [6] patches in case
> one wants to try.
> 
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
> Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
> Link: https://lore.kernel.org/all/20241118131958.4609-2-ankita@nvidia.com/ [3]
> Link: https://github.com/NVIDIA/open-gpu-kernel-modules [4]
> Link: https://github.com/NVIDIA/NV-Kernels/tree/6.8_ghvirt [5]
> Link: https://github.com/NVIDIA/QEMU/tree/6.8_ghvirt_iommufd_vcmdq [6]
> 
> v5 -> v6

LGTM.  I'll give others who have reviewed this a short opportunity to
take a final look.  We're already in the merge window but I think we're
just wrapping up some loose ends and I don't see any benefit to holding
it back, so pending comments from others, I'll plan to include it early
next week.  Thanks,

Alex

> * Updated the code based on Alex Williamson's suggestion to move the
>   device id enablement to a new patch and using KBUILD_MODNAME
>   in place of "vfio-pci"
> 
> v4 -> v5
> * Added code to enable the BAR0 region as per Alex Williamson's suggestion.
> * Updated code based on Kevin Tian's suggestion to replace the variable
>   with the semantic representing the presence of MIG bug. Also reorg the
>   code to return early for blackwell without any resmem processing.
> * Code comments updates.
> 
> v3 -> v4
> * Added code to enable and restore device memory regions before reading
>   BAR0 registers as per Alex Williamson's suggestion.
> 
> v2 -> v3
> * Incorporated Alex Williamson's suggestion to simplify patch 2/3.
> * Updated the code in 3/3 to use time_after() and other miscellaneous
>   suggestions from Alex Williamson.
> 
> v1 -> v2
> * Rebased to next-20241220.
> 
> v5:
> Link: https://lore.kernel.org/all/20250123174854.3338-1-ankita@nvidia.com/
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> 
> Ankit Agrawal (4):
>   vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
>     resmem
>   vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
>   vfio/nvgrace-gpu: Check the HBM training and C2C link status
>   vfio/nvgrace-gpu: Add GB200 SKU to the devid table
> 
>  drivers/vfio/pci/nvgrace-gpu/main.c | 169 ++++++++++++++++++++++++----
>  1 file changed, 147 insertions(+), 22 deletions(-)
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards
  2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
                   ` (4 preceding siblings ...)
  2025-01-24 22:05 ` [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
@ 2025-01-28  2:03 ` Matt Ochs
  2025-01-28  5:11   ` Ankit Agrawal
  5 siblings, 1 reply; 9+ messages in thread
From: Matt Ochs @ 2025-01-28  2:03 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, Alex Williamson, Yishai Hadas,
	Shameerali Kolothum Thodi, Kevin Tian, Zhi Wang, Aniket Agashe,
	Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi,
	Andy Currid, Alistair Popple, John Hubbard, Dan Williams,
	Krishnakant Jaju, Uday Dhoke, Dheeraj Nigam, Nandini De,
	Anuj Aggarwal (SW-GPU), kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org

> On Jan 24, 2025, at 12:30 PM, Ankit Agrawal <ankita@nvidia.com> wrote:
> 
> v5 -> v6
> * Updated the code based on Alex Williamson's suggestion to move the
>  device id enablement to a new patch and using KBUILD_MODNAME
>  in place of "vfio-pci"
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> 

Tested series with Grace-Blackwell and Grace-Hopper.

Tested-by: Matthew R. Ochs <mochs@nvidia.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards
  2025-01-28  2:03 ` Matt Ochs
@ 2025-01-28  5:11   ` Ankit Agrawal
  0 siblings, 0 replies; 9+ messages in thread
From: Ankit Agrawal @ 2025-01-28  5:11 UTC (permalink / raw)
  To: Matt Ochs
  Cc: Jason Gunthorpe, Alex Williamson, Yishai Hadas,
	Shameerali Kolothum Thodi, Kevin Tian, Zhi Wang, Aniket Agashe,
	Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi,
	Andy Currid, Alistair Popple, John Hubbard, Dan Williams,
	Krishnakant Jaju, Uday Dhoke, Dheeraj Nigam, Nandini De,
	Anuj Aggarwal (SW-GPU), kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org

>> v5 -> v6
>> * Updated the code based on Alex Williamson's suggestion to move the
>>  device id enablement to a new patch and using KBUILD_MODNAME
>>  in place of "vfio-pci"
>>
>> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>>
>
> Tested series with Grace-Blackwell and Grace-Hopper.
> 
> Tested-by: Matthew R. Ochs <mochs@nvidia.com>

Thank you so much, Matt!

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards
  2025-01-24 22:05 ` [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
@ 2025-01-29  2:18   ` Ankit Agrawal
  0 siblings, 0 replies; 9+ messages in thread
From: Ankit Agrawal @ 2025-01-29  2:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Yishai Hadas,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
	John Hubbard, Dan Williams, Krishnakant Jaju, Uday Dhoke,
	Dheeraj Nigam, Nandini De, Anuj Aggarwal (SW-GPU), Matt Ochs,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org

>>
>> v5 -> v6
>
> LGTM.  I'll give others who have reviewed this a short opportunity to
> take a final look.  We're already in the merge window but I think we're
> just wrapping up some loose ends and I don't see any benefit to holding
> it back, so pending comments from others, I'll plan to include it early
> next week.  Thanks,
> 
> Alex

Thank you very much Alex for guiding through this!

- Ankit

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-01-29  2:18 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-24 18:30 [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
2025-01-24 18:30 ` [PATCH v6 1/4] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
2025-01-24 18:31 ` [PATCH v6 2/4] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
2025-01-24 18:31 ` [PATCH v6 3/4] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
2025-01-24 18:31 ` [PATCH v6 4/4] vfio/nvgrace-gpu: Add GB200 SKU to the devid table ankita
2025-01-24 22:05 ` [PATCH v6 0/4] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
2025-01-29  2:18   ` Ankit Agrawal
2025-01-28  2:03 ` Matt Ochs
2025-01-28  5:11   ` Ankit Agrawal
