* [PATCH v1 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem
2024-10-06 10:27 [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
@ 2024-10-06 10:27 ` ankita
2024-10-06 10:27 ` [PATCH v1 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: ankita @ 2024-10-06 10:27 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
continuation of the Grace Hopper (GH) Superchip, providing the CPU
and GPU cache-coherent access to each other's memory over an
internal proprietary chip-to-chip (C2C) cache-coherent interconnect.
A HW defect on GH systems affecting the Multi-Instance GPU (MIG)
feature [1] necessitated carving out a 1G region with an uncached
mapping from the device memory. The 1G region is exposed as a fake
BAR (comprising regions 2 and 3) to work around the issue. This
defect is fixed on the GB systems.
The presence of the fix for the HW defect is communicated by the
device firmware through the DVSEC PCI config register with ID 3.
The module reads this to take a different codepath on GB vs GH.
Scan through the DVSEC registers to identify the correct one and use
it to determine the presence of the fix. Save the value in the device's
nvgrace_gpu_pci_core_device structure.
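For readers unfamiliar with DVSEC discovery, the capability walk that
pci_find_dvsec_capability() performs can be sketched in plain C over a
raw config-space buffer. This is an illustrative model only (offsets
follow the PCIe extended capability layout; the buffer contents in any
example are synthetic, not from real hardware):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

#define PCI_EXT_CAP_START    0x100
#define PCI_EXT_CAP_ID_DVSEC 0x23

/* Read a little-endian 32-bit / 16-bit value from the config buffer. */
static uint32_t rd32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

static uint16_t rd16(const uint8_t *cfg, size_t off)
{
	return (uint16_t)cfg[off] | ((uint16_t)cfg[off + 1] << 8);
}

/*
 * Walk the extended config space and return the offset of the DVSEC
 * capability matching (vendor, dvsec_id), or 0 if absent -- the same
 * contract as the kernel's pci_find_dvsec_capability().
 */
static size_t find_dvsec(const uint8_t *cfg, size_t len,
			 uint16_t vendor, uint16_t dvsec_id)
{
	size_t pos = PCI_EXT_CAP_START;

	while (pos && pos + 12 <= len) {
		uint32_t hdr = rd32(cfg, pos);

		if ((hdr & 0xffff) == PCI_EXT_CAP_ID_DVSEC &&
		    rd16(cfg, pos + 4) == vendor &&  /* DVSEC header 1: vendor */
		    rd16(cfg, pos + 8) == dvsec_id)  /* DVSEC header 2: ID */
			return pos;
		pos = hdr >> 20;                     /* next capability offset */
	}
	return 0;
}
```

With such a helper, the patch's check reduces to finding the NVIDIA
DVSEC with ID 3 and testing bit 0 of the u16 at offset 0xA from the
capability start.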
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 30 +++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index a7fd018aa548..c23db6eaf979 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -23,6 +23,11 @@
/* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
#define MEMBLK_SIZE SZ_512M
+#define DVSEC_BITMAP_OFFSET 0xA
+#define MIG_SUPPORTED_WITH_CACHED_RESMEM BIT(0)
+
+#define GPU_CAP_DVSEC_REGISTER 3
+
/*
* The state of the two device memory region - resmem and usemem - is
* saved as struct mem_region.
@@ -46,6 +51,7 @@ struct nvgrace_gpu_pci_core_device {
struct mem_region resmem;
/* Lock to control device memory kernel mapping */
struct mutex remap_lock;
+ bool has_mig_hw_bug_fix;
};
static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
@@ -812,6 +818,26 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
return ret;
}
+static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
+{
+ int pcie_dvsec;
+ u16 dvsec_ctrl16;
+
+ pcie_dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
+ GPU_CAP_DVSEC_REGISTER);
+
+ if (pcie_dvsec) {
+ pci_read_config_word(pdev,
+ pcie_dvsec + DVSEC_BITMAP_OFFSET,
+ &dvsec_ctrl16);
+
+ if (dvsec_ctrl16 & MIG_SUPPORTED_WITH_CACHED_RESMEM)
+ return true;
+ }
+
+ return false;
+}
+
static int nvgrace_gpu_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
@@ -832,6 +858,8 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
dev_set_drvdata(&pdev->dev, &nvdev->core_device);
if (ops == &nvgrace_gpu_pci_ops) {
+ nvdev->has_mig_hw_bug_fix = nvgrace_gpu_has_mig_hw_bug_fix(pdev);
+
/*
* Device memory properties are identified in the host ACPI
* table. Set the nvgrace_gpu_pci_core_device structure.
@@ -866,6 +894,8 @@ static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
/* GH200 480GB */
{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
+ /* GB200 SKU */
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2941) },
{}
};
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread

* [PATCH v1 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
2024-10-06 10:27 [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
2024-10-06 10:27 ` [PATCH v1 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
@ 2024-10-06 10:27 ` ankita
2024-10-06 10:27 ` [PATCH v1 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
2024-10-07 14:19 ` [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
3 siblings, 0 replies; 8+ messages in thread
From: ankita @ 2024-10-06 10:27 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
A HW defect on Grace Hopper (GH) affecting the Multi-Instance GPU
(MIG) feature [1] necessitated carving out a 1G region from the
device memory and mapping it uncached. The 1G region is exposed as a
fake BAR (comprising regions 2 and 3) to work around the issue.
The Grace Blackwell systems (GB) differ from GH systems in the following
aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the
GPUdirect RDMA feature [2].
This patch accommodates those GB changes by exposing the 64b physical
device BAR1 (regions 2 and 3) to the VM instead of the fake one,
which takes care of both the differences.
Moreover, the entire device memory is exposed on GB as cacheable to
the VM as there is no carveout required.
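The region-dispatch change at the heart of this patch can be
illustrated with a trimmed-down, hypothetical model of the driver
state. The struct layout and index values below are placeholders for
illustration, not the driver's actual definitions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical, trimmed-down view of the driver state. */
struct mem_region { unsigned long memphys; };

struct fake_dev {
	struct mem_region usemem;
	struct mem_region resmem;
	bool has_mig_hw_bug_fix;
};

/* Placeholder indices standing in for the driver's region indexes. */
#define RESMEM_REGION_INDEX 2
#define USEMEM_REGION_INDEX 4

/*
 * Mirror of the patched nvgrace_gpu_memregion() dispatch: usemem is
 * always backed by device memory, while the resmem carveout exists
 * only on parts that still carry the MIG HW bug. On fixed (Blackwell)
 * parts the resmem index falls through to NULL, so the physical BAR1
 * is what gets exposed for regions 2 and 3 instead of the fake BAR.
 */
static struct mem_region *memregion(int index, struct fake_dev *d)
{
	if (index == USEMEM_REGION_INDEX)
		return &d->usemem;
	if (!d->has_mig_hw_bug_fix && index == RESMEM_REGION_INDEX)
		return &d->resmem;
	return NULL;
}
```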
Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 32 +++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index c23db6eaf979..e3a7eceb6228 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -72,7 +72,7 @@ nvgrace_gpu_memregion(int index,
if (index == USEMEM_REGION_INDEX)
return &nvdev->usemem;
- if (index == RESMEM_REGION_INDEX)
+ if (!nvdev->has_mig_hw_bug_fix && index == RESMEM_REGION_INDEX)
return &nvdev->resmem;
return NULL;
@@ -715,6 +715,16 @@ static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = {
.detach_ioas = vfio_iommufd_physical_detach_ioas,
};
+static void
+nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
+ struct nvgrace_gpu_pci_core_device *nvdev,
+ u64 memphys, u64 memlength)
+{
+ nvdev->usemem.memphys = memphys;
+ nvdev->usemem.memlength = memlength;
+ nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
+}
+
static int
nvgrace_gpu_fetch_memory_property(struct pci_dev *pdev,
u64 *pmemphys, u64 *pmemlength)
@@ -752,9 +762,9 @@ nvgrace_gpu_fetch_memory_property(struct pci_dev *pdev,
}
static int
-nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
- struct nvgrace_gpu_pci_core_device *nvdev,
- u64 memphys, u64 memlength)
+nvgrace_gpu_nvdev_struct_workaround(struct pci_dev *pdev,
+ struct nvgrace_gpu_pci_core_device *nvdev,
+ u64 memphys, u64 memlength)
{
int ret = 0;
@@ -864,10 +874,16 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
* Device memory properties are identified in the host ACPI
* table. Set the nvgrace_gpu_pci_core_device structure.
*/
- ret = nvgrace_gpu_init_nvdev_struct(pdev, nvdev,
- memphys, memlength);
- if (ret)
- goto out_put_vdev;
+ if (nvdev->has_mig_hw_bug_fix) {
+ nvgrace_gpu_init_nvdev_struct(pdev, nvdev,
+ memphys, memlength);
+ } else {
+ ret = nvgrace_gpu_nvdev_struct_workaround(pdev, nvdev,
+ memphys,
+ memlength);
+ if (ret)
+ goto out_put_vdev;
+ }
}
ret = vfio_pci_core_register_device(&nvdev->core_device);
--
2.34.1
* [PATCH v1 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status
2024-10-06 10:27 [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
2024-10-06 10:27 ` [PATCH v1 1/3] vfio/nvgrace-gpu: Read dvsec register to determine need for uncached resmem ankita
2024-10-06 10:27 ` [PATCH v1 2/3] vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM ankita
@ 2024-10-06 10:27 ` ankita
2024-10-07 14:19 ` [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
3 siblings, 0 replies; 8+ messages in thread
From: ankita @ 2024-10-06 10:27 UTC (permalink / raw)
To: ankita, jgg, alex.williamson, yishaih, shameerali.kolothum.thodi,
kevin.tian, zhiw
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
In contrast to Grace Hopper systems, the HBM training has been moved
out of the UEFI on the Grace Blackwell systems. This reduces the system
bootup time significantly.
The onus of checking whether the HBM training has completed thus falls
on the module.
The HBM training status can be determined from a BAR0 register.
Similarly, another BAR0 register exposes the status of the CPU-GPU
chip-to-chip (C2C) cache coherent interconnect.
Based on testing, 30s was determined to be sufficient to ensure
initialization completion on all Grace-based systems. Thus poll
these registers for up to 30s. If the HBM training is not complete
or if the C2C link is not ready, fail the probe.
While the wait is not required on Grace Hopper systems, the check is
beneficial to ensure the device is in an expected state. Hence it is
kept generalized across both generations.
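The polling logic can be sketched in host-side C with the register
read injected as a callback, which makes the timeout behavior testable
without hardware. This is a sketch of the patch's loop, not the driver
code itself; the sleep is elided so the logic runs instantly here:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define STATUS_READY     0xFF
#define POLL_QUANTUM_MS  1000
#define POLL_TIMEOUT_MS  (30 * 1000)

/* Offsets as in the patch. */
#define C2C_LINK_BAR0_OFFSET     0x1498
#define HBM_TRAINING_BAR0_OFFSET 0x200BC

/* Register reader callback standing in for ioread32() on BAR0. */
typedef uint32_t (*read_reg_fn)(void *ctx, unsigned long offset);

/*
 * Sample both status registers once per quantum until both report
 * ready or the 30s budget is exhausted. The real driver calls
 * msleep(POLL_QUANTUM_MS) between samples and returns -ENODEV on
 * timeout; here we return -1 so the sketch stays freestanding.
 */
static int wait_device_ready(read_reg_fn rd, void *ctx)
{
	int elapsed;

	for (elapsed = 0; elapsed < POLL_TIMEOUT_MS;
	     elapsed += POLL_QUANTUM_MS) {
		if (rd(ctx, C2C_LINK_BAR0_OFFSET) == STATUS_READY &&
		    rd(ctx, HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)
			return 0;
		/* msleep(POLL_QUANTUM_MS) in the real driver */
	}
	return -1;
}

/* Two trivial fake readers for demonstration. */
static uint32_t always_ready(void *ctx, unsigned long off)
{
	(void)ctx; (void)off;
	return STATUS_READY;
}

static uint32_t never_ready(void *ctx, unsigned long off)
{
	(void)ctx; (void)off;
	return 0;
}
```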
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 53 +++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index e3a7eceb6228..5736d8f8caa3 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -5,6 +5,7 @@
#include <linux/sizes.h>
#include <linux/vfio_pci_core.h>
+#include <linux/delay.h>
/*
* The device memory usable to the workloads running in the VM is cached
@@ -28,6 +29,13 @@
#define GPU_CAP_DVSEC_REGISTER 3
+#define C2C_LINK_BAR0_OFFSET 0x1498
+#define HBM_TRAINING_BAR0_OFFSET 0x200BC
+#define STATUS_READY 0xFF
+
+#define POLL_QUANTUM_MS 1000
+#define POLL_TIMEOUT_MS (30 * 1000)
+
/*
* The state of the two device memory region - resmem and usemem - is
* saved as struct mem_region.
@@ -848,6 +856,47 @@ static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
return false;
}
+/*
+ * To reduce the system bootup time, the HBM training has
+ * been moved out of the UEFI on the Grace-Blackwell systems.
+ *
+ * The onus of checking whether the HBM training has completed
+ * thus falls on the module. The HBM training status can be
+ * determined from a BAR0 register.
+ *
+ * Similarly, another BAR0 register exposes the status of the
+ * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
+ *
+ * Poll these registers for up to 30s. If the HBM training is
+ * not complete or if the C2C link is not ready, fail the probe.
+ *
+ * While the wait is not required on Grace Hopper systems, it
+ * is beneficial to make the check to ensure the device is in an
+ * expected state.
+ */
+static int nvgrace_gpu_check_device_status(struct pci_dev *pdev)
+{
+ void __iomem *io;
+ int time_elapsed;
+
+ io = pci_iomap(pdev, 0, ~0UL);
+ if (!io)
+ return -ENOMEM;
+
+ for (time_elapsed = 0; time_elapsed < POLL_TIMEOUT_MS;
+ time_elapsed += POLL_QUANTUM_MS) {
+ if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
+ (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
+ pci_iounmap(pdev, io);
+ return 0;
+ }
+ msleep(POLL_QUANTUM_MS);
+ }
+
+ pci_iounmap(pdev, io);
+ return -ENODEV;
+}
+
static int nvgrace_gpu_probe(struct pci_dev *pdev,
const struct pci_device_id *id)
{
@@ -856,6 +905,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
u64 memphys, memlength;
int ret;
+ ret = nvgrace_gpu_check_device_status(pdev);
+ if (ret)
+ return ret;
+
ret = nvgrace_gpu_fetch_memory_property(pdev, &memphys, &memlength);
if (!ret)
ops = &nvgrace_gpu_pci_ops;
--
2.34.1
* Re: [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
2024-10-06 10:27 [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards ankita
` (2 preceding siblings ...)
2024-10-06 10:27 ` [PATCH v1 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status ankita
@ 2024-10-07 14:19 ` Alex Williamson
2024-10-07 16:37 ` Ankit Agrawal
3 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2024-10-07 14:19 UTC (permalink / raw)
To: ankita
Cc: jgg, yishaih, shameerali.kolothum.thodi, kevin.tian, zhiw,
aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, anuaggarwal, mochs, kvm, linux-kernel
On Sun, 6 Oct 2024 10:27:19 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's recently introduced Grace Blackwell (GB) Superchip in
> continuation with the Grace Hopper (GH) superchip that provides a
> cache coherent access to CPU and GPU to each other's memory with
> an internal proprietary chip-to-chip (C2C) cache coherent interconnect.
> The in-tree nvgrace-gpu driver manages the GH devices. The intention
> is to extend the support to the new Grace Blackwell boards.
Where do we stand on QEMU enablement of GH, or the GB support here?
IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
being the means through which the community could make use of this
driver, but there seem to be a number of pieces missing for that
support. Thanks,
Alex
> There is a HW defect on GH to support the Multi-Instance GPU (MIG)
> feature [1] that necessiated the presence of a 1G carved out from
> the device memory and mapped uncached. The 1G region is shown as a
> fake BAR (comprising region 2 and 3) to workaround the issue.
>
> The GB systems differ from GH systems in the following aspects.
> 1. The aforementioned HW defect is fixed on GB systems.
> 2. There is a usable BAR1 (region 2 and 3) on GB systems for the
> GPUdirect RDMA feature [2].
>
> This patch series accommodate those GB changes by showing the real
> physical device BAR1 (region2 and 3) to the VM instead of the fake
> one. This takes care of both the differences.
>
> The presence of the fix for the HW defect is communicated by the
> firmware through a DVSEC PCI config register. The module reads
> this to take a different codepath on GB vs GH.
>
> To improve system bootup time, HBM training is moved out of UEFI
> in GB system. Poll for the register indicating the training state.
> Also check the C2C link status if it is ready. Fail the probe if
> either fails.
>
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
> Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
>
> Applied over next-20241003.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>
> Ankit Agrawal (3):
> vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
> resmem
> vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
> vfio/nvgrace-gpu: Check the HBM training and C2C link status
>
> drivers/vfio/pci/nvgrace-gpu/main.c | 115 ++++++++++++++++++++++++++--
> 1 file changed, 107 insertions(+), 8 deletions(-)
>
* Re: [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
2024-10-07 14:19 ` [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards Alex Williamson
@ 2024-10-07 16:37 ` Ankit Agrawal
2024-10-07 21:16 ` Alex Williamson
0 siblings, 1 reply; 8+ messages in thread
From: Ankit Agrawal @ 2024-10-07 16:37 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
>>
>> NVIDIA's recently introduced Grace Blackwell (GB) Superchip in
>> continuation with the Grace Hopper (GH) superchip that provides a
>> cache coherent access to CPU and GPU to each other's memory with
>> an internal proprietary chip-to-chip (C2C) cache coherent interconnect.
>> The in-tree nvgrace-gpu driver manages the GH devices. The intention
>> is to extend the support to the new Grace Blackwell boards.
>
> Where do we stand on QEMU enablement of GH, or the GB support here?
> IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
> being the means through which the community could make use of this
> driver, but there seem to be a number of pieces missing for that
> support. Thanks,
>
> Alex
Hi Alex, the Qemu enablement changes for GH are already in Qemu 9.0.
This is the Generic initiator change that got merged:
https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/
The missing pieces are actually in kvm/kernel, viz:
1. KVM needs to map the device memory as Normal. The KVM patch was
proposed here and needs a refresh to address the suggestions:
https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
2. ECC handling series for the GPU device memory that is remap_pfn_range()
mapped: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/
With those changes, GH would be functional with Qemu 9.0.
We discovered a separate Qemu issue while verifying Grace Blackwell,
where the 512G of highmem proved insufficient here:
https://github.com/qemu/qemu/blob/v9.0.0/hw/arm/virt.c#L211
We are planning to float a proposal to fix that.
Thanks
Ankit Agrawal
* Re: [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
2024-10-07 16:37 ` Ankit Agrawal
@ 2024-10-07 21:16 ` Alex Williamson
2024-10-08 7:22 ` Ankit Agrawal
0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2024-10-07 21:16 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
On Mon, 7 Oct 2024 16:37:12 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:
> >>
> >> NVIDIA's recently introduced Grace Blackwell (GB) Superchip in
> >> continuation with the Grace Hopper (GH) superchip that provides a
> >> cache coherent access to CPU and GPU to each other's memory with
> >> an internal proprietary chip-to-chip (C2C) cache coherent interconnect.
> >> The in-tree nvgrace-gpu driver manages the GH devices. The intention
> >> is to extend the support to the new Grace Blackwell boards.
> >
> > Where do we stand on QEMU enablement of GH, or the GB support here?
> > IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
> > being the means through which the community could make use of this
> > driver, but there seem to be a number of pieces missing for that
> > support. Thanks,
> >
> > Alex
>
> Hi Alex, the Qemu enablement changes for GH is already in Qemu 9.0.
> This is the Generic initiator change that got merged:
> https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/
>
> The missing pieces are actually in the kvm/kernel viz:
> 1. KVM need to map the device memory as Normal. The KVM patch was
> proposed here. This patch need refresh to address the suggestions:
> https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
> 2. ECC handling series for the GPU device memory that is remap_pfn_range()
> mapped: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/
>
> With those changes, the GH would be functional with the Qemu 9.0.
Sure, unless we note that those series were posted a year ago, which
makes it much harder to claim that we're actively enabling upstream
testing for this driver that we're now trying to extend to new
hardware. Thanks,
Alex
> We discovered a separate Qemu issue while doing verification of Grace Blackwell,
> where the 512G of highmem proved short here:
> https://github.com/qemu/qemu/blob/v9.0.0/hw/arm/virt.c#L211
> We are planning to have a proposal for the fix floated for that.
>
> Thanks
> Ankit Agrawal
>
* Re: [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
2024-10-07 21:16 ` Alex Williamson
@ 2024-10-08 7:22 ` Ankit Agrawal
0 siblings, 0 replies; 8+ messages in thread
From: Ankit Agrawal @ 2024-10-08 7:22 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Yishai Hadas,
shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
Zhi Wang, Aniket Agashe, Neo Jia, Kirti Wankhede,
Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
John Hubbard, Dan Williams, Anuj Aggarwal (SW-GPU), Matt Ochs,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
>>
>> Hi Alex, the Qemu enablement changes for GH is already in Qemu 9.0.
>> This is the Generic initiator change that got merged:
>> https://lore.kernel.org/all/20240308145525.10886-1-ankita@nvidia.com/
>>
>> The missing pieces are actually in the kvm/kernel viz:
>> 1. KVM need to map the device memory as Normal. The KVM patch was
>> proposed here. This patch need refresh to address the suggestions:
>> https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
>> 2. ECC handling series for the GPU device memory that is remap_pfn_range()
>> mapped: https://lore.kernel.org/all/20231123003513.24292-1-ankita@nvidia.com/
>>
>> With those changes, the GH would be functional with the Qemu 9.0.
>
> Sure, unless we note that those series were posted a year ago, which
> makes it much harder to claim that we're actively enabling upstream
> testing for this driver that we're now trying to extend to new
> hardware. Thanks,
>
> Alex
Right, I am also working to implement the leftover items mentioned above.
The work to refresh them is ongoing and I will be posting it shortly
as well (starting with the KVM patch).