From: Alex Williamson <alex.williamson@redhat.com>
To: <ankita@nvidia.com>
Cc: <jgg@nvidia.com>, <yishaih@nvidia.com>,
<shameerali.kolothum.thodi@huawei.com>, <kevin.tian@intel.com>,
<zhiw@nvidia.com>, <aniketa@nvidia.com>, <cjia@nvidia.com>,
<kwankhede@nvidia.com>, <targupta@nvidia.com>,
<vsethi@nvidia.com>, <acurrid@nvidia.com>, <apopple@nvidia.com>,
<jhubbard@nvidia.com>, <danw@nvidia.com>,
<anuaggarwal@nvidia.com>, <mochs@nvidia.com>,
<kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1 0/3] vfio/nvgrace-gpu: Enable grace blackwell boards
Date: Mon, 7 Oct 2024 08:19:13 -0600
Message-ID: <20241007081913.74b3deed.alex.williamson@redhat.com>
In-Reply-To: <20241006102722.3991-1-ankita@nvidia.com>
On Sun, 6 Oct 2024 10:27:19 +0000
<ankita@nvidia.com> wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
>
> NVIDIA's recently introduced Grace Blackwell (GB) Superchip is a
> continuation of the Grace Hopper (GH) Superchip. Like GH, it gives
> the CPU and GPU cache-coherent access to each other's memory over an
> internal proprietary chip-to-chip (C2C) interconnect. The in-tree
> nvgrace-gpu driver manages the GH devices; this series extends that
> support to the new Grace Blackwell boards.
Where do we stand on QEMU enablement of GH, or the GB support here?
IIRC, the nvgrace-gpu variant driver was initially proposed with QEMU
being the means through which the community could make use of this
driver, but there seem to be a number of pieces missing for that
support. Thanks,
Alex
> There is a HW defect on GH affecting the Multi-Instance GPU (MIG)
> feature [1] that necessitated carving a 1G region out of the device
> memory and mapping it uncached. The 1G region is exposed as a fake
> BAR (comprising regions 2 and 3) to work around the issue.
>
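For illustration, a minimal sketch of that carve-out, under stated
assumptions: the symbol names and the firmware-provided memphys/memlength
values below are hypothetical placeholders, not the driver's actual
identifiers.

  #include <linux/sizes.h>
  #include <linux/types.h>

  /* Hypothetical sketch: the last 1G of device memory is reserved,
   * mapped uncached, and exposed to the VM as a fake BAR backed by
   * regions 2 and 3; the remainder stays cacheable.
   */
  #define RESMEM_SIZE SZ_1G

  static void nvgrace_gpu_split_mem(u64 memphys, u64 memlength,
                                    u64 *usemem_size, u64 *resmem_start)
  {
          *usemem_size = memlength - RESMEM_SIZE;  /* cacheable part */
          *resmem_start = memphys + *usemem_size;  /* uncached 1G tail */
  }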
> The GB systems differ from GH systems in the following aspects:
> 1. The aforementioned HW defect is fixed on GB systems.
> 2. There is a usable BAR1 (regions 2 and 3) on GB systems for the
> GPUDirect RDMA feature [2].
>
> This patch series accommodates those GB changes by showing the real
> physical device BAR1 (regions 2 and 3) to the VM instead of the fake
> one, which takes care of both differences.
>
> The firmware communicates the presence of the fix for the HW defect
> through a DVSEC PCI config register. The module reads it to select
> the GB codepath rather than the GH one.
>
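As a rough illustration of how such a check might look, using the
kernel's pci_find_dvsec_capability() helper; the DVSEC ID, register
offset, and flag bit below are made-up placeholders, not values from
the patches:

  #include <linux/pci.h>

  #define NVGRACE_DVSEC_ID   0x1     /* placeholder, not the real ID */
  #define NVGRACE_FIXED_BIT  BIT(0)  /* placeholder flag bit */

  static bool nvgrace_gpu_hw_defect_fixed(struct pci_dev *pdev)
  {
          u16 pos;
          u32 val;

          pos = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_NVIDIA,
                                          NVGRACE_DVSEC_ID);
          if (!pos)
                  return false;  /* no DVSEC: fall back to GH behavior */

          /* 0xc is a placeholder offset into the vendor-specific area */
          pci_read_config_dword(pdev, pos + 0xc, &val);
          return val & NVGRACE_FIXED_BIT;
  }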
> To improve system boot time, HBM training is moved out of UEFI on GB
> systems. The driver polls a register that reports the training state
> and also checks that the C2C link is ready, failing the probe if
> either check fails.
>
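A plausible shape for that probe-time check, using the kernel's
read_poll_timeout() helper; the register offsets, ready value, and
timeout below are hypothetical placeholders:

  #include <linux/iopoll.h>

  #define HBM_TRAINING_REG  0x0  /* placeholder offset */
  #define C2C_STATUS_REG    0x4  /* placeholder offset */
  #define STATUS_READY      0x1  /* placeholder ready value */

  static int nvgrace_gpu_wait_device_ready(void __iomem *regs)
  {
          u32 val;
          int ret;

          /* Poll until HBM training reports done, or time out */
          ret = read_poll_timeout(readl, val, val == STATUS_READY,
                                  1000, 30 * USEC_PER_SEC, false,
                                  regs + HBM_TRAINING_REG);
          if (ret)
                  return ret;  /* probe fails on timeout */

          /* Then verify the C2C link is up */
          val = readl(regs + C2C_STATUS_REG);
          return val == STATUS_READY ? 0 : -ENODEV;
  }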
> Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
> Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]
>
> Applied over next-20241003.
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
>
> Ankit Agrawal (3):
> vfio/nvgrace-gpu: Read dvsec register to determine need for uncached
> resmem
> vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
> vfio/nvgrace-gpu: Check the HBM training and C2C link status
>
> drivers/vfio/pci/nvgrace-gpu/main.c | 115 ++++++++++++++++++++++++++--
> 1 file changed, 107 insertions(+), 8 deletions(-)
>