[RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support
@ 2025-09-03 22:10 Zhi Wang
  2025-09-03 22:10 ` [RFC v2 01/14] vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude Zhi Wang
                   ` (13 more replies)
  0 siblings, 14 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:10 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

1. Background
=============

NVIDIA vGPU[1] software enables powerful GPU performance for workloads
ranging from graphics-rich virtual workstations to data science and AI,
enabling IT to leverage the management and security benefits of
virtualization as well as the performance of NVIDIA GPUs required for
modern workloads. Installed on a physical GPU in a cloud or enterprise
data center server, NVIDIA vGPU software creates virtual GPUs that can
be shared across multiple virtual machines.

The vGPU architecture[2] can be illustrated as follow:

 +--------------------+    +--------------------+ +--------------------+ +--------------------+ 
 | Hypervisor         |    | Guest VM           | | Guest VM           | | Guest VM           | 
 |                    |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | +----------------+ |    | |Applications... | | | |Applications... | | | |Applications... | | 
 | |  NVIDIA        | |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | |  Virtual GPU   | |    | +----------------+ | | +----------------+ | | +----------------+ | 
 | |  Manager       | |    | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | | 
 | +------^---------+ |    | +----------------+ | | +----------------+ | | +----------------+ | 
 |        |           |    +---------^----------+ +----------^---------+ +----------^---------+ 
 |        |           |              |                       |                      |           
 |        |           +--------------+-----------------------+----------------------+---------+ 
 |        |                          |                       |                      |         | 
 |        |                          |                       |                      |         | 
 +--------+--------------------------+-----------------------+----------------------+---------+ 
+---------v--------------------------+-----------------------+----------------------+----------+
| NVIDIA                  +----------v---------+ +-----------v--------+ +-----------v--------+ |
| Physical GPU            |   Virtual GPU      | |   Virtual GPU      | |   Virtual GPU      | |
|                         +--------------------+ +--------------------+ +--------------------+ |
+----------------------------------------------------------------------------------------------+

Each NVIDIA vGPU is analogous to a conventional GPU, having a fixed amount
of GPU framebuffer, and one or more virtual display outputs or "heads".
The vGPU's framebuffer is allocated out of the physical GPU's framebuffer
at the time the vGPU is created, and the vGPU retains exclusive use of
that framebuffer until it is destroyed.

Each physical GPU can support several different types of virtual GPU
(vGPU). vGPU types have a fixed amount of frame buffer, number of
supported display heads, and maximum resolutions. They are grouped into
different series according to the different classes of workload for which
they are optimized. Each series is identified by the last letter of the
vGPU type name.

NVIDIA vGPU supports Windows and Linux guest VM operating systems. The
supported vGPU types depend on the guest VM OS.

2. Proposal For Upstream
========================

2.1 Architecture
----------------

Moving to the upstream, the proposed architecture can be illustrated as followings:

                            +--------------------+ +--------------------+ +--------------------+ 
                            | Linux VM           | | Windows VM         | | Guest VM           | 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            | |Applications... | | | |Applications... | | | |Applications... | | 
                            | +----------------+ | | +----------------+ | | +----------------+ | ... 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | | 
                            | +----------------+ | | +----------------+ | | +----------------+ | 
                            +---------^----------+ +----------^---------+ +----------^---------+ 
                                      |                       |                      |           
                           +--------------------------------------------------------------------+
                           |+--------------------+ +--------------------+ +--------------------+|
                           ||       QEMU         | |       QEMU         | |       QEMU         ||
                           ||                    | |                    | |                    ||
                           |+--------------------+ +--------------------+ +--------------------+|
                           +--------------------------------------------------------------------+
                                      |                       |                      |
+-----------------------------------------------------------------------------------------------+
|                           +----------------------------------------------------------------+  |
|                           |                                VFIO                            |  |
|                           |                                                                |  |
| +-----------------------+ | +-------------------------------------------------------------+|  |
| |                       | | |                                                             ||  |
| |     nova_core        <--->|                                                             ||  |
| +    (core driver)      + | |                      NVIDIA vGPU VFIO Driver                ||  |
| |                       | | |                                                             ||  |
| |                       | | +-------------------------------------------------------------+|  |
| +--------^--------------+ +----------------------------------------------------------------+  |
|          |                          |                       |                      |          |
+-----------------------------------------------------------------------------------------------+
           |                          |                       |                      |           
+----------|--------------------------|-----------------------|----------------------|----------+
|          v               +----------v---------+ +-----------v--------+ +-----------v--------+ |
|  NVIDIA                  |       PCI VF       | |       PCI VF       | |       PCI VF       | |
|  Physical GPU            |                    | |                    | |                    | |
|                          |   (Virtual GPU)    | |   (Virtual GPU)    | |    (Virtual GPU)   | |
|                          +--------------------+ +--------------------+ +--------------------+ |
+-----------------------------------------------------------------------------------------------+

Each virtual GPU (vGPU) instance is implemented atop a PCIe Virtual
Function (VF). The NVIDIA vGPU VFIO driver, in coordination with the
VFIO framework, operates directly on these VFs to enable key
functionalities including vGPU type selection, dynamic instantiation and
destruction of vGPU instances, support for live migration, and warm
update.

Consistent with other VFIO variant drivers, the NVIDIA vGPU VFIO driver
adheres to the standard VFIO userspace interface, facilitating device
lifecycle management and integration with advanced VFIO capabilities.

At the low level, the NVIDIA vGPU VFIO driver interfaces with the core
driver, which provides the necessary abstractions and mechanisms to access
and manipulate the underlying GPU hardware resources.

2.2 Core Driver (nova_core)
---------------------------

The primary deployment model for cloud service providers (CSPs) and
enterprise environments is to have a standalone, minimal driver stack
with the vGPU support and other essential components. Thus, a minimal
core driver is required to support the NVIDIA vGPU VFIO driver.

The core GPU driver provides the foundational infrastructure necessary
to support the following operations:

- Firmware management: Load the GSP (GPU System Processor) firmware,
initiate GSP boot procedures, and establish the communication channel
between the host and the GSP.

- Hardware resource management: Control and partition shared GPU
resources-such as framebuffer memory and hardware channels-used by the
VFIO driver for instantiating and operating vGPUs.

- Exception handling: Relay hardware and firmware-level exception events,
including GSP notifications, to the VFIO driver. E.g. FIFO nonstall.

- Host event coordination: Handle system-wide events such as suspend and
resume, PF driver unbind, etc. ensuring proper synchronization with GPU
subsystems.

- Hardware configuration enumeration: Discover and expose static and
dynamic hardware capabilities required for vGPU orchestration. E.g.
engine bitmap, total FB memory size.

2.3 NVIDIA vGPU VFIO Driver
---------------------------

The NVIDIA vGPU VFIO driver exposes standard VFIO interfaces for userspace
access to vGPUs, while also providing control paths for vGPU creation and
destruction with the help of core driver.

The driver provides an additional sysfs interface for the admin to query
the creatable vGPU types on a VF. Once the vGPU type is selected, the
userspace VMM, e.g. QEMU can manipulate the VF via the standard VFIO
device interfaces. Only homogeneous vGPU has been supported.

As different NVIDIA GPUs support different available vGPU types, a
loadable vGPU metadata file is introduced to host those blobs,
which are the support vGPU types on supported NVIDIA GPUs. It is loaded
together with the VFIO driver. The VFIO driver chooses the usable vGPU
types from it based on the installed NVIDIA GPU in the system.

The driver also exposes an per-vGPU logging interface to collect the GSP
logs for bug report.

2.4 Changes from RFC [3]
-----------------------------

- vGPU is supported since GSP microcode 570.
- Multiple vGPU support with homogeneous scheme.
- CE workload submission for FB memory scrubbing.
- Interface to create/destroy/select vGPUs.
- Loadable vGPU type support via vGPU metadata file.
- Expose per-vGPU GSP log.
- Proper VFIO driver attach/detach flow.
- PF driver event forwarding to support PF driver unbind by admin.

3 Try the patches
-----------------------

- Host kernel: http://github.com/zhiwang-nvidia/linux/tree/zhi/vgpu-rfc-v2
- vGPU metadata file: https://github.com/zhiwang-nvidia/vgpu-tools/blob/metadata/metadata/18.1/vgpu-570.144.bin
  The metadata file needs to be placed at: /lib/firmware/nvidia
- Guest driver package: NVIDIA-Linux-x86_64-570.124.04.run [4]

  Install guest driver:
  # export GRID_BUILD=1
  # ./NVIDIA-Linux-x86_64-570.124.04.run

- Tested platforms: RTX A6000 Ada.
- Tested host OS: RHEL 8.4.
- Tested guest OS: Ubutnu 24.04 LTS, Windows 11.
- Supported experience: Rich desktop experience with simple 3D workload,
  e.g. glmark2, heaven.

- Demo video: running heaven on two -24Q vGPUs on NVIDIA RTX A6000 Ada [5]

[1] https://www.nvidia.com/en-us/data-center/virtual-solutions/
[2] https://docs.nvidia.com/vgpu/17.0/grid-vgpu-user-guide/index.html#architecture-grid-vgpu
[3] https://lore.kernel.org/kvm/20240922161121.000060a0.zhiw@nvidia.com/T/
[4] https://us.download.nvidia.com/XFree86/Linux-x86_64/570.124.04/NVIDIA-Linux-x86_64-570.124.04.run
[5] https://youtu.be/DhW--wVlLfU

Zhi Wang (14):
  vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude
  vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager
  vfio/nvidia-vgpu: introduce vGPU type uploading
  vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs
  vfio/nvidia-vgpu: allocate vGPU FB memory when creating vGPUs
  vfio/nvidia-vgpu: allocate mgmt heap when creating vGPUs
  vfio/nvidia-vgpu: map mgmt heap when creating a vGPU
  vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs
  vfio/nvidia-vgpu: bootload the new vGPU
  vfio/nvidia-vgpu: introduce vGPU host RPC channel
  vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver
  vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU
  vfio/nvidia-vgpu: introduce vGPU logging
  vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU

 .../ABI/stable/sysfs-driver-nvidia-vgpu       |  11 +
 Documentation/gpu/drivers.rst                 |   1 +
 Documentation/gpu/nvidia-vgpu.rst             | 264 +++++++
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   2 +
 drivers/vfio/pci/nvidia-vgpu/Kconfig          |  15 +
 drivers/vfio/pci/nvidia-vgpu/Makefile         |   6 +
 drivers/vfio/pci/nvidia-vgpu/debug.h          |  35 +
 drivers/vfio/pci/nvidia-vgpu/debugfs.c        |  65 ++
 .../pci/nvidia-vgpu/include/nvrm/bootload.h   |  58 ++
 .../vfio/pci/nvidia-vgpu/include/nvrm/ecc.h   |  45 ++
 .../vfio/pci/nvidia-vgpu/include/nvrm/gsp.h   |  18 +
 .../nvidia-vgpu/include/nvrm/nv_vgpu_types.h  |  34 +
 .../pci/nvidia-vgpu/include/nvrm/nvtypes.h    |  26 +
 .../vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h  | 182 +++++
 .../vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h  |  39 +
 drivers/vfio/pci/nvidia-vgpu/metadata.c       | 319 ++++++++
 drivers/vfio/pci/nvidia-vgpu/metadata.h       |  89 +++
 .../vfio/pci/nvidia-vgpu/metadata_vgpu_type.c | 153 ++++
 drivers/vfio/pci/nvidia-vgpu/pf.h             | 145 ++++
 drivers/vfio/pci/nvidia-vgpu/rpc.c            | 254 ++++++
 drivers/vfio/pci/nvidia-vgpu/vfio.h           |  65 ++
 drivers/vfio/pci/nvidia-vgpu/vfio_access.c    | 313 ++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c   | 117 +++
 drivers/vfio/pci/nvidia-vgpu/vfio_main.c      | 730 ++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c     | 209 +++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           | 690 +++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       | 450 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       | 231 ++++++
 29 files changed, 4568 insertions(+)
 create mode 100644 Documentation/ABI/stable/sysfs-driver-nvidia-vgpu
 create mode 100644 Documentation/gpu/nvidia-vgpu.rst
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debug.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debugfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/pf.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/rpc.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_access.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_main.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC v2 01/14] vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
@ 2025-09-03 22:10 ` Zhi Wang
  2025-09-03 22:10 ` [RFC v2 02/14] vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager Zhi Wang
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:10 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

To introduce the routines when creating a vGPU one by one in the
following patches, first, introduce the prelude of the vGPU lifecycle
management as the skeleton.

Introduce the vGPU lifecycle managemement data structures and routines.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/Kconfig                |   2 +
 drivers/vfio/pci/Makefile               |   2 +
 drivers/vfio/pci/nvidia-vgpu/Kconfig    |  15 ++
 drivers/vfio/pci/nvidia-vgpu/Makefile   |   3 +
 drivers/vfio/pci/nvidia-vgpu/debug.h    |  17 +++
 drivers/vfio/pci/nvidia-vgpu/pf.h       |  65 ++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 101 +++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c | 193 ++++++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h |  79 ++++++++++
 9 files changed, 477 insertions(+)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Kconfig
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/Makefile
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debug.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/pf.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f54665..4bb2ddb120cc 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,6 +65,8 @@ source "drivers/vfio/pci/virtio/Kconfig"
 
 source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
 
+source "drivers/vfio/pci/nvidia-vgpu/Kconfig"
+
 source "drivers/vfio/pci/qat/Kconfig"
 
 endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..0e56f2e7ea36 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -18,4 +18,6 @@ obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
 
 obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
 
+obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia-vgpu/
+
 obj-$(CONFIG_QAT_VFIO_PCI) += qat/
diff --git a/drivers/vfio/pci/nvidia-vgpu/Kconfig b/drivers/vfio/pci/nvidia-vgpu/Kconfig
new file mode 100644
index 000000000000..3a0dab70e31d
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVIDIA_VGPU_VFIO_PCI
+	tristate "VFIO support for the NVIDIA vGPU"
+	select VFIO_PCI_CORE
+	help
+	  This option enables VFIO (Virtual Function I/O) support for
+	  NVIDIA virtual GPUs (vGPU). It allows the assignment of a virtual
+	  GPU instance to userspace applications via VFIO, typically used
+	  with hypervisors such as KVM and device emulators like QEMU.
+
+          The NVIDIA vGPU allows a physical GPU to be partitioned into
+	  multiple virtual GPUs, each of which can be passed to a virtual
+	  machine as a PCI device using the standard VFIO infrastructure.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvidia-vgpu/Makefile b/drivers/vfio/pci/nvidia-vgpu/Makefile
new file mode 100644
index 000000000000..14ff08175231
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia_vgpu_vfio_pci.o
+nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o
diff --git a/drivers/vfio/pci/nvidia-vgpu/debug.h b/drivers/vfio/pci/nvidia-vgpu/debug.h
new file mode 100644
index 000000000000..19a2ecd8863e
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/debug.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#ifndef __NVIDIA_VGPU_DEBUG_H__
+#define __NVIDIA_VGPU_DEBUG_H__
+
+#define vgpu_mgr_debug(v, f, a...) \
+	pci_dbg((v)->handle.pf_pdev, "nvidia-vgpu-mgr: "f, ##a)
+
+#define vgpu_debug(v, f, a...) ({ \
+	typeof(v) __v = (v); \
+	pci_dbg(__v->pdev, "nvidia-vgpu %d: "f, __v->info.id, ##a); \
+})
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
new file mode 100644
index 000000000000..e8a11dd29427
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+#ifndef __NVIDIA_VGPU_PF_H__
+#define __NVIDIA_VGPU_PF_H__
+
+#include <linux/pci.h>
+#include <drm/nvidia_vgpu_vfio_pf_intf.h>
+
+struct nvidia_vgpu_mgr_handle {
+	void *pf_drvdata;
+	struct pci_dev *pf_pdev;
+	struct nvidia_vgpu_vfio_ops *ops;
+};
+
+static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
+					      struct nvidia_vgpu_mgr_handle *h)
+{
+	struct pci_dev *pf_pdev;
+
+	if (!pdev->is_virtfn)
+		return -EINVAL;
+
+	pf_pdev = pdev->physfn;
+
+	h->ops = NULL;
+	h->pf_pdev = pf_pdev;
+	h->pf_drvdata = pci_get_drvdata(pf_pdev);
+
+	if (strcmp(pf_pdev->driver->name, "NovaCore")) {
+		pr_err("Cannot find an available PF driver!\n");
+		return -EINVAL;
+	}
+
+	h->ops = nova_vgpu_get_vfio_ops(h->pf_drvdata);
+	return 0;
+}
+
+#define nvidia_vgpu_mgr_support_is_enabled(h) ({ \
+	typeof(h) __h = (h); \
+	__h->ops->vgpu_is_enabled(__h->pf_drvdata); \
+})
+
+#define nvidia_vgpu_mgr_attach_handle(h, data) ({ \
+	typeof(h) __h = (h); \
+	__h->ops->attach_handle(__h->pf_drvdata, data); \
+})
+
+#define nvidia_vgpu_mgr_detach_handle(h) ({ \
+	typeof(h) __h = (h); \
+	__h->ops->detach_handle(__h->pf_drvdata); \
+})
+
+#define nvidia_vgpu_mgr_get_avail_chids(m) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->get_avail_chids(__m->handle.pf_drvdata); \
+})
+
+#define nvidia_vgpu_mgr_get_total_fbmem_size(m) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->get_total_fbmem_size(__m->handle.pf_drvdata); \
+})
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
new file mode 100644
index 000000000000..79e6a9f16f74
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include "debug.h"
+#include "vgpu_mgr.h"
+
+static void unregister_vgpu(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+
+	mutex_lock(&vgpu_mgr->vgpu_list_lock);
+
+	list_del(&vgpu->vgpu_list);
+	atomic_dec(&vgpu_mgr->num_vgpus);
+
+	mutex_unlock(&vgpu_mgr->vgpu_list_lock);
+
+	vgpu_debug(vgpu, "unregistered\n");
+}
+
+static int register_vgpu(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu *p;
+
+	mutex_lock(&vgpu_mgr->vgpu_list_lock);
+
+	nvidia_vgpu_mgr_for_each_vgpu(p, vgpu_mgr) {
+		if (WARN_ON(p->info.id == vgpu->info.id)) {
+			mutex_unlock(&vgpu_mgr->vgpu_list_lock);
+			return -EBUSY;
+		}
+	}
+
+	list_add_tail(&vgpu->vgpu_list, &vgpu_mgr->vgpu_list_head);
+	atomic_inc(&vgpu_mgr->num_vgpus);
+
+	mutex_unlock(&vgpu_mgr->vgpu_list_lock);
+
+	vgpu_debug(vgpu, "registered\n");
+	return 0;
+}
+
+/**
+ * nvidia_vgpu_mgr_destroy_vgpu - destroy a vGPU instance
+ * @vgpu: the vGPU instance going to be destroyed.
+ *
+ * Returns: 0 on success, others on failure.
+ */
+int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
+{
+	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
+		return -ENODEV;
+
+	unregister_vgpu(vgpu);
+
+	vgpu_debug(vgpu, "destroyed\n");
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_destroy_vgpu);
+
+/**
+ * nvidia_vgpu_mgr_create_vgpu - create a vGPU instance
+ * @vgpu: the vGPU instance going to be created.
+ *
+ * The caller must initialize vgpu->vgpu_mgr, vgpu->pdev and vgpu->info.
+ *
+ * Returns: 0 on success, others on failure.
+ */
+int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_info *info = &vgpu->info;
+	int ret;
+
+	if (WARN_ON(!info->gfid || !info->dbdf))
+		return -EINVAL;
+
+	if (WARN_ON(!vgpu->vgpu_mgr || !vgpu->pdev))
+		return -EINVAL;
+
+	mutex_init(&vgpu->lock);
+	INIT_LIST_HEAD(&vgpu->vgpu_list);
+
+	vgpu->info = *info;
+
+	vgpu_debug(vgpu, "create vgpu on vgpu_mgr %px\n", vgpu->vgpu_mgr);
+
+	ret = register_vgpu(vgpu);
+	if (ret)
+		return ret;
+
+	atomic_set(&vgpu->status, 1);
+
+	vgpu_debug(vgpu, "created\n");
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_create_vgpu);
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
new file mode 100644
index 000000000000..3ef81b89c748
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -0,0 +1,193 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include "debug.h"
+#include "vgpu_mgr.h"
+
+static void vgpu_mgr_release(struct kref *kref)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr =
+		container_of(kref, struct nvidia_vgpu_mgr, refcount);
+
+	vgpu_mgr_debug(vgpu_mgr, "release\n");
+
+	if (WARN_ON(atomic_read(&vgpu_mgr->num_vgpus)))
+		return;
+
+	kvfree(vgpu_mgr);
+}
+
+static void detach_vgpu_mgr(struct nvidia_vgpu_vfio_handle_data *handle_data)
+{
+	handle_data->vfio.private_data = NULL;
+}
+
+static void pf_detach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data *handle_data)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = handle_data->vfio.private_data;
+
+	vgpu_mgr_debug(vgpu_mgr, "put\n");
+
+	if (kref_put(&vgpu_mgr->refcount, vgpu_mgr_release))
+		detach_vgpu_mgr(handle_data);
+}
+
+/**
+ * nvidia_vgpu_mgr_release - release the vGPU manager
+ * @vgpu_mgr: the vGPU manager to release.
+ */
+void nvidia_vgpu_mgr_release(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	if (!nvidia_vgpu_mgr_support_is_enabled(&vgpu_mgr->handle))
+		return;
+
+	nvidia_vgpu_mgr_detach_handle(&vgpu_mgr->handle);
+}
+EXPORT_SYMBOL(nvidia_vgpu_mgr_release);
+
+static struct nvidia_vgpu_mgr *alloc_vgpu_mgr(struct nvidia_vgpu_mgr_handle *handle)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr;
+
+	vgpu_mgr = kvzalloc(sizeof(*vgpu_mgr), GFP_KERNEL);
+	if (!vgpu_mgr)
+		return ERR_PTR(-ENOMEM);
+
+	vgpu_mgr->handle = *handle;
+
+	kref_init(&vgpu_mgr->refcount);
+	mutex_init(&vgpu_mgr->vgpu_list_lock);
+	INIT_LIST_HEAD(&vgpu_mgr->vgpu_list_head);
+	atomic_set(&vgpu_mgr->num_vgpus, 0);
+
+	return vgpu_mgr;
+}
+
+static const char *pf_events_string[NVIDIA_VGPU_PF_EVENT_MAX] = {
+	[NVIDIA_VGPU_PF_DRIVER_EVENT_SRIOV_CONFIGURE] = "SRIOV configure",
+	[NVIDIA_VGPU_PF_DRIVER_EVENT_DRIVER_UNBIND] = "driver unbind",
+};
+
+static int pf_event_notify_fn(void *priv, unsigned int event, void *data)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = priv;
+
+	if (WARN_ON(event >= NVIDIA_VGPU_PF_EVENT_MAX))
+		return -EINVAL;
+
+	vgpu_mgr_debug(vgpu_mgr, "handle PF event %s\n", pf_events_string[event]);
+
+	/* more to come. */
+	return 0;
+}
+
+static void attach_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr,
+			    struct nvidia_vgpu_vfio_handle_data *handle_data)
+{
+	handle_data->vfio.handle = vgpu_mgr->handle.pf_drvdata;
+	handle_data->vfio.module = THIS_MODULE;
+	handle_data->vfio.private_data = vgpu_mgr;
+	handle_data->vfio.pf_event_notify_fn = pf_event_notify_fn;
+	handle_data->vfio.pf_detach_handle_fn = pf_detach_handle_fn;
+}
+
+static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	vgpu_mgr->total_avail_chids = nvidia_vgpu_mgr_get_avail_chids(vgpu_mgr);
+	vgpu_mgr->total_fbmem_size = nvidia_vgpu_mgr_get_total_fbmem_size(vgpu_mgr);
+
+	vgpu_mgr_debug(vgpu_mgr, "total avail chids %u\n", vgpu_mgr->total_avail_chids);
+	vgpu_mgr_debug(vgpu_mgr, "total fbmem size 0x%llx\n", vgpu_mgr->total_fbmem_size);
+
+	return 0;
+}
+
+static int setup_pf_driver_caps(struct nvidia_vgpu_mgr *vgpu_mgr, unsigned long *caps)
+{
+	/* more to come */
+	return 0;
+}
+
+static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data *handle_data,
+			       struct nvidia_vgpu_vfio_attach_handle_data *attach_data)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr;
+	int ret;
+
+	/* PF driver is unbinding */
+	if (handle_data->pf.driver_is_unbound)
+		return -ENODEV;
+
+	if (handle_data->vfio.private_data) {
+		vgpu_mgr = handle_data->vfio.private_data;
+
+		ret = attach_data->init_vfio_fn(vgpu_mgr, attach_data->init_vfio_fn_data);
+		if (ret)
+			return ret;
+
+		kref_get(&vgpu_mgr->refcount);
+		vgpu_mgr_debug(vgpu_mgr, "use existing %px\n", vgpu_mgr);
+		return 0;
+	}
+
+	vgpu_mgr = alloc_vgpu_mgr(attach_data->vgpu_mgr_handle);
+	if (IS_ERR(vgpu_mgr))
+		return PTR_ERR(vgpu_mgr);
+
+	ret = setup_pf_driver_caps(vgpu_mgr, handle_data->pf.driver_caps);
+	if (ret)
+		goto fail_setup_pf_driver_caps;
+
+	ret = init_vgpu_mgr(vgpu_mgr);
+	if (ret)
+		goto fail_init_vgpu_mgr;
+
+	attach_vgpu_mgr(vgpu_mgr, handle_data);
+
+	ret = attach_data->init_vfio_fn(vgpu_mgr, attach_data->init_vfio_fn_data);
+	if (ret)
+		goto fail_init_fn;
+
+	vgpu_mgr_debug(vgpu_mgr, "created new %px\n", vgpu_mgr);
+
+	return 0;
+
+fail_init_fn:
+	detach_vgpu_mgr(handle_data);
+fail_init_vgpu_mgr:
+fail_setup_pf_driver_caps:
+	kvfree(vgpu_mgr);
+	return ret;
+}
+
+/**
+ * nvidia_vgpu_mgr_setup - setup the vGPU manager
+ * @dev: the VF pci_dev.
+ * @init_fn: the init function of VFIO interfaces
+ * @init_fn_data: the init function data of VFIO interfaces
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_mgr_setup(struct pci_dev *dev, int (*init_vfio_fn)(void *priv, void *data),
+			  void *init_vfio_fn_data)
+{
+	struct nvidia_vgpu_mgr_handle handle = {0};
+	struct nvidia_vgpu_vfio_attach_handle_data attach_handle_data;
+	int ret;
+
+	ret = nvidia_vgpu_mgr_init_handle(dev, &handle);
+	if (ret)
+		return ret;
+
+	if (!nvidia_vgpu_mgr_support_is_enabled(&handle))
+		return -ENODEV;
+
+	attach_handle_data.pf_attach_handle_fn = pf_attach_handle_fn;
+	attach_handle_data.init_vfio_fn = init_vfio_fn;
+	attach_handle_data.init_vfio_fn_data = init_vfio_fn_data;
+	attach_handle_data.vgpu_mgr_handle = &handle;
+
+	return nvidia_vgpu_mgr_attach_handle(&handle, &attach_handle_data);
+}
+EXPORT_SYMBOL(nvidia_vgpu_mgr_setup);
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
new file mode 100644
index 000000000000..9fe25b2d8ec1
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+#ifndef __NVIDIA_VGPU_MGR_H__
+#define __NVIDIA_VGPU_MGR_H__
+
+#include "pf.h"
+
+/**
+ * struct nvidia_vgpu_info - vGPU information
+ *
+ * @id: vGPU ID
+ * @gfid: VF function ID
+ * @dbdf: VF BDF
+ */
+struct nvidia_vgpu_info {
+	int id;
+	u32 gfid;
+	u32 dbdf;
+};
+
+/**
+ * struct nvidia_vgpu - per-vGPU state
+ *
+ * @lock: per-vGPU lock
+ * @pdev: PCI device
+ * @status: vGPU status
+ * @vgpu_list: list node to the vGPU list
+ * @info: vGPU info
+ * @vgpu_mgr: pointer to vGPU manager
+ */
+struct nvidia_vgpu {
+	/* Per-vGPU lock */
+	struct mutex lock;
+	struct pci_dev *pdev;
+	atomic_t status;
+	struct list_head vgpu_list;
+
+	struct nvidia_vgpu_info info;
+	struct nvidia_vgpu_mgr *vgpu_mgr;
+};
+
+/**
+ * struct nvidia_vgpu_mgr - the vGPU manager
+ *
+ * @refcount: the reference count
+ * @handle: the driver handle
+ * @total_avail_chids: total available channel IDs
+ * @total_fbmem_size: total FB memory size
+ * @vgpu_list_lock: lock to protect vGPU list
+ * @vgpu_list_head: list head of vGPU list
+ * @num_vgpus: number of vGPUs in the vGPU list
+ */
+struct nvidia_vgpu_mgr {
+	struct kref refcount;
+	struct nvidia_vgpu_mgr_handle handle;
+
+	/* core driver configurations */
+	u32 total_avail_chids;
+	u64 total_fbmem_size;
+
+	/* lock for vGPU list */
+	struct mutex vgpu_list_lock;
+	struct list_head vgpu_list_head;
+	atomic_t num_vgpus;
+};
+
+#define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
+	list_for_each_entry((vgpu), &(vgpu_mgr)->vgpu_list_head, vgpu_list)
+
+int nvidia_vgpu_mgr_setup(struct pci_dev *dev, int (*init_vfio_fn)(void *priv, void *data),
+			  void *init_vfio_fn_data);
+void nvidia_vgpu_mgr_release(struct nvidia_vgpu_mgr *vgpu_mgr);
+
+int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu);
+int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu);
+
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 02/14] vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
  2025-09-03 22:10 ` [RFC v2 01/14] vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude Zhi Wang
@ 2025-09-03 22:10 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading Zhi Wang
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:10 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

A GSP RM client is required when talking to the GSP firmware via GSP RM
controls.

In order to create vGPUs, NVIDIA vGPU manager requires a GSP RM client
to acquire necessary information from GSP, upload vGPU types to GSP...

Allocate a dedicated GSP RM client for NVIDIA vGPU manager.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/pf.h       | 11 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c |  8 ++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index e8a11dd29427..044bc3aef5a6 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -62,4 +62,15 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 	__m->handle.ops->get_total_fbmem_size(__m->handle.pf_drvdata); \
 })
 
+#define nvidia_vgpu_mgr_alloc_gsp_client(m, c) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->alloc_gsp_client(__m->handle.pf_drvdata, c); \
+})
+
+#define nvidia_vgpu_mgr_free_gsp_client(m, c) \
+	((m)->handle.ops->free_gsp_client(c))
+
+#define nvidia_vgpu_mgr_get_gsp_client_handle(m, c) \
+	((m)->handle.ops->get_gsp_client_handle(c))
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index 3ef81b89c748..1455ca51eca1 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -16,6 +16,7 @@ static void vgpu_mgr_release(struct kref *kref)
 	if (WARN_ON(atomic_read(&vgpu_mgr->num_vgpus)))
 		return;
 
+	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
 	kvfree(vgpu_mgr);
 }
 
@@ -140,6 +141,11 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 	if (ret)
 		goto fail_setup_pf_driver_caps;
 
+	ret = nvidia_vgpu_mgr_alloc_gsp_client(vgpu_mgr,
+					       &vgpu_mgr->gsp_client);
+	if (ret)
+		goto fail_alloc_gsp_client;
+
 	ret = init_vgpu_mgr(vgpu_mgr);
 	if (ret)
 		goto fail_init_vgpu_mgr;
@@ -157,6 +163,8 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 fail_init_fn:
 	detach_vgpu_mgr(handle_data);
 fail_init_vgpu_mgr:
+	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
+fail_alloc_gsp_client:
 fail_setup_pf_driver_caps:
 	kvfree(vgpu_mgr);
 	return ret;
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 9fe25b2d8ec1..98dcbb682b92 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -51,6 +51,7 @@ struct nvidia_vgpu {
  * @vgpu_list_lock: lock to protect vGPU list
  * @vgpu_list_head: list head of vGPU list
  * @num_vgpus: number of vGPUs in the vGPU list
+ * @gsp_client: the GSP client
  */
 struct nvidia_vgpu_mgr {
 	struct kref refcount;
@@ -64,6 +65,8 @@ struct nvidia_vgpu_mgr {
 	struct mutex vgpu_list_lock;
 	struct list_head vgpu_list_head;
 	atomic_t num_vgpus;
+
+	struct nvidia_vgpu_gsp_client gsp_client;
 };
 
 #define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
  2025-09-03 22:10 ` [RFC v2 01/14] vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude Zhi Wang
  2025-09-03 22:10 ` [RFC v2 02/14] vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-04  9:37   ` Danilo Krummrich
  2025-09-03 22:11 ` [RFC v2 04/14] vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs Zhi Wang
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

Each type of vGPU is designed to meet specific requirements, from
supporting multiple users with demanding graphics applications to
powering AI workloads in virtualized environments.

To create a vGPU associated with a vGPU type, the vGPU type specs are
required to be uploaded to GSP firmware.

Intorduce vGPU metadata uploading framework to check and upload vGPU
types from the vGPU metadata file when vGPU is enabled.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/Makefile         |   4 +-
 drivers/vfio/pci/nvidia-vgpu/debug.h          |   3 +
 .../vfio/pci/nvidia-vgpu/include/nvrm/gsp.h   |  18 +
 .../pci/nvidia-vgpu/include/nvrm/nvtypes.h    |  26 ++
 drivers/vfio/pci/nvidia-vgpu/metadata.c       | 319 ++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/metadata.h       |  89 +++++
 .../vfio/pci/nvidia-vgpu/metadata_vgpu_type.c | 153 +++++++++
 drivers/vfio/pci/nvidia-vgpu/pf.h             |  12 +
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           |   5 +-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       |   7 +
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  27 ++
 11 files changed, 660 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c

diff --git a/drivers/vfio/pci/nvidia-vgpu/Makefile b/drivers/vfio/pci/nvidia-vgpu/Makefile
index 14ff08175231..94ba4ed4e131 100644
--- a/drivers/vfio/pci/nvidia-vgpu/Makefile
+++ b/drivers/vfio/pci/nvidia-vgpu/Makefile
@@ -1,3 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0-only
+subdir-ccflags-y += -I$(src)/include
+
 obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia_vgpu_vfio_pci.o
-nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o
+nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o
diff --git a/drivers/vfio/pci/nvidia-vgpu/debug.h b/drivers/vfio/pci/nvidia-vgpu/debug.h
index 19a2ecd8863e..7cf92c9060ae 100644
--- a/drivers/vfio/pci/nvidia-vgpu/debug.h
+++ b/drivers/vfio/pci/nvidia-vgpu/debug.h
@@ -9,6 +9,9 @@
 #define vgpu_mgr_debug(v, f, a...) \
 	pci_dbg((v)->handle.pf_pdev, "nvidia-vgpu-mgr: "f, ##a)
 
+#define vgpu_mgr_error(v, f, a...) \
+	pci_err((v)->handle.pf_pdev, "nvidia-vgpu-mgr: "f, ##a)
+
 #define vgpu_debug(v, f, a...) ({ \
 	typeof(v) __v = (v); \
 	pci_dbg(__v->pdev, "nvidia-vgpu %d: "f, __v->info.id, ##a); \
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
new file mode 100644
index 000000000000..c3fb7b299533
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: MIT */
+#ifndef __NVRM_GSP_H__
+#define __NVRM_GSP_H__
+
+#include <nvrm/nvtypes.h>
+
+/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570 */
+
+#define NV2080_CTRL_CMD_GSP_GET_FEATURES (0x20803601)
+
+typedef struct NV2080_CTRL_GSP_GET_FEATURES_PARAMS {
+	NvU32  gspFeatures;
+	NvBool bValid;
+	NvBool bDefaultGspRmGpu;
+	NvU8   firmwareVersion[GSP_MAX_BUILD_VERSION_LENGTH];
+} NV2080_CTRL_GSP_GET_FEATURES_PARAMS;
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h
new file mode 100644
index 000000000000..5445ba15500f
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nvtypes.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: MIT */
+#ifndef __NVRM_NVTYPES_H__
+#define __NVRM_NVTYPES_H__
+
+#define NV_ALIGN_BYTES(a) __attribute__ ((__aligned__(a)))
+#define NV_DECLARE_ALIGNED(f, a) f __attribute__ ((__aligned__(a)))
+
+typedef u32 NvV32;
+
+typedef u8 NvU8;
+typedef u16 NvU16;
+typedef u32 NvU32;
+typedef u64 NvU64;
+
+typedef void* NvP64;
+
+typedef NvU8 NvBool;
+typedef NvU32 NvHandle;
+typedef NvU64 NvLength;
+
+typedef NvU64 RmPhysAddr;
+
+typedef NvU32 NV_STATUS;
+
+typedef union {} rpc_generic_union;
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/metadata.c b/drivers/vfio/pci/nvidia-vgpu/metadata.c
new file mode 100644
index 000000000000..8e2c326c43f4
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/metadata.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/crc32.h>
+#include <linux/firmware.h>
+
+#include "debug.h"
+#include "vgpu_mgr.h"
+#include "metadata.h"
+
+#include <nvrm/gsp.h>
+
+/* Sanity checks on main headers */
+static int check_main_headers(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	struct metadata_blob_hdr *blob;
+	u32 crc;
+
+	if (fw->size <= sizeof(*hdr)) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: file is too small\n");
+		return -EINVAL;
+	}
+
+	crc = crc32_le(0xffffffff, fw->data + 16, fw->size - 16);
+	if (crc != hdr->crc32) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: invalid CRC\n");
+		return -EINVAL;
+	}
+
+	if (memcmp(&hdr->identifier, METADATA_IDR, sizeof(hdr->identifier))) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: invalid identifier\n");
+		return -EINVAL;
+	}
+
+	if (!hdr->num_blobs ||
+	    (hdr->num_blobs > (fw->size - sizeof(*hdr)) / sizeof(*blob))) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: invalid num_blobs\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int get_running_gsp_build_version(struct nvidia_vgpu_mgr *vgpu_mgr,
+					 char *running_gsp_build_version)
+{
+	NV2080_CTRL_GSP_GET_FEATURES_PARAMS *ctrl;
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_rd(vgpu_mgr, &vgpu_mgr->gsp_client,
+			NV2080_CTRL_CMD_GSP_GET_FEATURES, sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	memcpy(running_gsp_build_version, ctrl->firmwareVersion, GSP_MAX_BUILD_VERSION_LENGTH);
+
+	nvidia_vgpu_mgr_rm_ctrl_done(vgpu_mgr, &vgpu_mgr->gsp_client, ctrl);
+
+	vgpu_mgr_debug(vgpu_mgr, "running GSP build version %s\n", running_gsp_build_version);
+
+	return 0;
+}
+
+struct version {
+	u64 vgpu_major;
+	u64 vgpu_minor;
+	const char *gsp_build_version;
+};
+
+static struct version supported_version_list[] = {
+	{ 18, 1, "570.144" },
+};
+
+/* check supported versions */
+static int check_versions(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw,
+			  char *running_gsp_build_version)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	unsigned int i;
+
+	/*
+	 * The running GSP metadata supports vGPU (or we won't be here).
+	 * Check if the vGPU metadata file matches with the version of GSP metadata.
+	 */
+	if (strncmp(running_gsp_build_version, hdr->gsp_build_version,
+		    GSP_MAX_BUILD_VERSION_LENGTH)) {
+		vgpu_mgr_error(vgpu_mgr, "unexpected metadata GSP version %s, running %s\n",
+			       hdr->gsp_build_version, running_gsp_build_version);
+		return -EINVAL;
+	}
+
+	/* Check vGPU release version. */
+	for (i = 0; i < ARRAY_SIZE(supported_version_list); i++) {
+		struct version *v = supported_version_list + i;
+
+		if (strncmp(v->gsp_build_version, hdr->gsp_build_version,
+			    GSP_MAX_BUILD_VERSION_LENGTH))
+			continue;
+
+		if (v->vgpu_major == hdr->vgpu_major && v->vgpu_minor == hdr->vgpu_minor)
+			break;
+	}
+
+	if (i == ARRAY_SIZE(supported_version_list)) {
+		vgpu_mgr_error(vgpu_mgr, "unexpected metadata vGPU %llu.%llu GSP %s, running %s\n",
+			       hdr->vgpu_major, hdr->vgpu_minor, hdr->gsp_build_version,
+			       running_gsp_build_version);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+#define for_each_blob(hdr, blob, i) \
+	for (i = 0, blob = (typeof(blob))hdr->data; i < hdr->num_blobs; \
+	     i++, blob = ((void *)blob) + blob->size)
+
+/* Sanity check on blob headers */
+static int check_blob_headers(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	struct metadata_blob_hdr *blob;
+	unsigned int i;
+
+	for_each_blob(hdr, blob, i) {
+		vgpu_mgr_debug(vgpu_mgr, "check blob header %u type 0x%llx size 0x%llx\n",
+			       i, blob->type, blob->size);
+
+		if (blob->type >= METADATA_BLOB_MAX) {
+			vgpu_mgr_error(vgpu_mgr, "unknown blob type 0x%llx\n", blob->type);
+			return -EINVAL;
+		}
+
+		if (blob->size <= sizeof(*blob) ||
+		    (blob->size > (fw->size - ((void *)blob - (void *)fw->data)))) {
+			vgpu_mgr_error(vgpu_mgr, "invalid blob_size 0x%llx\n", blob->size);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+typedef int (*blob_handler_t)(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob, u64 blob_size);
+
+struct blob_handler_fn {
+	blob_handler_t check;
+	blob_handler_t setup;
+	blob_handler_t post_setup;
+	blob_handler_t clean;
+};
+
+struct blob_handler_fn blob_handlers[] = {
+	[METADATA_BLOB_VGPU_TYPE] = {
+		.check = nvidia_vgpu_metadata_check_vgpu_type,
+		.setup = nvidia_vgpu_metadata_setup_vgpu_type,
+		.post_setup = nvidia_vgpu_metadata_post_setup_vgpu_type,
+		.clean = nvidia_vgpu_metadata_clean_vgpu_type,
+	},
+};
+
+/* Check blobs in this metadata file */
+static int check_blobs(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	struct metadata_blob_hdr *blob;
+	unsigned int i;
+	int ret;
+
+	for_each_blob(hdr, blob, i) {
+		ret = blob_handlers[blob->type].check(vgpu_mgr, blob->data, blob->size);
+		if (ret) {
+			vgpu_mgr_error(vgpu_mgr, "metadata: blob is invalid, type: 0x%llx\n",
+				       blob->type);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+/* Setup blobs in this metadata file */
+static int setup_blobs(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	struct metadata_blob_hdr *blob;
+	unsigned int i;
+	int ret;
+
+	for_each_blob(hdr, blob, i) {
+		ret = blob_handlers[blob->type].setup(vgpu_mgr, blob->data, blob->size);
+		if (ret) {
+			vgpu_mgr_error(vgpu_mgr, "metadata: fail to setup blob, type: 0x%llx\n",
+				       blob->type);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+/* Final setup after installing all the blobs */
+static int post_setup_blobs(struct nvidia_vgpu_mgr *vgpu_mgr, const struct firmware *fw)
+{
+	struct metadata_hdr *hdr = (struct metadata_hdr *)fw->data;
+	unsigned int i;
+	int ret;
+
+	for (i = 0; i < ARRAY_SIZE(blob_handlers); i++) {
+		ret = blob_handlers[i].post_setup(vgpu_mgr, NULL, 0);
+		if (ret) {
+			vgpu_mgr_error(vgpu_mgr, "metadata: fail to post setup blob, type: 0x%x\n",
+				       i);
+			return ret;
+		}
+	}
+
+	vgpu_mgr->vgpu_major = hdr->vgpu_major;
+	vgpu_mgr->vgpu_minor = hdr->vgpu_minor;
+
+	return 0;
+}
+
+/* Clean all the installed blobs */
+static void clean_blobs(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(blob_handlers); i++)
+		blob_handlers[i].clean(vgpu_mgr, NULL, 0);
+}
+
+/**
+ * nvidia_vgpu_mgr_clean_metadata - clean vGPU metadata
+ * @vgpu_mgr: the vGPU manager.
+ */
+void nvidia_vgpu_mgr_clean_metadata(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	clean_blobs(vgpu_mgr);
+
+	vgpu_mgr_debug(vgpu_mgr, "clean vgpu metadata\n");
+}
+
+/**
+ * nvidia_vgpu_mgr_setup_metadata - setup vGPU metadata
+ * @vgpu_mgr: the vGPU manager.
+ *
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_mgr_setup_metadata(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	u8 running_gsp_build_version[GSP_MAX_BUILD_VERSION_LENGTH];
+	char *path;
+	const struct firmware *fw;
+	int ret = 0;
+
+	ret = get_running_gsp_build_version(vgpu_mgr, running_gsp_build_version);
+	if (ret)
+		return ret;
+
+	path = kvzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	snprintf(path, PATH_MAX, METADATA_PATH "vgpu-%s.bin",
+		 running_gsp_build_version);
+
+	vgpu_mgr_debug(vgpu_mgr, "request vgpu metadata %s\n", path);
+
+	ret = request_firmware(&fw, path, &vgpu_mgr->handle.pf_pdev->dev);
+
+	kvfree(path);
+
+	if (ret)
+		return ret;
+
+	vgpu_mgr_debug(vgpu_mgr, "check main headers\n");
+
+	ret = check_main_headers(vgpu_mgr, fw);
+	if (ret)
+		goto out_free_fw;
+
+	vgpu_mgr_debug(vgpu_mgr, "check versions\n");
+
+	ret = check_versions(vgpu_mgr, fw, running_gsp_build_version);
+	if (ret)
+		goto out_free_fw;
+
+	vgpu_mgr_debug(vgpu_mgr, "check blob headers\n");
+
+	ret = check_blob_headers(vgpu_mgr, fw);
+	if (ret)
+		goto out_free_fw;
+
+	vgpu_mgr_debug(vgpu_mgr, "check blobs\n");
+
+	ret = check_blobs(vgpu_mgr, fw);
+	if (ret)
+		goto out_free_fw;
+
+	vgpu_mgr_debug(vgpu_mgr, "setup blobs\n");
+
+	ret = setup_blobs(vgpu_mgr, fw);
+	if (ret)
+		goto out_free_fw;
+
+	vgpu_mgr_debug(vgpu_mgr, "post-setup blobs\n");
+
+	ret = post_setup_blobs(vgpu_mgr, fw);
+	if (ret) {
+		clean_blobs(vgpu_mgr);
+		goto out_free_fw;
+	}
+
+	vgpu_mgr_debug(vgpu_mgr, "metadata loaded, vgpu major %llu vgpu minor %llu\n",
+		       vgpu_mgr->vgpu_major, vgpu_mgr->vgpu_minor);
+
+out_free_fw:
+	release_firmware(fw);
+	return ret;
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/metadata.h b/drivers/vfio/pci/nvidia-vgpu/metadata.h
new file mode 100644
index 000000000000..c55da3e8e44f
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/metadata.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+#ifndef __NVIDIA_VGPU_METADATA_H__
+#define __NVIDIA_VGPU_METADATA_H__
+
+#define METADATA_PATH "nvidia/"
+#define METADATA_IDR "NVVGPUMT"
+
+enum {
+	METADATA_BLOB_VGPU_TYPE = 0,
+	METADATA_BLOB_MAX,
+};
+
+#define GSP_MAX_BUILD_VERSION_LENGTH (0x0000040)
+
+#define METADATA_VGPU_FEATURE_SIZE 128
+
+/**
+ * struct metadata_hdr - vGPU metafile main header
+ *
+ * @identifier: identifier to check
+ * @crc32: crc32 of the metafile
+ * @vgpu_major: vGPU major version
+ * @vgpu_minor: vGPU minor version
+ * @vgpu_features: vGPU features in a specific version
+ * @gsp_build_version: GSP build version for this vGPU version
+ * @num_blobs: total blob amount
+ * @data: blob data
+ */
+struct metadata_hdr {
+	u64 identifier; /* "NVVGPUMT" */
+	u32 crc32;
+	u32 padding;
+	u64 vgpu_major;
+	u64 vgpu_minor;
+	u8 vgpu_features[METADATA_VGPU_FEATURE_SIZE];
+	u8 gsp_build_version[GSP_MAX_BUILD_VERSION_LENGTH];
+	u64 num_blobs;
+	unsigned char data[];
+};
+
+/**
+ * struct metadata_blob_hdr - vGPU metafile blob section header
+ *
+ * @type: blob type
+ * @size: blob size
+ * @data: blob data
+ */
+struct metadata_blob_hdr {
+	u64 type;
+	u64 size;
+	unsigned char data[];
+};
+
+/**
+ * struct vgpu_type_blob_hdr - vGPU metafile vGPU type blob header
+ *
+ * @device_id: supported device ID
+ * @gsp_rmctrl_vgpu_info_offset: vgpu info offset in rmctrl part
+ * @gsp_rmctrl_vgpu_info_szie: vgpu info size in rmctrl part
+ * @kernel_struct_size: kernel struct size
+ * @num_kernel_struct: amount of kernel structs
+ * @gsp_rmctrl_cmd: GSP rmctrl command
+ * @gsp_rmctrl_size: GSP rmctl size
+ * @data: blob data
+ */
+struct vgpu_type_blob_hdr {
+	u64 device_id;
+	u64 gsp_rmctrl_vgpu_info_offset;
+	u64 gsp_rmctrl_vgpu_info_size;
+
+	u64 kernel_struct_size;
+	u64 num_kernel_structs;
+	u64 gsp_rmctrl_cmd;
+	u64 gsp_rmctrl_size;
+	unsigned char data[];
+};
+
+int nvidia_vgpu_metadata_check_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr,
+					 void *blob, u64 blob_size);
+int nvidia_vgpu_metadata_setup_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					 u64 blob_size);
+int nvidia_vgpu_metadata_post_setup_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					      u64 blob_size);
+int nvidia_vgpu_metadata_clean_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					 u64 blob_size);
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c b/drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c
new file mode 100644
index 000000000000..013fbc90c6de
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/metadata_vgpu_type.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include "debug.h"
+#include "vgpu_mgr.h"
+#include "metadata.h"
+
+#include <nvrm/gsp.h>
+
+/**
+ * nvidia_vgpu_metadata_check_vgpu_type - check vGPU type blobs
+ * @vgpu_mgr: the vGPU manager
+ * @blob: the blob header
+ * @blob_size: the blob size
+ *
+ * Returns: zero on success, others on errors.
+ */
+int nvidia_vgpu_metadata_check_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr,
+					 void *blob, u64 blob_size)
+{
+	struct vgpu_type_blob_hdr *hdr = blob;
+	u64 size;
+
+	vgpu_mgr_debug(vgpu_mgr, "check vgpu type blob for device 0x%llx\n", hdr->device_id);
+
+	if (!hdr->device_id || !hdr->num_kernel_structs || !hdr->kernel_struct_size ||
+	    !hdr->gsp_rmctrl_cmd || !hdr->gsp_rmctrl_size) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: vgpu type blob header is invalid\n");
+		return -EINVAL;
+	}
+
+	size = sizeof(struct metadata_blob_hdr);
+	size += sizeof(*hdr);
+	size += hdr->kernel_struct_size;
+	size += hdr->gsp_rmctrl_size;
+
+	if (size != blob_size) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: vgpu type blob size mismatch\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int send_gsp_rmctrl(struct nvidia_vgpu_mgr *vgpu_mgr, struct vgpu_type_blob_hdr *hdr)
+{
+	void *ctrl;
+	int ret;
+
+	vgpu_mgr_debug(vgpu_mgr, "send rmctrl cmd 0x%llx size 0x%llx\n", hdr->gsp_rmctrl_cmd,
+		       hdr->gsp_rmctrl_size);
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_get(vgpu_mgr, &vgpu_mgr->gsp_client,
+					   hdr->gsp_rmctrl_cmd, hdr->gsp_rmctrl_size);
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	memcpy(ctrl, hdr->data + hdr->kernel_struct_size, hdr->gsp_rmctrl_size);
+
+	ret = nvidia_vgpu_mgr_rm_ctrl_wr(vgpu_mgr, &vgpu_mgr->gsp_client, ctrl);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+/**
+ * nvidia_vgpu_metadata_setup_vgpu_type - setup vGPU type blob
+ * @vgpu_mgr: the vGPU manager
+ * @blob: the blob header
+ * @blob_size: the blob size
+ *
+ * Returns: zero on success, others on errors.
+ */
+int nvidia_vgpu_metadata_setup_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					 u64 blob_size)
+{
+	struct vgpu_type_blob_hdr *hdr = blob;
+	u64 size, copy_size;
+	int ret;
+	void *p;
+	int i;
+
+	/* Not for this device, skip */
+	if (hdr->device_id != vgpu_mgr->handle.pf_pdev->device)
+		return 0;
+
+	vgpu_mgr_debug(vgpu_mgr, "setup vgpu type blob for device 0x%llx\n", hdr->device_id);
+
+	vgpu_mgr->vgpu_types = kvrealloc(vgpu_mgr->vgpu_types, hdr->kernel_struct_size, GFP_KERNEL);
+	if (!vgpu_mgr->vgpu_types)
+		return -ENOMEM;
+
+	ret = send_gsp_rmctrl(vgpu_mgr, hdr);
+	if (ret) {
+		kvfree(vgpu_mgr->vgpu_types);
+		vgpu_mgr->vgpu_types = NULL;
+		return ret;
+	}
+
+	size = hdr->kernel_struct_size / hdr->num_kernel_structs;
+	copy_size = min(size, sizeof(struct nvidia_vgpu_type));
+	p = hdr->data;
+
+	for (i = 0; i < hdr->num_kernel_structs; i++, p += size) {
+		memcpy(vgpu_mgr->vgpu_types + i, p, copy_size);
+
+		vgpu_mgr_debug(vgpu_mgr, "setup vgpu type %u %s for device 0x%llx\n",
+			       vgpu_mgr->vgpu_types[i].vgpu_type,
+			       vgpu_mgr->vgpu_types[i].vgpu_type_name, hdr->device_id);
+	}
+
+	vgpu_mgr->num_vgpu_types = hdr->num_kernel_structs;
+	return 0;
+}
+
+/**
+ * nvidia_vgpu_metadata_post_setup_vgpu_type - vGPU type post setup
+ *
+ * @vgpu_mgr: the vGPU manager
+ * @blob: the blob header
+ * @blob_size: the blob size
+ *
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_metadata_post_setup_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					      u64 blob_size)
+{
+	if (WARN_ON(!vgpu_mgr->vgpu_types || !vgpu_mgr->num_vgpu_types)) {
+		vgpu_mgr_error(vgpu_mgr, "metadata: no available vgpu type blob\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/**
+ * nvidia_vgpu_metadata_clean_vgpu_type - clean vGPU type
+ *
+ * @vgpu_mgr: the vGPU manager
+ * @blob: the blob header
+ * @blob_size: the blob size
+ *
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_metadata_clean_vgpu_type(struct nvidia_vgpu_mgr *vgpu_mgr, void *blob,
+					 u64 blob_size)
+{
+	kvfree(vgpu_mgr->vgpu_types);
+	vgpu_mgr->vgpu_types = NULL;
+	vgpu_mgr->num_vgpu_types = 0;
+	return 0;
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index 044bc3aef5a6..19f0aca56d12 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -73,4 +73,16 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 #define nvidia_vgpu_mgr_get_gsp_client_handle(m, c) \
 	((m)->handle.ops->get_gsp_client_handle(c))
 
+#define nvidia_vgpu_mgr_rm_ctrl_get(m, g, c, s) \
+	((m)->handle.ops->rm_ctrl_get(g, c, s))
+
+#define nvidia_vgpu_mgr_rm_ctrl_wr(m, g, c) \
+	((m)->handle.ops->rm_ctrl_wr(g, c))
+
+#define nvidia_vgpu_mgr_rm_ctrl_rd(m, g, c, s) \
+	((m)->handle.ops->rm_ctrl_rd(g, c, s))
+
+#define nvidia_vgpu_mgr_rm_ctrl_done(m, g, c) \
+	((m)->handle.ops->rm_ctrl_done(g, c))
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 79e6a9f16f74..cbb51b939f0b 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -75,7 +75,7 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	struct nvidia_vgpu_info *info = &vgpu->info;
 	int ret;
 
-	if (WARN_ON(!info->gfid || !info->dbdf))
+	if (WARN_ON(!info->gfid || !info->dbdf || !info->vgpu_type))
 		return -EINVAL;
 
 	if (WARN_ON(!vgpu->vgpu_mgr || !vgpu->pdev))
@@ -86,7 +86,8 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 
 	vgpu->info = *info;
 
-	vgpu_debug(vgpu, "create vgpu on vgpu_mgr %px\n", vgpu->vgpu_mgr);
+	vgpu_debug(vgpu, "create vgpu %s on vgpu_mgr %px\n",
+		   info->vgpu_type->vgpu_type_name, vgpu->vgpu_mgr);
 
 	ret = register_vgpu(vgpu);
 	if (ret)
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index 1455ca51eca1..a7f8a00f96bf 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -16,6 +16,7 @@ static void vgpu_mgr_release(struct kref *kref)
 	if (WARN_ON(atomic_read(&vgpu_mgr->num_vgpus)))
 		return;
 
+	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
 	kvfree(vgpu_mgr);
 }
@@ -150,6 +151,10 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 	if (ret)
 		goto fail_init_vgpu_mgr;
 
+	ret = nvidia_vgpu_mgr_setup_metadata(vgpu_mgr);
+	if (ret)
+		goto fail_setup_metadata;
+
 	attach_vgpu_mgr(vgpu_mgr, handle_data);
 
 	ret = attach_data->init_vfio_fn(vgpu_mgr, attach_data->init_vfio_fn_data);
@@ -162,6 +167,8 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 
 fail_init_fn:
 	detach_vgpu_mgr(handle_data);
+	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
+fail_setup_metadata:
 fail_init_vgpu_mgr:
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
 fail_alloc_gsp_client:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 98dcbb682b92..0519b595378f 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -7,6 +7,21 @@
 
 #include "pf.h"
 
+#define NVIDIA_VGPU_TYPE_NAME_MAX 32
+
+struct nvidia_vgpu_type {
+	u32 vgpu_type;
+	char vgpu_type_name[NVIDIA_VGPU_TYPE_NAME_MAX];
+	u64 vdev_id;
+	u64 pdev_id;
+	u64 fb_length;
+	u64 gsp_heap_size;
+	u64 bar1_length;
+	u32 max_instance;
+	u32 ecc_supported;
+	u64 fb_reservation;
+};
+
 /**
  * struct nvidia_vgpu_info - vGPU information
  *
@@ -18,6 +33,7 @@ struct nvidia_vgpu_info {
 	int id;
 	u32 gfid;
 	u32 dbdf;
+	struct nvidia_vgpu_type *vgpu_type;
 };
 
 /**
@@ -48,10 +64,14 @@ struct nvidia_vgpu {
  * @handle: the driver handle
  * @total_avail_chids: total available channel IDs
  * @total_fbmem_size: total FB memory size
+ * @vgpu_major: vGPU major version
+ * @vgpu_minor: vGPU minor version
  * @vgpu_list_lock: lock to protect vGPU list
  * @vgpu_list_head: list head of vGPU list
  * @num_vgpus: number of vGPUs in the vGPU list
  * @gsp_client: the GSP client
+ * @vgpu_types: installed vGPU types
+ * @num_vgpu_types: number of installed vGPU types
  */
 struct nvidia_vgpu_mgr {
 	struct kref refcount;
@@ -61,12 +81,17 @@ struct nvidia_vgpu_mgr {
 	u32 total_avail_chids;
 	u64 total_fbmem_size;
 
+	u64 vgpu_major;
+	u64 vgpu_minor;
+
 	/* lock for vGPU list */
 	struct mutex vgpu_list_lock;
 	struct list_head vgpu_list_head;
 	atomic_t num_vgpus;
 
 	struct nvidia_vgpu_gsp_client gsp_client;
+	struct nvidia_vgpu_type *vgpu_types;
+	unsigned int num_vgpu_types;
 };
 
 #define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
@@ -78,5 +103,7 @@ void nvidia_vgpu_mgr_release(struct nvidia_vgpu_mgr *vgpu_mgr);
 
 int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu);
+int nvidia_vgpu_mgr_setup_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
+void nvidia_vgpu_mgr_clean_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-03 22:11 ` [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading Zhi Wang
@ 2025-09-04  9:37   ` Danilo Krummrich
  2025-09-04  9:41     ` Danilo Krummrich
  0 siblings, 1 reply; 23+ messages in thread
From: Danilo Krummrich @ 2025-09-04  9:37 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, alex.williamson, kevin.tian, jgg, airlied, daniel, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiwang

On Thu Sep 4, 2025 at 12:11 AM CEST, Zhi Wang wrote:
> diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
> new file mode 100644
> index 000000000000..c3fb7b299533
> --- /dev/null
> +++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: MIT */
> +#ifndef __NVRM_GSP_H__
> +#define __NVRM_GSP_H__
> +
> +#include <nvrm/nvtypes.h>
> +
> +/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570 */
> +
> +#define NV2080_CTRL_CMD_GSP_GET_FEATURES (0x20803601)
> +
> +typedef struct NV2080_CTRL_GSP_GET_FEATURES_PARAMS {
> +	NvU32  gspFeatures;
> +	NvBool bValid;
> +	NvBool bDefaultGspRmGpu;
> +	NvU8   firmwareVersion[GSP_MAX_BUILD_VERSION_LENGTH];
> +} NV2080_CTRL_GSP_GET_FEATURES_PARAMS;
> +
> +#endif

<snip>

> +static struct version supported_version_list[] = {
> +	{ 18, 1, "570.144" },
> +};

nova-core won't provide any firmware specific APIs, it is meant to serve as a
hardware and firmware abstraction layer for higher level drivers, such as vGPU
or nova-drm.

As a general rule the interface between nova-core and higher level drivers must
not leak any hardware or firmware specific details, but work on a higher level
abstraction layer.

Now, I recognize that at some point it might be necessary to do some kind of
versioning in this API anyways. For instance, when the semantics of the firmware
API changes too significantly.

However, this would be a separte API where nova-core, at the initial handshake,
then asks clients to use e.g. v2 of the nova-core API, still hiding any firmware
and hardware details from the client.

Some more general notes, since I also had a look at the nova-core <-> vGPU
interface patches in your tree (even though I'm aware that they're not part of
the RFC of course):

The interface for the general lifecycle management for any clients attaching to
nova-core (VGPU, nova-drm) should be common and not specific to vGPU. (The same
goes for interfaces that will be used by vGPU and nova-drm.)

The interface nova-core provides for that should be designed in Rust, so we can
take advantage of all the features the type system provides us with connecting
to Rust clients (nova-drm).

For vGPU, we can then monomorphize those types into the corresponding C
structures and provide the corresponding functions very easily.

Doing it the other way around would be a very bad idea, since the Rust type
system is much more powerful and hence it'd be very hard to avoid introducing
limitations on the Rust side of things.

Hence, I recommend to start with some patches defining the API in nova-core for
the general lifecycle (in Rust), so we can take it from there.

Another note: I don't see any use of the auxiliary bus in vGPU, any clients
should attach via the auxiliary bus API, it provides proper matching where
there's more than on compatible GPU in the system. nova-core already registers
an auxiliary device for each bound PCI device.

Please don't re-implement what the auxiliary bus already does for us.

- Danilo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04  9:37   ` Danilo Krummrich
@ 2025-09-04  9:41     ` Danilo Krummrich
  2025-09-04 12:15       ` Jason Gunthorpe
  2025-09-04 15:43       ` Zhi Wang
  0 siblings, 2 replies; 23+ messages in thread
From: Danilo Krummrich @ 2025-09-04  9:41 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, alex.williamson, kevin.tian, jgg, airlied, daniel, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiwang,
	acourbot, joelagnelf, apopple, jhubbard, nouveau

(Cc: Alex, John, Joel, Alistair, nouveau)

On Thu Sep 4, 2025 at 11:37 AM CEST, Danilo Krummrich wrote:
> On Thu Sep 4, 2025 at 12:11 AM CEST, Zhi Wang wrote:
>> diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
>> new file mode 100644
>> index 000000000000..c3fb7b299533
>> --- /dev/null
>> +++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
>> @@ -0,0 +1,18 @@
>> +/* SPDX-License-Identifier: MIT */
>> +#ifndef __NVRM_GSP_H__
>> +#define __NVRM_GSP_H__
>> +
>> +#include <nvrm/nvtypes.h>
>> +
>> +/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570 */
>> +
>> +#define NV2080_CTRL_CMD_GSP_GET_FEATURES (0x20803601)
>> +
>> +typedef struct NV2080_CTRL_GSP_GET_FEATURES_PARAMS {
>> +	NvU32  gspFeatures;
>> +	NvBool bValid;
>> +	NvBool bDefaultGspRmGpu;
>> +	NvU8   firmwareVersion[GSP_MAX_BUILD_VERSION_LENGTH];
>> +} NV2080_CTRL_GSP_GET_FEATURES_PARAMS;
>> +
>> +#endif
>
> <snip>
>
>> +static struct version supported_version_list[] = {
>> +	{ 18, 1, "570.144" },
>> +};
>
> nova-core won't provide any firmware specific APIs, it is meant to serve as a
> hardware and firmware abstraction layer for higher level drivers, such as vGPU
> or nova-drm.
>
> As a general rule the interface between nova-core and higher level drivers must
> not leak any hardware or firmware specific details, but work on a higher level
> abstraction layer.
>
> Now, I recognize that at some point it might be necessary to do some kind of
> versioning in this API anyways. For instance, when the semantics of the firmware
> API changes too significantly.
>
> However, this would be a separte API where nova-core, at the initial handshake,
> then asks clients to use e.g. v2 of the nova-core API, still hiding any firmware
> and hardware details from the client.
>
> Some more general notes, since I also had a look at the nova-core <-> vGPU
> interface patches in your tree (even though I'm aware that they're not part of
> the RFC of course):
>
> The interface for the general lifecycle management for any clients attaching to
> nova-core (VGPU, nova-drm) should be common and not specific to vGPU. (The same
> goes for interfaces that will be used by vGPU and nova-drm.)
>
> The interface nova-core provides for that should be designed in Rust, so we can
> take advantage of all the features the type system provides us with connecting
> to Rust clients (nova-drm).
>
> For vGPU, we can then monomorphize those types into the corresponding C
> structures and provide the corresponding functions very easily.
>
> Doing it the other way around would be a very bad idea, since the Rust type
> system is much more powerful and hence it'd be very hard to avoid introducing
> limitations on the Rust side of things.
>
> Hence, I recommend to start with some patches defining the API in nova-core for
> the general lifecycle (in Rust), so we can take it from there.
>
> Another note: I don't see any use of the auxiliary bus in vGPU, any clients
> should attach via the auxiliary bus API, it provides proper matching where
> there's more than on compatible GPU in the system. nova-core already registers
> an auxiliary device for each bound PCI device.
>
> Please don't re-implement what the auxiliary bus already does for us.
>
> - Danilo


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04  9:41     ` Danilo Krummrich
@ 2025-09-04 12:15       ` Jason Gunthorpe
  2025-09-04 12:45         ` Danilo Krummrich
  2025-09-04 15:43       ` Zhi Wang
  1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2025-09-04 12:15 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Zhi Wang, kvm, alex.williamson, kevin.tian, airlied, daniel,
	acurrid, cjia, smitra, ankita, aniketa, kwankhede, targupta,
	zhiwang, acourbot, joelagnelf, apopple, jhubbard, nouveau

On Thu, Sep 04, 2025 at 11:41:03AM +0200, Danilo Krummrich wrote:

> > Another note: I don't see any use of the auxiliary bus in vGPU, any clients
> > should attach via the auxiliary bus API, it provides proper matching where
> > there's more than on compatible GPU in the system. nova-core already registers
> > an auxiliary device for each bound PCI device.

The driver here attaches to the SRIOV VF pci_device, it should obtain the
nova-core handle of the PF device through pci_iov_get_pf_drvdata().

This is the expected design of VFIO drivers because the driver core
does not support a single driver binding to two devices (aux and VF)
today.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04 12:15       ` Jason Gunthorpe
@ 2025-09-04 12:45         ` Danilo Krummrich
  2025-09-04 13:58           ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Danilo Krummrich @ 2025-09-04 12:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Zhi Wang, kvm, alex.williamson, kevin.tian, airlied, daniel,
	acurrid, cjia, smitra, ankita, aniketa, kwankhede, targupta,
	zhiwang, acourbot, joelagnelf, apopple, jhubbard, nouveau

On Thu Sep 4, 2025 at 2:15 PM CEST, Jason Gunthorpe wrote:
> On Thu, Sep 04, 2025 at 11:41:03AM +0200, Danilo Krummrich wrote:
>
>> > Another note: I don't see any use of the auxiliary bus in vGPU, any clients
>> > should attach via the auxiliary bus API, it provides proper matching where
>> > there's more than on compatible GPU in the system. nova-core already registers
>> > an auxiliary device for each bound PCI device.
>
> The driver here attaches to the SRIOV VF pci_device, it should obtain the
> nova-core handle of the PF device through pci_iov_get_pf_drvdata().
>
> This is the expected design of VFIO drivers because the driver core
> does not support a single driver binding to two devices (aux and VF)
> today.

Yeah, that's for the VF PCI devices, but I thought vGPU will also have some kind
of "control instance" for each physical device through which it can control the
creation of VFs?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04 12:45         ` Danilo Krummrich
@ 2025-09-04 13:58           ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-09-04 13:58 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Zhi Wang, kvm, alex.williamson, kevin.tian, airlied, daniel,
	acurrid, cjia, smitra, ankita, aniketa, kwankhede, targupta,
	zhiwang, acourbot, joelagnelf, apopple, jhubbard, nouveau

On Thu, Sep 04, 2025 at 02:45:34PM +0200, Danilo Krummrich wrote:
> On Thu Sep 4, 2025 at 2:15 PM CEST, Jason Gunthorpe wrote:
> > On Thu, Sep 04, 2025 at 11:41:03AM +0200, Danilo Krummrich wrote:
> >
> >> > Another note: I don't see any use of the auxiliary bus in vGPU, any clients
> >> > should attach via the auxiliary bus API, it provides proper matching where
> >> > there's more than on compatible GPU in the system. nova-core already registers
> >> > an auxiliary device for each bound PCI device.
> >
> > The driver here attaches to the SRIOV VF pci_device, it should obtain the
> > nova-core handle of the PF device through pci_iov_get_pf_drvdata().
> >
> > This is the expected design of VFIO drivers because the driver core
> > does not support a single driver binding to two devices (aux and VF)
> > today.
> 
> Yeah, that's for the VF PCI devices, but I thought vGPU will also have some kind
> of "control instance" for each physical device through which it can control the
> creation of VFs?

I recall there is something on the PF that is independent of the VFs,
but it is hard to stick that in an aux device, it will make the the
lifetime model very difficult since aux devices can become unbound at
any time while the vf is using it. It is alot easier to be part of
the PF driver somehow..

Then userspace activities like provisioning VFs I hope we will see
that done through fwctl as the networking drivers are doing. I see
this series is using request_firmware() to get VF profiles and some
sysfs, which seems inconsisent with any other VF provisioning scheme
in the kernel.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04  9:41     ` Danilo Krummrich
  2025-09-04 12:15       ` Jason Gunthorpe
@ 2025-09-04 15:43       ` Zhi Wang
  2025-09-06 10:34         ` Danilo Krummrich
  1 sibling, 1 reply; 23+ messages in thread
From: Zhi Wang @ 2025-09-04 15:43 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: kvm, alex.williamson, kevin.tian, jgg, airlied, daniel, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiwang,
	acourbot, joelagnelf, apopple, jhubbard, nouveau

On Thu, 04 Sep 2025 11:41:03 +0200
"Danilo Krummrich" <dakr@kernel.org> wrote:

> (Cc: Alex, John, Joel, Alistair, nouveau)
> 
> On Thu Sep 4, 2025 at 11:37 AM CEST, Danilo Krummrich wrote:
> > On Thu Sep 4, 2025 at 12:11 AM CEST, Zhi Wang wrote:
> >> diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
> >> new file mode 100644
> >> index 000000000000..c3fb7b299533
> >> --- /dev/null
> >> +++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/gsp.h
> >> @@ -0,0 +1,18 @@
> >> +/* SPDX-License-Identifier: MIT */
> >> +#ifndef __NVRM_GSP_H__
> >> +#define __NVRM_GSP_H__
> >> +
> >> +#include <nvrm/nvtypes.h>
> >> +
> >> +/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570 */
> >> +
> >> +#define NV2080_CTRL_CMD_GSP_GET_FEATURES (0x20803601)
> >> +
> >> +typedef struct NV2080_CTRL_GSP_GET_FEATURES_PARAMS {
> >> +	NvU32  gspFeatures;
> >> +	NvBool bValid;
> >> +	NvBool bDefaultGspRmGpu;
> >> +	NvU8   firmwareVersion[GSP_MAX_BUILD_VERSION_LENGTH];
> >> +} NV2080_CTRL_GSP_GET_FEATURES_PARAMS;
> >> +
> >> +#endif
> >
> > <snip>
> >

The RFC v2 is based on the same architecture of RFC V1 but switching the
core driver from NVKM to nova-core. Yet the new architecture and auxiliary
bus is WIP. So it doesn't represent the final picture, e.g. the rust code
I wrote in the nova-core. The main idea is to demonstrate the progress
of the vGPU development.

> >> +static struct version supported_version_list[] = {
> >> +	{ 18, 1, "570.144" },
> >> +};
> >
> > nova-core won't provide any firmware specific APIs, it is meant to serve as a
> > hardware and firmware abstraction layer for higher level drivers, such as vGPU
> > or nova-drm.
> >
> > As a general rule the interface between nova-core and higher level drivers must
> > not leak any hardware or firmware specific details, but work on a higher level
> > abstraction layer.
> >

It is more a matter of where we are going to place vGPU specific
functionality in the whole picture. In this case, if we are thinking about
the requirement of vGPU type loading, which requires the GSP version
number and checking. Are we leaning towards putting some vGPU specific
functionality also in nova-core?

Regarding not leaking any of the hardware details, is that doable? 
Looking at {nv04 * _fence}.c {chan*}.c in the current NVIF interfaces, I
think we will expose the HW concept somehow.

> > Now, I recognize that at some point it might be necessary to do some kind of
> > versioning in this API anyways. For instance, when the semantics of the firmware
> > API changes too significantly.
> >
> > However, this would be a separte API where nova-core, at the initial handshake,
> > then asks clients to use e.g. v2 of the nova-core API, still hiding any firmware
> > and hardware details from the client.
> >
> > Some more general notes, since I also had a look at the nova-core <-> vGPU
> > interface patches in your tree (even though I'm aware that they're not part of
> > the RFC of course):
> >
> > The interface for the general lifecycle management for any clients attaching to
> > nova-core (VGPU, nova-drm) should be common and not specific to vGPU. (The same
> > goes for interfaces that will be used by vGPU and nova-drm.)
> >
> > The interface nova-core provides for that should be designed in Rust, so we can
> > take advantage of all the features the type system provides us with connecting
> > to Rust clients (nova-drm).
> >
> > For vGPU, we can then monomorphize those types into the corresponding C
> > structures and provide the corresponding functions very easily.
> >
> > Doing it the other way around would be a very bad idea, since the Rust type
> > system is much more powerful and hence it'd be very hard to avoid introducing
> > limitations on the Rust side of things.
> >
> > Hence, I recommend to start with some patches defining the API in nova-core for
> > the general lifecycle (in Rust), so we can take it from there.
> >
> > Another note: I don't see any use of the auxiliary bus in vGPU, any clients
> > should attach via the auxiliary bus API, it provides proper matching where
> > there's more than on compatible GPU in the system. nova-core already registers
> > an auxiliary device for each bound PCI device.
> >
> > Please don't re-implement what the auxiliary bus already does for us.
> >
> > - Danilo
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading
  2025-09-04 15:43       ` Zhi Wang
@ 2025-09-06 10:34         ` Danilo Krummrich
  0 siblings, 0 replies; 23+ messages in thread
From: Danilo Krummrich @ 2025-09-06 10:34 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, alex.williamson, kevin.tian, jgg, airlied, daniel, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiwang,
	acourbot, joelagnelf, apopple, jhubbard, nouveau

On Thu Sep 4, 2025 at 5:43 PM CEST, Zhi Wang wrote:
> On Thu, 04 Sep 2025 11:41:03 +0200
> "Danilo Krummrich" <dakr@kernel.org> wrote:
>
>> (Cc: Alex, John, Joel, Alistair, nouveau)
>> 
>> On Thu Sep 4, 2025 at 11:37 AM CEST, Danilo Krummrich wrote:
>> > nova-core won't provide any firmware specific APIs, it is meant to serve as a
>> > hardware and firmware abstraction layer for higher level drivers, such as vGPU
>> > or nova-drm.
>> >
>> > As a general rule the interface between nova-core and higher level drivers must
>> > not leak any hardware or firmware specific details, but work on a higher level
>> > abstraction layer.
>> >
>
> It is more a matter of where we are going to place vGPU specific
> functionality in the whole picture. In this case, if we are thinking about
> the requirement of vGPU type loading, which requires the GSP version
> number and checking. Are we leaning towards putting some vGPU specific
> functionality also in nova-core?

As much as needed to abstract firmware (and hardware) API details.

> Regarding not leaking any of the hardware details, is that doable? 
> Looking at {nv04 * _fence}.c {chan*}.c in the current NVIF interfaces, I
> think we will expose the HW concept somehow.

I don't really mean that vGPU must be entirely unaware of the hardware, it's
still a driver of course. But for the API between nova-core and client drivers
we want to abstract how the firmware and hardware is programmed, i.e. not leak
any (version specific) RM structures or provide APIs that consume raw register
values to write, etc.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC v2 04/14] vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (2 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 05/14] vfio/nvidia-vgpu: allocate vGPU FB memory " Zhi Wang
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

Creating a vGPU requires allocating a portion of the channels from the
reserved channel pool.

Allocate the channels from the reserved channel pool when creating a vGPU.

Cc: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/pf.h       | 10 ++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 76 +++++++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c | 33 ++++++++++-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h | 21 +++++++
 4 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index 19f0aca56d12..b8008d8ee434 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -85,4 +85,14 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 #define nvidia_vgpu_mgr_rm_ctrl_done(m, g, c) \
 	((m)->handle.ops->rm_ctrl_done(g, c))
 
+#define nvidia_vgpu_mgr_alloc_chids(m, o, s) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->alloc_chids(__m->handle.pf_drvdata, o, s); \
+})
+
+#define nvidia_vgpu_mgr_free_chids(m, o, s) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->free_chids(__m->handle.pf_drvdata, o, s); \
+})
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index cbb51b939f0b..52b946469043 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -3,6 +3,8 @@
  * Copyright © 2025 NVIDIA Corporation
  */
 
+#include <linux/log2.h>
+
 #include "debug.h"
 #include "vgpu_mgr.h"
 
@@ -43,6 +45,70 @@ static int register_vgpu(struct nvidia_vgpu *vgpu)
 	return 0;
 }
 
+static void clean_chids(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_chid *chid = &vgpu->chid;
+
+	vgpu_debug(vgpu, "free guest channel offset %d size %d\n", chid->chid_offset,
+		   chid->num_chid);
+
+	if (vgpu_mgr->use_chid_alloc_bitmap)
+		bitmap_clear(vgpu_mgr->chid_alloc_bitmap, chid->chid_offset, chid->num_chid);
+	else
+		nvidia_vgpu_mgr_free_chids(vgpu_mgr, chid->chid_offset, chid->num_chid);
+}
+
+static inline u32 prev_pow2(const u32 x)
+{
+	return x ? 1U << ilog2(x) : 0;
+}
+
+static void get_alloc_chids_num(struct nvidia_vgpu *vgpu, u32 *size)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_info *info = &vgpu->info;
+	struct nvidia_vgpu_type *type = info->vgpu_type;
+	u32 v;
+
+	/* Calculate with total reserved CHIDs for vGPUs. */
+	v = (vgpu_mgr->total_avail_chids) / type->max_instance;
+	*size = prev_pow2(v);
+}
+
+static int setup_chids(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_chid *chid = &vgpu->chid;
+	u32 size, offset;
+	int ret;
+
+	get_alloc_chids_num(vgpu, &size);
+
+	if (vgpu_mgr->use_chid_alloc_bitmap) {
+		offset = bitmap_find_next_zero_area(vgpu_mgr->chid_alloc_bitmap,
+						    vgpu_mgr->total_avail_chids, 0, size, 0);
+
+		if (offset + size > vgpu_mgr->total_avail_chids)
+			return -ENOSPC;
+
+		bitmap_set(vgpu_mgr->chid_alloc_bitmap, offset, size);
+	} else {
+		ret = nvidia_vgpu_mgr_alloc_chids(vgpu_mgr, &offset, size);
+		if (ret)
+			return ret;
+	}
+
+	chid->chid_offset = offset;
+	chid->num_chid = size;
+	chid->num_plugin_channels = 1;
+
+	vgpu_debug(vgpu, "alloc guest channel offset %u size %u\n", chid->chid_offset,
+		   chid->num_chid);
+
+	return 0;
+}
+
 /**
  * nvidia_vgpu_mgr_destroy_vgpu - destroy a vGPU instance
  * @vgpu: the vGPU instance going to be destroyed.
@@ -54,6 +120,7 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	clean_chids(vgpu);
 	unregister_vgpu(vgpu);
 
 	vgpu_debug(vgpu, "destroyed\n");
@@ -93,10 +160,19 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		return ret;
 
+	ret = setup_chids(vgpu);
+	if (ret)
+		goto err_setup_chids;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
+
+err_setup_chids:
+	unregister_vgpu(vgpu);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_create_vgpu);
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index a7f8a00f96bf..8565bb881fda 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -6,6 +6,14 @@
 #include "debug.h"
 #include "vgpu_mgr.h"
 
+static void clean_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	if (vgpu_mgr->use_chid_alloc_bitmap) {
+		bitmap_free(vgpu_mgr->chid_alloc_bitmap);
+		vgpu_mgr->chid_alloc_bitmap = NULL;
+	}
+}
+
 static void vgpu_mgr_release(struct kref *kref)
 {
 	struct nvidia_vgpu_mgr *vgpu_mgr =
@@ -17,6 +25,7 @@ static void vgpu_mgr_release(struct kref *kref)
 		return;
 
 	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
+	clean_vgpu_mgr(vgpu_mgr);
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
 	kvfree(vgpu_mgr);
 }
@@ -95,6 +104,20 @@ static void attach_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr,
 	handle_data->vfio.pf_detach_handle_fn = pf_detach_handle_fn;
 }
 
+static int setup_chid_alloc_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	if (WARN_ON(!vgpu_mgr->use_chid_alloc_bitmap))
+		return 0;
+
+	vgpu_mgr->chid_alloc_bitmap = bitmap_alloc(vgpu_mgr->total_avail_chids, GFP_KERNEL);
+	if (!vgpu_mgr->chid_alloc_bitmap)
+		return -ENOMEM;
+	bitmap_zero(vgpu_mgr->chid_alloc_bitmap, vgpu_mgr->total_avail_chids);
+
+	vgpu_mgr_debug(vgpu_mgr, "using chid allocation bitmap.\n");
+	return 0;
+}
+
 static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	vgpu_mgr->total_avail_chids = nvidia_vgpu_mgr_get_avail_chids(vgpu_mgr);
@@ -103,12 +126,17 @@ static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 	vgpu_mgr_debug(vgpu_mgr, "total avail chids %u\n", vgpu_mgr->total_avail_chids);
 	vgpu_mgr_debug(vgpu_mgr, "total fbmem size 0x%llx\n", vgpu_mgr->total_fbmem_size);
 
-	return 0;
+	return vgpu_mgr->use_chid_alloc_bitmap ? setup_chid_alloc_bitmap(vgpu_mgr) : 0;
 }
 
 static int setup_pf_driver_caps(struct nvidia_vgpu_mgr *vgpu_mgr, unsigned long *caps)
 {
-	/* more to come */
+#define HAS_CAP(cap) \
+	test_bit(NVIDIA_VGPU_PF_DRIVER_CAP_HAS_##cap, caps)
+
+	vgpu_mgr->use_chid_alloc_bitmap = !HAS_CAP(CHID_ALLOC);
+
+#undef HAS_CAP
 	return 0;
 }
 
@@ -169,6 +197,7 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 	detach_vgpu_mgr(handle_data);
 	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
 fail_setup_metadata:
+	clean_vgpu_mgr(vgpu_mgr);
 fail_init_vgpu_mgr:
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
 fail_alloc_gsp_client:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 0519b595378f..5a7a6103a677 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -36,6 +36,19 @@ struct nvidia_vgpu_info {
 	struct nvidia_vgpu_type *vgpu_type;
 };
 
+/**
+ * struct nvidia_vgpu_chid - per-vGPU channel IDs
+ *
+ * @chid_offset: beginning offset of channel IDs
+ * @num_chid: number of allocated channel IDs
+ * @num_plugin_channels: number of channels for vGPU manager
+ */
+struct nvidia_vgpu_chid {
+	u32 chid_offset;
+	u32 num_chid;
+	u32 num_plugin_channels;
+};
+
 /**
  * struct nvidia_vgpu - per-vGPU state
  *
@@ -45,6 +58,7 @@ struct nvidia_vgpu_info {
  * @vgpu_list: list node to the vGPU list
  * @info: vGPU info
  * @vgpu_mgr: pointer to vGPU manager
+ * @chid: vGPU channel IDs
  */
 struct nvidia_vgpu {
 	/* Per-vGPU lock */
@@ -55,6 +69,8 @@ struct nvidia_vgpu {
 
 	struct nvidia_vgpu_info info;
 	struct nvidia_vgpu_mgr *vgpu_mgr;
+
+	struct nvidia_vgpu_chid chid;
 };
 
 /**
@@ -72,6 +88,8 @@ struct nvidia_vgpu {
  * @gsp_client: the GSP client
  * @vgpu_types: installed vGPU types
  * @num_vgpu_types: number of installed vGPU types
+ * @use_alloc_bitmap: use chid allocator for the PF driver doesn't support chid allocation
+ * @chid_alloc_bitmap: chid allocator bitmap
  */
 struct nvidia_vgpu_mgr {
 	struct kref refcount;
@@ -92,6 +110,9 @@ struct nvidia_vgpu_mgr {
 	struct nvidia_vgpu_gsp_client gsp_client;
 	struct nvidia_vgpu_type *vgpu_types;
 	unsigned int num_vgpu_types;
+
+	bool use_chid_alloc_bitmap;
+	void *chid_alloc_bitmap;
 };
 
 #define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 05/14] vfio/nvidia-vgpu: allocate vGPU FB memory when creating vGPUs
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (3 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 04/14] vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 06/14] vfio/nvidia-vgpu: allocate mgmt heap " Zhi Wang
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

Creating a vGPU requires allocating a portion of the FB memory from the
NVKM. The size of the FB memory that a vGPU requires is from the vGPU
type.

Acquire the size of the required FB memory from the vGPU type. Allocate
the FB memory from NVKM when creating a vGPU.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/debug.h          |  5 ++
 .../vfio/pci/nvidia-vgpu/include/nvrm/ecc.h   | 45 ++++++++++++
 .../vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h  | 39 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/pf.h             |  8 +++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           | 70 +++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       | 56 ++++++++++++++-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  8 +++
 7 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h

diff --git a/drivers/vfio/pci/nvidia-vgpu/debug.h b/drivers/vfio/pci/nvidia-vgpu/debug.h
index 7cf92c9060ae..db9288752384 100644
--- a/drivers/vfio/pci/nvidia-vgpu/debug.h
+++ b/drivers/vfio/pci/nvidia-vgpu/debug.h
@@ -17,4 +17,9 @@
 	pci_dbg(__v->pdev, "nvidia-vgpu %d: "f, __v->info.id, ##a); \
 })
 
+#define vgpu_error(v, f, a...) ({ \
+	typeof(v) __v = (v); \
+	pci_err(__v->pdev, "nvidia-vgpu %d: "f, __v->info.id, ##a); \
+})
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h
new file mode 100644
index 000000000000..d2a8316a0f12
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/ecc.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+#ifndef __NVRM_ECC_H__
+#define __NVRM_ECC_H__
+
+#include <nvrm/nvtypes.h>
+
+/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570.124.04 */
+
+typedef struct NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS {
+    NV_DECLARE_ALIGNED(NvU64 count, 8);
+} NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS;
+
+typedef struct NV2080_CTRL_GPU_QUERY_ECC_UNIT_STATUS {
+    NvBool enabled;
+    NvBool scrubComplete;
+    NvBool supported;
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS dbe, 8);
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS dbeNonResettable, 8);
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS sbe, 8);
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GPU_QUERY_ECC_EXCEPTION_STATUS sbeNonResettable, 8);
+} NV2080_CTRL_GPU_QUERY_ECC_UNIT_STATUS;
+
+typedef struct NV0080_CTRL_GR_ROUTE_INFO {
+    NvU32 flags;
+    NV_DECLARE_ALIGNED(NvU64 route, 8);
+} NV0080_CTRL_GR_ROUTE_INFO;
+
+typedef NV0080_CTRL_GR_ROUTE_INFO NV2080_CTRL_GR_ROUTE_INFO;
+
+#define NV2080_CTRL_GPU_ECC_UNIT_COUNT (0x00000024U)
+
+#define NV2080_CTRL_CMD_GPU_QUERY_ECC_STATUS (0x2080012fU)
+
+typedef struct NV2080_CTRL_GPU_QUERY_ECC_STATUS_PARAMS {
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GPU_QUERY_ECC_UNIT_STATUS units[NV2080_CTRL_GPU_ECC_UNIT_COUNT], 8);
+    NvBool bFatalPoisonError;
+    NvU8   uncorrectableError;
+    NvU32  flags;
+    NV_DECLARE_ALIGNED(NV2080_CTRL_GR_ROUTE_INFO grRouteInfo, 8);
+} NV2080_CTRL_GPU_QUERY_ECC_STATUS_PARAMS;
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h
new file mode 100644
index 000000000000..fb1f100deac4
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vmmu.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: MIT */
+
+/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. */
+
+#ifndef __NVRM_VMMU_H__
+#define __NVRM_VMMU_H__
+
+#include <nvrm/nvtypes.h>
+
+/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570.124.04 */
+
+/*
+ * NV2080_CTRL_CMD_GPU_GET_VMMU_SEGMENT_SIZE
+ *
+ * This command returns the VMMU page size
+ *
+ *   vmmuSegmentSize
+ *     Output parameter.
+ *     Returns the VMMU segment size (in bytes)
+ *
+ * Possible status values returned are:
+ *   NV_OK
+ *   NV_ERR_NOT_SUPPORTED
+ */
+#define NV2080_CTRL_CMD_GPU_GET_VMMU_SEGMENT_SIZE  (0x2080017eU) /* finn: Evaluated from "(FINN_NV20_SUBDEVICE_0_GPU_INTERFACE_ID << 8) | NV2080_CTRL_GPU_GET_VMMU_SEGMENT_SIZE_PARAMS_MESSAGE_ID" */
+
+#define NV2080_CTRL_GPU_GET_VMMU_SEGMENT_SIZE_PARAMS_MESSAGE_ID (0x7EU)
+
+typedef struct NV2080_CTRL_GPU_GET_VMMU_SEGMENT_SIZE_PARAMS {
+	NV_DECLARE_ALIGNED(NvU64 vmmuSegmentSize, 8);
+} NV2080_CTRL_GPU_GET_VMMU_SEGMENT_SIZE_PARAMS;
+
+#define NV2080_CTRL_GPU_VMMU_SEGMENT_SIZE_32MB     0x02000000U
+#define NV2080_CTRL_GPU_VMMU_SEGMENT_SIZE_64MB     0x04000000U
+#define NV2080_CTRL_GPU_VMMU_SEGMENT_SIZE_128MB    0x08000000U
+#define NV2080_CTRL_GPU_VMMU_SEGMENT_SIZE_256MB    0x10000000U
+#define NV2080_CTRL_GPU_VMMU_SEGMENT_SIZE_512MB    0x20000000U
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index b8008d8ee434..ce2728ce969b 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -95,4 +95,12 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 	__m->handle.ops->free_chids(__m->handle.pf_drvdata, o, s); \
 })
 
+#define nvidia_vgpu_mgr_alloc_fbmem(m, info) ({\
+	typeof(m) __m = (m); \
+	__m->handle.ops->alloc_fbmem(__m->handle.pf_drvdata, info); \
+})
+
+#define nvidia_vgpu_mgr_free_fbmem(m, h) \
+	((m)->handle.ops->free_fbmem(h))
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 52b946469043..7025c7e2b9ac 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -105,7 +105,70 @@ static int setup_chids(struct nvidia_vgpu *vgpu)
 
 	vgpu_debug(vgpu, "alloc guest channel offset %u size %u\n", chid->chid_offset,
 		   chid->num_chid);
+	return 0;
+}
+
+static void clean_fbmem_heap(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+
+	vgpu_debug(vgpu, "free guest FB memory, offset 0x%llx size 0x%llx\n",
+		   vgpu->fbmem_heap->addr, vgpu->fbmem_heap->size);
+
+	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, vgpu->fbmem_heap);
+	vgpu->fbmem_heap = NULL;
+}
+
+static int get_alloc_fbmem_size(struct nvidia_vgpu *vgpu, u64 *size)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_info *info = &vgpu->info;
+	struct nvidia_vgpu_type *type = info->vgpu_type;
+	u64 fb_length;
+
+	if (!vgpu_mgr->ecc_enabled) {
+		*size = type->fb_length;
+		return 0;
+	}
+
+	if (!info->vgpu_type->ecc_supported) {
+		vgpu_error(vgpu, "ECC is enabled. vGPU type %s doesn't support ECC!\n",
+			   type->vgpu_type_name);
+		return -ENODEV;
+	}
 
+	/* Re-calculate the FB memory length when ECC is enabled. */
+	fb_length = ALIGN(vgpu_mgr->total_fbmem_size, vgpu_mgr->vmmu_segment_size);
+	fb_length = fb_length / type->max_instance - type->fb_reservation - type->gsp_heap_size;
+	fb_length = min(type->fb_length, fb_length);
+	fb_length = ALIGN_DOWN(fb_length, vgpu_mgr->vmmu_segment_size);
+
+	*size = fb_length;
+	return 0;
+}
+
+static int setup_fbmem_heap(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_alloc_fbmem_info info = {0};
+	struct nvidia_vgpu_mem *mem;
+	int ret;
+
+	ret = get_alloc_fbmem_size(vgpu, &info.size);
+	if (ret)
+		return ret;
+
+	info.align = vgpu_mgr->vmmu_segment_size;
+
+	vgpu_debug(vgpu, "alloc guest FB memory, size 0x%llx\n", info.size);
+
+	mem = nvidia_vgpu_mgr_alloc_fbmem(vgpu_mgr, &info);
+	if (IS_ERR(mem))
+		return PTR_ERR(mem);
+
+	vgpu_debug(vgpu, "guest FB memory offset 0x%llx size 0x%llx\n", mem->addr, mem->size);
+
+	vgpu->fbmem_heap = mem;
 	return 0;
 }
 
@@ -120,6 +183,7 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	clean_fbmem_heap(vgpu);
 	clean_chids(vgpu);
 	unregister_vgpu(vgpu);
 
@@ -164,12 +228,18 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		goto err_setup_chids;
 
+	ret = setup_fbmem_heap(vgpu);
+	if (ret)
+		goto err_setup_fbmem_heap;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
 
+err_setup_fbmem_heap:
+	clean_chids(vgpu);
 err_setup_chids:
 	unregister_vgpu(vgpu);
 
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index 8565bb881fda..e8b670308b21 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -6,6 +6,9 @@
 #include "debug.h"
 #include "vgpu_mgr.h"
 
+#include <nvrm/vmmu.h>
+#include <nvrm/ecc.h>
+
 static void clean_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	if (vgpu_mgr->use_chid_alloc_bitmap) {
@@ -104,6 +107,39 @@ static void attach_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr,
 	handle_data->vfio.pf_detach_handle_fn = pf_detach_handle_fn;
 }
 
+static int get_vmmu_segment_size(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	NV2080_CTRL_GPU_GET_VMMU_SEGMENT_SIZE_PARAMS *ctrl;
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_rd(vgpu_mgr, &vgpu_mgr->gsp_client,
+					  NV2080_CTRL_CMD_GPU_GET_VMMU_SEGMENT_SIZE,
+					  sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	vgpu_mgr->vmmu_segment_size = ctrl->vmmuSegmentSize;
+
+	nvidia_vgpu_mgr_rm_ctrl_done(vgpu_mgr, &vgpu_mgr->gsp_client, ctrl);
+
+	return 0;
+}
+
+static int get_ecc_status(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	NV2080_CTRL_GPU_QUERY_ECC_STATUS_PARAMS *ctrl;
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_rd(vgpu_mgr, &vgpu_mgr->gsp_client,
+					  NV2080_CTRL_CMD_GPU_QUERY_ECC_STATUS,
+					  sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	vgpu_mgr->ecc_enabled = ctrl->units[0].enabled;
+
+	nvidia_vgpu_mgr_rm_ctrl_done(vgpu_mgr, &vgpu_mgr->gsp_client, ctrl);
+	return 0;
+}
+
 static int setup_chid_alloc_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	if (WARN_ON(!vgpu_mgr->use_chid_alloc_bitmap))
@@ -120,11 +156,27 @@ static int setup_chid_alloc_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
 
 static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
+	int ret;
+
+	ret = get_vmmu_segment_size(vgpu_mgr);
+	if (ret)
+		return ret;
+
+	ret = get_ecc_status(vgpu_mgr);
+	if (ret)
+		return ret;
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM] VMMU segment size: 0x%llx\n",
+		       vgpu_mgr->vmmu_segment_size);
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM] ECC enabled: %d\n", vgpu_mgr->ecc_enabled);
+
 	vgpu_mgr->total_avail_chids = nvidia_vgpu_mgr_get_avail_chids(vgpu_mgr);
 	vgpu_mgr->total_fbmem_size = nvidia_vgpu_mgr_get_total_fbmem_size(vgpu_mgr);
 
-	vgpu_mgr_debug(vgpu_mgr, "total avail chids %u\n", vgpu_mgr->total_avail_chids);
-	vgpu_mgr_debug(vgpu_mgr, "total fbmem size 0x%llx\n", vgpu_mgr->total_fbmem_size);
+	vgpu_mgr_debug(vgpu_mgr, "[core driver] total avail chids %u\n",
+		       vgpu_mgr->total_avail_chids);
+	vgpu_mgr_debug(vgpu_mgr, "[core driver] total fbmem size 0x%llx\n",
+		       vgpu_mgr->total_fbmem_size);
 
 	return vgpu_mgr->use_chid_alloc_bitmap ? setup_chid_alloc_bitmap(vgpu_mgr) : 0;
 }
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 5a7a6103a677..356779404cc2 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -59,6 +59,7 @@ struct nvidia_vgpu_chid {
  * @info: vGPU info
  * @vgpu_mgr: pointer to vGPU manager
  * @chid: vGPU channel IDs
+ * @fbmem_heap: allocated FB memory for the vGPU
  */
 struct nvidia_vgpu {
 	/* Per-vGPU lock */
@@ -71,6 +72,7 @@ struct nvidia_vgpu {
 	struct nvidia_vgpu_mgr *vgpu_mgr;
 
 	struct nvidia_vgpu_chid chid;
+	struct nvidia_vgpu_mem *fbmem_heap;
 };
 
 /**
@@ -80,6 +82,8 @@ struct nvidia_vgpu {
  * @handle: the driver handle
  * @total_avail_chids: total available channel IDs
  * @total_fbmem_size: total FB memory size
+ * @vmmu_segment_size: VMMU segment size
+ * @ecc_enabled: ECC is enabled in the GPU
  * @vgpu_major: vGPU major version
  * @vgpu_minor: vGPU minor version
  * @vgpu_list_lock: lock to protect vGPU list
@@ -99,6 +103,10 @@ struct nvidia_vgpu_mgr {
 	u32 total_avail_chids;
 	u64 total_fbmem_size;
 
+	/* GSP RM configurations */
+	u64 vmmu_segment_size;
+	bool ecc_enabled;
+
 	u64 vgpu_major;
 	u64 vgpu_minor;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 06/14] vfio/nvidia-vgpu: allocate mgmt heap when creating vGPUs
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (4 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 05/14] vfio/nvidia-vgpu: allocate vGPU FB memory " Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 07/14] vfio/nvidia-vgpu: map mgmt heap when creating a vGPU Zhi Wang
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

The mgmt heap is a block of shared FBMEM between the GSP firmware and
the vGPU host. It is used for supporting vGPU RPCs, vGPU logging.

Creating a vGPU requires allocating a mgmt heap from the FBMEM. The size
of the mgmt heap that a vGPU requires is from the vGPU type.

Acquire the size of mgmt heap from the vGPU type. Allocate the mgmt
heap from nvkm when creating a vGPU.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 42 +++++++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h |  7 +++++
 2 files changed, 49 insertions(+)

diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 7025c7e2b9ac..53c2da0645b3 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -172,6 +172,41 @@ static int setup_fbmem_heap(struct nvidia_vgpu *vgpu)
 	return 0;
 }
 
+static void clean_mgmt_heap(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
+
+	vgpu_debug(vgpu, "free mgmt heap, offset 0x%llx size 0x%llx\n", mgmt->heap_mem->addr,
+		   mgmt->heap_mem->size);
+
+	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, mgmt->heap_mem);
+	mgmt->heap_mem = NULL;
+}
+
+static int setup_mgmt_heap(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
+	struct nvidia_vgpu_info *info = &vgpu->info;
+	struct nvidia_vgpu_type *vgpu_type = info->vgpu_type;
+	struct nvidia_vgpu_alloc_fbmem_info alloc_info = {0};
+	struct nvidia_vgpu_mem *mem;
+
+	alloc_info.size = vgpu_type->gsp_heap_size;
+
+	vgpu_debug(vgpu, "alloc mgmt heap, size 0x%llx\n", alloc_info.size);
+
+	mem = nvidia_vgpu_mgr_alloc_fbmem(vgpu_mgr, &alloc_info);
+	if (IS_ERR(mem))
+		return PTR_ERR(mem);
+
+	vgpu_debug(vgpu, "mgmt heap offset 0x%llx size 0x%llx\n", mem->addr, mem->size);
+
+	mgmt->heap_mem = mem;
+	return 0;
+}
+
 /**
  * nvidia_vgpu_mgr_destroy_vgpu - destroy a vGPU instance
  * @vgpu: the vGPU instance going to be destroyed.
@@ -183,6 +218,7 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	clean_mgmt_heap(vgpu);
 	clean_fbmem_heap(vgpu);
 	clean_chids(vgpu);
 	unregister_vgpu(vgpu);
@@ -232,12 +268,18 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		goto err_setup_fbmem_heap;
 
+	ret = setup_mgmt_heap(vgpu);
+	if (ret)
+		goto err_setup_mgmt_heap;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
 
+err_setup_mgmt_heap:
+	clean_fbmem_heap(vgpu);
 err_setup_fbmem_heap:
 	clean_chids(vgpu);
 err_setup_chids:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 356779404cc2..facecd060856 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -49,6 +49,11 @@ struct nvidia_vgpu_chid {
 	u32 num_plugin_channels;
 };
 
+struct nvidia_vgpu_mgmt {
+	struct nvidia_vgpu_mem *heap_mem;
+	/* more to come */
+};
+
 /**
  * struct nvidia_vgpu - per-vGPU state
  *
@@ -60,6 +65,7 @@ struct nvidia_vgpu_chid {
  * @vgpu_mgr: pointer to vGPU manager
  * @chid: vGPU channel IDs
  * @fbmem_heap: allocated FB memory for the vGPU
+ * @mgmt: vGPU mgmt heap
  */
 struct nvidia_vgpu {
 	/* Per-vGPU lock */
@@ -73,6 +79,7 @@ struct nvidia_vgpu {
 
 	struct nvidia_vgpu_chid chid;
 	struct nvidia_vgpu_mem *fbmem_heap;
+	struct nvidia_vgpu_mgmt mgmt;
 };
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 07/14] vfio/nvidia-vgpu: map mgmt heap when creating a vGPU
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (5 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 06/14] vfio/nvidia-vgpu: allocate mgmt heap " Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 08/14] vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs Zhi Wang
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

The mgmt heap is a block of shared FB memory between the GSP firmware
and the vGPU host. It is used for supporting vGPU RPCs, vGPU logging.

To access the data structures of vGPU RPCs and vGPU logging, the mgmt
heap FB memory needs to mapped into BAR1 and the region in the BAR1 is
required to be mapped into CPU vaddr.

Map the mgmt heap FB memory into BAR1 and map the related BAR1 region
into CPU vaddr. Initialize the pointers to the mgmt heap FB memory.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/pf.h       |  6 ++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 23 ++++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c | 26 +++++++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h | 17 +++++++++++++++-
 4 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index ce2728ce969b..167296ba7e3d 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -103,4 +103,10 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 #define nvidia_vgpu_mgr_free_fbmem(m, h) \
 	((m)->handle.ops->free_fbmem(h))
 
+#define nvidia_vgpu_mgr_bar1_map_mem(m, mem, info) \
+	((m)->handle.ops->bar1_map_mem(mem, info))
+
+#define nvidia_vgpu_mgr_bar1_unmap_mem(m, mem) \
+	((m)->handle.ops->bar1_unmap_mem(mem))
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 53c2da0645b3..4c106a9803f6 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -177,10 +177,14 @@ static void clean_mgmt_heap(struct nvidia_vgpu *vgpu)
 	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
 	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
 
+	nvidia_vgpu_mgr_bar1_unmap_mem(vgpu_mgr, mgmt->heap_mem);
+
 	vgpu_debug(vgpu, "free mgmt heap, offset 0x%llx size 0x%llx\n", mgmt->heap_mem->addr,
 		   mgmt->heap_mem->size);
 
 	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, mgmt->heap_mem);
+	mgmt->init_task_log_vaddr = mgmt->vgpu_task_log_vaddr = NULL;
+	mgmt->ctrl_vaddr = mgmt->kernel_log_vaddr = NULL;
 	mgmt->heap_mem = NULL;
 }
 
@@ -191,7 +195,9 @@ static int setup_mgmt_heap(struct nvidia_vgpu *vgpu)
 	struct nvidia_vgpu_info *info = &vgpu->info;
 	struct nvidia_vgpu_type *vgpu_type = info->vgpu_type;
 	struct nvidia_vgpu_alloc_fbmem_info alloc_info = {0};
+	struct nvidia_vgpu_map_mem_info map_info = {0};
 	struct nvidia_vgpu_mem *mem;
+	int ret;
 
 	alloc_info.size = vgpu_type->gsp_heap_size;
 
@@ -203,6 +209,23 @@ static int setup_mgmt_heap(struct nvidia_vgpu *vgpu)
 
 	vgpu_debug(vgpu, "mgmt heap offset 0x%llx size 0x%llx\n", mem->addr, mem->size);
 
+	map_info.map_size = vgpu_mgr->comm_buff_size;
+
+	ret = nvidia_vgpu_mgr_bar1_map_mem(vgpu_mgr, mem, &map_info);
+	if (ret) {
+		nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, mem);
+		return ret;
+	}
+
+	vgpu_debug(vgpu, "mgmt heap mapped\n");
+
+	mgmt->ctrl_vaddr = mem->bar1_vaddr;
+	mgmt->init_task_log_vaddr = mgmt->ctrl_vaddr +
+				    vgpu_mgr->init_task_log_offset;
+	mgmt->vgpu_task_log_vaddr = mgmt->init_task_log_vaddr +
+				    vgpu_mgr->init_task_log_size;
+	mgmt->kernel_log_vaddr = mgmt->vgpu_task_log_vaddr +
+				 vgpu_mgr->vgpu_task_log_size;
 	mgmt->heap_mem = mem;
 	return 0;
 }
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index e8b670308b21..cf5dd9a8e258 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -154,6 +154,30 @@ static int setup_chid_alloc_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
 	return 0;
 }
 
+static void init_gsp_rm_constraints(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	vgpu_mgr->comm_buff_size = (3 * SZ_4K) + SZ_2M + SZ_4K + SZ_128K + SZ_256K + SZ_64K;
+	vgpu_mgr->init_task_log_offset = (3 * SZ_4K) + SZ_2M + SZ_4K;
+	vgpu_mgr->init_task_log_size = SZ_128K;
+	vgpu_mgr->vgpu_task_log_size = SZ_256K;
+	vgpu_mgr->kernel_log_size = SZ_64K;
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM constraint] comm_buff_size 0x%llx\n",
+		       vgpu_mgr->comm_buff_size);
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM constraint] init_task_log_offset 0x%llx\n",
+		       vgpu_mgr->init_task_log_offset);
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM constraint] init_task_log size 0x%llx\n",
+		       vgpu_mgr->init_task_log_size);
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM constraint] vgpu_task_log size 0x%llx\n",
+		       vgpu_mgr->vgpu_task_log_size);
+
+	vgpu_mgr_debug(vgpu_mgr, "[GSP RM constraint] kernel_log size 0x%llx\n",
+		       vgpu_mgr->kernel_log_size);
+}
+
 static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	int ret;
@@ -178,6 +202,8 @@ static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 	vgpu_mgr_debug(vgpu_mgr, "[core driver] total fbmem size 0x%llx\n",
 		       vgpu_mgr->total_fbmem_size);
 
+	init_gsp_rm_constraints(vgpu_mgr);
+
 	return vgpu_mgr->use_chid_alloc_bitmap ? setup_chid_alloc_bitmap(vgpu_mgr) : 0;
 }
 
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index facecd060856..9a3af35e5eee 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -51,7 +51,10 @@ struct nvidia_vgpu_chid {
 
 struct nvidia_vgpu_mgmt {
 	struct nvidia_vgpu_mem *heap_mem;
-	/* more to come */
+	void __iomem *ctrl_vaddr;
+	void __iomem *init_task_log_vaddr;
+	void __iomem *vgpu_task_log_vaddr;
+	void __iomem *kernel_log_vaddr;
 };
 
 /**
@@ -91,6 +94,11 @@ struct nvidia_vgpu {
  * @total_fbmem_size: total FB memory size
  * @vmmu_segment_size: VMMU segment size
  * @ecc_enabled: ECC is enabled in the GPU
+ * @comm_buff_size: communication buffer size of mgmt heap
+ * @init_task_log_offset: offset of init task log in mgmt heap
+ * @init_task_log_size: size of init task size in mgmt heap
+ * @vgpu_task_log_size: size of vgpu task log size in mgmt heap
+ * @kernel_log_size: size of kernel log size in mgmt heap
  * @vgpu_major: vGPU major version
  * @vgpu_minor: vGPU minor version
  * @vgpu_list_lock: lock to protect vGPU list
@@ -114,6 +122,13 @@ struct nvidia_vgpu_mgr {
 	u64 vmmu_segment_size;
 	bool ecc_enabled;
 
+	/* GSP RM constraints */
+	u64 comm_buff_size;
+	u64 init_task_log_offset;
+	u64 init_task_log_size;
+	u64 vgpu_task_log_size;
+	u64 kernel_log_size;
+
 	u64 vgpu_major;
 	u64 vgpu_minor;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 08/14] vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (6 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 07/14] vfio/nvidia-vgpu: map mgmt heap when creating a vGPU Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 09/14] vfio/nvidia-vgpu: bootload the new vGPU Zhi Wang
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

A GSP RM client is required when talking to the GSP firmware via GSP RM
controls.

So far, all the vGPU GSP RPCs are sent via the GSP RM client allocated
for vGPU manager and some vGPU GSP RPCs needs a per-vGPU GSP RM client.

Allocate a dedicated GSP RM client for each vGPU.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 11 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 4c106a9803f6..cf28367ac6a0 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -238,9 +238,12 @@ static int setup_mgmt_heap(struct nvidia_vgpu *vgpu)
  */
 int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 {
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu->gsp_client);
 	clean_mgmt_heap(vgpu);
 	clean_fbmem_heap(vgpu);
 	clean_chids(vgpu);
@@ -262,6 +265,7 @@ EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_destroy_vgpu);
  */
 int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 {
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
 	struct nvidia_vgpu_info *info = &vgpu->info;
 	int ret;
 
@@ -295,12 +299,19 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		goto err_setup_mgmt_heap;
 
+	ret = nvidia_vgpu_mgr_alloc_gsp_client(vgpu_mgr,
+					       &vgpu->gsp_client);
+	if (ret)
+		goto err_alloc_gsp_client;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
 
+err_alloc_gsp_client:
+	clean_mgmt_heap(vgpu);
 err_setup_mgmt_heap:
 	clean_fbmem_heap(vgpu);
 err_setup_fbmem_heap:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 9a3af35e5eee..84bafea295a0 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -66,6 +66,7 @@ struct nvidia_vgpu_mgmt {
  * @vgpu_list: list node to the vGPU list
  * @info: vGPU info
  * @vgpu_mgr: pointer to vGPU manager
+ * @gsp_client: per-vGPU GSP client
  * @chid: vGPU channel IDs
  * @fbmem_heap: allocated FB memory for the vGPU
  * @mgmt: vGPU mgmt heap
@@ -79,6 +80,7 @@ struct nvidia_vgpu {
 
 	struct nvidia_vgpu_info info;
 	struct nvidia_vgpu_mgr *vgpu_mgr;
+	struct nvidia_vgpu_gsp_client gsp_client;
 
 	struct nvidia_vgpu_chid chid;
 	struct nvidia_vgpu_mem *fbmem_heap;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 09/14] vfio/nvidia-vgpu: bootload the new vGPU
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (7 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 08/14] vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel Zhi Wang
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

All the resources that required by a new vGPU has been set up. It is time
to activate it.

Send the NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK
GSP RPC to activate the new vGPU.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 .../pci/nvidia-vgpu/include/nvrm/bootload.h   | 58 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/pf.h             | 10 ++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           | 97 +++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       | 36 ++++++-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  2 +
 5 files changed, 202 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h

diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h
new file mode 100644
index 000000000000..ec0cb03f27e8
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/bootload.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: MIT */
+
+/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. */
+
+#ifndef __NVRM_BOOTLOAD_H__
+#define __NVRM_BOOTLOAD_H__
+
+#include <nvrm/nvtypes.h>
+
+/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570.124.04 */
+
+#define NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK (0x20804001)
+
+#define NV2080_CTRL_MAX_VMMU_SEGMENTS                                   384
+
+/* Must match NV2080_ENGINE_TYPE_LAST from cl2080.h */
+#define NV2080_GPU_MAX_ENGINES                                          0x54
+
+typedef struct NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS {
+	NvU32  dbdf;
+	NvU32  gfid;
+	NvU32  vgpuType;
+	NvU32  vmPid;
+	NvU32  swizzId;
+	NvU32  numChannels;
+	NvU32  numPluginChannels;
+	NvU32  chidOffset[NV2080_GPU_MAX_ENGINES];
+	NvBool bDisableDefaultSmcExecPartRestore;
+	NvU32  numGuestFbSegments;
+	NV_DECLARE_ALIGNED(NvU64 guestFbPhysAddrList[NV2080_CTRL_MAX_VMMU_SEGMENTS], 8);
+	NV_DECLARE_ALIGNED(NvU64 guestFbLengthList[NV2080_CTRL_MAX_VMMU_SEGMENTS], 8);
+	NV_DECLARE_ALIGNED(NvU64 pluginHeapMemoryPhysAddr, 8);
+	NV_DECLARE_ALIGNED(NvU64 pluginHeapMemoryLength, 8);
+	NV_DECLARE_ALIGNED(NvU64 ctrlBuffOffset, 8);
+	NV_DECLARE_ALIGNED(NvU64 initTaskLogBuffOffset, 8);
+	NV_DECLARE_ALIGNED(NvU64 initTaskLogBuffSize, 8);
+	NV_DECLARE_ALIGNED(NvU64 vgpuTaskLogBuffOffset, 8);
+	NV_DECLARE_ALIGNED(NvU64 vgpuTaskLogBuffSize, 8);
+	NV_DECLARE_ALIGNED(NvU64 kernelLogBuffOffset, 8);
+	NV_DECLARE_ALIGNED(NvU64 kernelLogBuffSize, 8);
+	NV_DECLARE_ALIGNED(NvU64 migRmHeapMemoryPhysAddr, 8);
+	NV_DECLARE_ALIGNED(NvU64 migRmHeapMemoryLength, 8);
+	NvBool bDeviceProfilingEnabled;
+} NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS;
+
+#define NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_SHUTDOWN_GSP_VGPU_PLUGIN_TASK (0x20804002)
+
+typedef struct NV2080_CTRL_VGPU_MGR_INTERNAL_SHUTDOWN_GSP_VGPU_PLUGIN_TASK_PARAMS {
+	NvU32 gfid;
+} NV2080_CTRL_VGPU_MGR_INTERNAL_SHUTDOWN_GSP_VGPU_PLUGIN_TASK_PARAMS;
+
+#define NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_VGPU_PLUGIN_CLEANUP (0x20804008)
+
+typedef struct NV2080_CTRL_VGPU_MGR_INTERNAL_VGPU_PLUGIN_CLEANUP_PARAMS {
+	NvU32 gfid;
+} NV2080_CTRL_VGPU_MGR_INTERNAL_VGPU_PLUGIN_CLEANUP_PARAMS;
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index 167296ba7e3d..d081d8e718e1 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -109,4 +109,14 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 #define nvidia_vgpu_mgr_bar1_unmap_mem(m, mem) \
 	((m)->handle.ops->bar1_unmap_mem(mem))
 
+#define nvidia_vgpu_mgr_get_engine_bitmap_size(m) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->get_engine_bitmap_size(__m->handle.pf_drvdata); \
+})
+
+#define nvidia_vgpu_mgr_get_engine_bitmap(m, bitmap) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->get_engine_bitmap(__m->handle.pf_drvdata, bitmap); \
+})
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index cf28367ac6a0..5778365c051f 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -8,6 +8,8 @@
 #include "debug.h"
 #include "vgpu_mgr.h"
 
+#include <nvrm/bootload.h>
+
 static void unregister_vgpu(struct nvidia_vgpu *vgpu)
 {
 	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
@@ -230,6 +232,93 @@ static int setup_mgmt_heap(struct nvidia_vgpu *vgpu)
 	return 0;
 }
 
+static int shutdown_vgpu_plugin_task(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	NV2080_CTRL_VGPU_MGR_INTERNAL_SHUTDOWN_GSP_VGPU_PLUGIN_TASK_PARAMS *ctrl;
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_get(vgpu_mgr, &vgpu->gsp_client,
+			NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_SHUTDOWN_GSP_VGPU_PLUGIN_TASK,
+			sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	ctrl->gfid = vgpu->info.gfid;
+
+	return nvidia_vgpu_mgr_rm_ctrl_wr(vgpu_mgr, &vgpu->gsp_client,
+					  ctrl);
+}
+
+static int cleanup_vgpu_plugin_task(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	NV2080_CTRL_VGPU_MGR_INTERNAL_VGPU_PLUGIN_CLEANUP_PARAMS *ctrl;
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_get(vgpu_mgr, &vgpu->gsp_client,
+			NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_VGPU_PLUGIN_CLEANUP,
+			sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	ctrl->gfid = vgpu->info.gfid;
+
+	return nvidia_vgpu_mgr_rm_ctrl_wr(vgpu_mgr, &vgpu->gsp_client,
+					  ctrl);
+}
+
+static int bootload_vgpu_plugin_task(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
+	NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS *ctrl;
+	int ret, i;
+
+	vgpu_debug(vgpu, "bootload\n");
+
+	ctrl = nvidia_vgpu_mgr_rm_ctrl_get(vgpu_mgr, &vgpu->gsp_client,
+			NV2080_CTRL_CMD_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK,
+			sizeof(*ctrl));
+	if (IS_ERR(ctrl))
+		return PTR_ERR(ctrl);
+
+	ctrl->dbdf = vgpu->info.dbdf;
+	ctrl->gfid = vgpu->info.gfid;
+	ctrl->vmPid = vgpu->info.vm_pid;
+	ctrl->swizzId = 0;
+	ctrl->numChannels = vgpu->chid.num_chid;
+	ctrl->numPluginChannels = vgpu->chid.num_plugin_channels;
+
+	for_each_set_bit(i, vgpu_mgr->engine_bitmap, NV2080_GPU_MAX_ENGINES)
+		ctrl->chidOffset[i] = vgpu->chid.chid_offset;
+
+	ctrl->bDisableDefaultSmcExecPartRestore = false;
+	ctrl->numGuestFbSegments = 1;
+	ctrl->guestFbPhysAddrList[0] = vgpu->fbmem_heap->addr;
+	ctrl->guestFbLengthList[0] = vgpu->fbmem_heap->size;
+	ctrl->pluginHeapMemoryPhysAddr = mgmt->heap_mem->addr;
+	ctrl->pluginHeapMemoryLength = mgmt->heap_mem->size;
+	ctrl->ctrlBuffOffset = 0;
+	ctrl->initTaskLogBuffOffset = mgmt->heap_mem->addr +
+				      vgpu_mgr->init_task_log_offset;
+	ctrl->initTaskLogBuffSize = vgpu_mgr->init_task_log_size;
+	ctrl->vgpuTaskLogBuffOffset = ctrl->initTaskLogBuffOffset +
+				      ctrl->initTaskLogBuffSize;
+	ctrl->vgpuTaskLogBuffSize = vgpu_mgr->vgpu_task_log_size;
+	ctrl->kernelLogBuffOffset = ctrl->vgpuTaskLogBuffOffset +
+				      ctrl->vgpuTaskLogBuffSize;
+	ctrl->kernelLogBuffSize = vgpu_mgr->kernel_log_size;
+
+	ctrl->bDeviceProfilingEnabled = false;
+
+	ret = nvidia_vgpu_mgr_rm_ctrl_wr(vgpu_mgr, &vgpu->gsp_client,
+					 ctrl);
+	if (ret)
+		return ret;
+
+	vgpu_debug(vgpu, "bootloading\n");
+	return 0;
+}
+
 /**
  * nvidia_vgpu_mgr_destroy_vgpu - destroy a vGPU instance
  * @vgpu: the vGPU instance going to be destroyed.
@@ -243,6 +332,8 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	WARN_ON(shutdown_vgpu_plugin_task(vgpu));
+	WARN_ON(cleanup_vgpu_plugin_task(vgpu));
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu->gsp_client);
 	clean_mgmt_heap(vgpu);
 	clean_fbmem_heap(vgpu);
@@ -304,12 +395,18 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		goto err_alloc_gsp_client;
 
+	ret = bootload_vgpu_plugin_task(vgpu);
+	if (ret)
+		goto err_bootload_vgpu_plugin_task;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
 
+err_bootload_vgpu_plugin_task:
+	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu->gsp_client);
 err_alloc_gsp_client:
 	clean_mgmt_heap(vgpu);
 err_setup_mgmt_heap:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index cf5dd9a8e258..6338dd9c86b6 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -15,6 +15,9 @@ static void clean_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 		bitmap_free(vgpu_mgr->chid_alloc_bitmap);
 		vgpu_mgr->chid_alloc_bitmap = NULL;
 	}
+
+	kvfree(vgpu_mgr->engine_bitmap);
+	vgpu_mgr->engine_bitmap = NULL;
 }
 
 static void vgpu_mgr_release(struct kref *kref)
@@ -140,6 +143,25 @@ static int get_ecc_status(struct nvidia_vgpu_mgr *vgpu_mgr)
 	return 0;
 }
 
+static int setup_engine_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	u64 size;
+
+	size = nvidia_vgpu_mgr_get_engine_bitmap_size(vgpu_mgr);
+
+	if (WARN_ON(!size))
+		return -EINVAL;
+
+	vgpu_mgr->engine_bitmap = kvmalloc(ALIGN(size, 8), GFP_KERNEL);
+	if (!vgpu_mgr->engine_bitmap)
+		return -ENOMEM;
+
+	vgpu_mgr_debug(vgpu_mgr, "[core driver] engine bitmap size: 0x%llx\n", size);
+
+	nvidia_vgpu_mgr_get_engine_bitmap(vgpu_mgr, vgpu_mgr->engine_bitmap);
+	return 0;
+}
+
 static int setup_chid_alloc_bitmap(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	if (WARN_ON(!vgpu_mgr->use_chid_alloc_bitmap))
@@ -194,6 +216,10 @@ static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 		       vgpu_mgr->vmmu_segment_size);
 	vgpu_mgr_debug(vgpu_mgr, "[GSP RM] ECC enabled: %d\n", vgpu_mgr->ecc_enabled);
 
+	ret = setup_engine_bitmap(vgpu_mgr);
+	if (ret)
+		return ret;
+
 	vgpu_mgr->total_avail_chids = nvidia_vgpu_mgr_get_avail_chids(vgpu_mgr);
 	vgpu_mgr->total_fbmem_size = nvidia_vgpu_mgr_get_total_fbmem_size(vgpu_mgr);
 
@@ -204,7 +230,15 @@ static int init_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 
 	init_gsp_rm_constraints(vgpu_mgr);
 
-	return vgpu_mgr->use_chid_alloc_bitmap ? setup_chid_alloc_bitmap(vgpu_mgr) : 0;
+	if (vgpu_mgr->use_chid_alloc_bitmap) {
+		ret = setup_chid_alloc_bitmap(vgpu_mgr);
+		if (ret) {
+			kvfree(vgpu_mgr->engine_bitmap);
+			vgpu_mgr->engine_bitmap = NULL;
+			return ret;
+		}
+	}
+	return 0;
 }
 
 static int setup_pf_driver_caps(struct nvidia_vgpu_mgr *vgpu_mgr, unsigned long *caps)
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 84bafea295a0..323acf52068e 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -34,6 +34,7 @@ struct nvidia_vgpu_info {
 	u32 gfid;
 	u32 dbdf;
 	struct nvidia_vgpu_type *vgpu_type;
+	u32 vm_pid;
 };
 
 /**
@@ -119,6 +120,7 @@ struct nvidia_vgpu_mgr {
 	/* core driver configurations */
 	u32 total_avail_chids;
 	u64 total_fbmem_size;
+	void *engine_bitmap;
 
 	/* GSP RM configurations */
 	u64 vmmu_segment_size;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (8 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 09/14] vfio/nvidia-vgpu: bootload the new vGPU Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:36   ` Timur Tabi
  2025-09-03 22:11 ` [RFC v2 11/14] vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver Zhi Wang
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang,
	Timur Tabi

A newly created vGPU requires some runtime configuration to be uploaded
before moving on.

Introduce the vGPU host RPCs manipulation APIs to send vGPU RPCs.
Send vGPU RPCs to upload the runtime configuration of a vGPU.

Cc: Timur Tabi <ttabi@nvidia.com>
Cc: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/Makefile         |   2 +-
 .../nvidia-vgpu/include/nvrm/nv_vgpu_types.h  |  34 +++
 .../vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h  | 182 +++++++++++++
 drivers/vfio/pci/nvidia-vgpu/rpc.c            | 254 ++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           |   8 +
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       |  31 +++
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  22 ++
 7 files changed, 532 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/rpc.c

diff --git a/drivers/vfio/pci/nvidia-vgpu/Makefile b/drivers/vfio/pci/nvidia-vgpu/Makefile
index 94ba4ed4e131..91e57c65ca27 100644
--- a/drivers/vfio/pci/nvidia-vgpu/Makefile
+++ b/drivers/vfio/pci/nvidia-vgpu/Makefile
@@ -2,4 +2,4 @@
 subdir-ccflags-y += -I$(src)/include
 
 obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia_vgpu_vfio_pci.o
-nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o
+nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o rpc.o
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h
new file mode 100644
index 000000000000..fc067caf5dea
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/nv_vgpu_types.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: MIT */
+
+/* Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. */
+
+#ifndef __NVRM_VGPU_TYPES_H__
+#define __NVRM_VGPU_TYPES_H__
+
+/* Excerpt of RM headers from https://github.com/NVIDIA/open-gpu-kernel-modules/tree/570.124.04 */
+
+#include <nvrm/nvtypes.h>
+
+#define VM_UUID_SIZE            16
+#define INVALID_VGPU_DEV_INST   0xFFFFFFFFU
+#define MAX_VGPU_DEVICES_PER_VM 16U
+
+/* This enum represents the current state of guest dependent fields */
+typedef enum GUEST_VM_INFO_STATE {
+	GUEST_VM_INFO_STATE_UNINITIALIZED = 0,
+	GUEST_VM_INFO_STATE_INITIALIZED = 1,
+} GUEST_VM_INFO_STATE;
+
+/* This enum represents types of VM identifiers */
+typedef enum VM_ID_TYPE {
+	VM_ID_DOMAIN_ID = 0,
+	VM_ID_UUID = 1,
+} VM_ID_TYPE;
+
+/* This structure represents VM identifier */
+typedef union VM_ID {
+	NvU8 vmUuid[VM_UUID_SIZE];
+	NV_DECLARE_ALIGNED(NvU64 vmId, 8);
+} VM_ID;
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h
new file mode 100644
index 000000000000..d28af74e9603
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/include/nvrm/vgpu.h
@@ -0,0 +1,182 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+#ifndef __NVRM_VGPU_H__
+#define __NVRM_VGPU_H__
+
+#include <nvrm/nv_vgpu_types.h>
+
+#define VMIOPD_MAX_INSTANCES 16
+#define VMIOPD_MAX_HEADS     4
+
+#define GSP_PLUGIN_BOOTLOADED 0x4E654A6F
+
+/*
+ *   GSP Plugin heap memory layout
+ * +--------------------------------+ offset = 0
+ * |         CONTROL BUFFER         |
+ * +--------------------------------+
+ * |        RESPONSE BUFFER         |
+ * +--------------------------------+
+ * |         MESSAGE BUFFER         |
+ * +--------------------------------+
+ * |        MIGRATION BUFFER        |
+ * +--------------------------------+
+ * |    GSP PLUGIN ERROR BUFFER     |
+ * +--------------------------------+
+ * |    INIT TASK LOG BUFFER        |
+ * +--------------------------------+
+ * |    VGPU TASK LOG BUFFER        |
+ * +--------------------------------+
+ * |       KERNEL LOG BUFFER        |
+ * +--------------------------------+
+ * |      MEMORY AVAILABLE FOR      |
+ * | GSP PLUGIN INTERNAL HEAP USAGE |
+ * +--------------------------------+
+ */
+#define VGPU_CPU_GSP_CTRL_BUFF_VERSION              0x1
+#define VGPU_CPU_GSP_CTRL_BUFF_REGION_SIZE          4096
+#define VGPU_CPU_GSP_RESPONSE_BUFF_REGION_SIZE      4096
+#define VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE       4096
+#define VGPU_CPU_GSP_MIGRATION_BUFF_REGION_SIZE     (2 * 1024 * 1024)
+#define VGPU_CPU_GSP_ERROR_BUFF_REGION_SIZE         4096
+#define VGPU_CPU_GSP_INIT_TASK_LOG_BUFF_REGION_SIZE (128 * 1024)
+#define VGPU_CPU_GSP_VGPU_TASK_LOG_BUFF_REGION_SIZE (256 * 1024)
+#define VGPU_CPU_GSP_KERNEL_TASK_LOG_BUFF_REGION_SIZE (64 * 1024)
+#define VGPU_CPU_GSP_COMMUNICATION_BUFF_TOTAL_SIZE  (VGPU_CPU_GSP_CTRL_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_RESPONSE_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_MIGRATION_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_ERROR_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_INIT_TASK_LOG_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_VGPU_TASK_LOG_BUFF_REGION_SIZE + \
+		VGPU_CPU_GSP_KERNEL_TASK_LOG_BUFF_REGION_SIZE)
+
+typedef union {
+	NvU8 buf[VGPU_CPU_GSP_CTRL_BUFF_REGION_SIZE];
+	struct {
+		NvU32  version;
+		NvU32  message_type;
+		NvU32  message_seq_num;
+		NvU64  response_buff_offset;
+		NvU64  message_buff_offset;
+		NvU64  migration_buff_offset;
+		NvU64  error_buff_offset;
+		NvU32  migration_buf_cpu_access_offset;
+		NvBool is_migration_in_progress;
+		NvU32  error_buff_cpu_get_idx;
+		NvU32 attached_vgpu_count;
+		struct {
+			NvU32 vgpu_type_id;
+			NvU32 host_gpu_pci_id;
+			NvU32 pci_dev_id;
+			NvU8  vgpu_uuid[VM_UUID_SIZE];
+		} host_info[VMIOPD_MAX_INSTANCES];
+	};
+} VGPU_CPU_GSP_CTRL_BUFF_REGION;
+
+enum {
+	NV_VGPU_CPU_RPC_MSG_VERSION_NEGOTIATION = 1,
+	NV_VGPU_CPU_RPC_MSG_SETUP_CONFIG_PARAMS_AND_INIT,
+	NV_VGPU_CPU_RPC_MSG_RESET,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_STOP_WORK,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_CANCEL_STOP,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_SAVE_STATE,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_CANCEL_SAVE,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_RESTORE_STATE,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_RESTORE_DEFERRED_STATE,
+	NV_VGPU_CPU_RPC_MSG_MIGRATION_RESUME_WORK,
+	NV_VGPU_CPU_RPC_MSG_CONSOLE_VNC_STATE,
+	NV_VGPU_CPU_RPC_MSG_VF_BAR0_REG_ACCESS,
+	NV_VGPU_CPU_RPC_MSG_UPDATE_BME_STATE,
+	NV_VGPU_CPU_RPC_MSG_GET_GUEST_INFO,
+	NV_VGPU_CPU_RPC_MSG_MAX,
+};
+
+typedef struct {
+	NvU32 version_cpu;
+	NvU32 version_negotiated;
+} NV_VGPU_CPU_RPC_DATA_VERSION_NEGOTIATION;
+
+typedef struct {
+	NvU8   vgpu_uuid[VM_UUID_SIZE];
+	NvU32  dbdf;
+	NvU32  driver_vm_vf_dbdf;
+	NvU32  vgpu_device_instance_id;
+	NvU32  vgpu_type;
+	NvU32  vm_pid;
+	NvU32  swizz_id;
+	NvU32  num_channels;
+	NvU32  num_plugin_channels;
+	NvU32  vmm_cap;
+	NvU32  migration_feature;
+	NvU32  hypervisor_type;
+	NvU32  host_cpu_arch;
+	NvU64  host_page_size;
+	NvBool rev1[3];
+	NvBool enable_uvm;
+	NvBool linux_interrupt_optimization;
+	NvBool vmm_migration_supported;
+	NvBool rev2;
+	NvBool enable_console_vnc;
+	NvBool use_non_stall_linux_events;
+	NvBool rev3[3];
+	NvU16  placement_id;
+	NvU32  rev4;
+	NvU32  channel_usage_threshold_percentage;
+	NvBool rev5;
+	NvU32  rev6;
+	NvBool rev7;
+} NV_VGPU_CPU_RPC_DATA_COPY_CONFIG_PARAMS;
+
+typedef struct {
+	NvBool enable;
+	NvBool allowed;
+} NV_VGPU_CPU_RPC_DATA_UPDATE_BME_STATE;
+
+typedef union {
+	NvU8 buf[VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE];
+	NV_VGPU_CPU_RPC_DATA_VERSION_NEGOTIATION    version_data;
+	NV_VGPU_CPU_RPC_DATA_UPDATE_BME_STATE       bme_state;
+} VGPU_CPU_GSP_MSG_BUFF_REGION;
+
+typedef struct {
+	NvU64 sequence_update_start;
+	NvU64 sequence_update_end;
+	NvU32 effective_fb_page_size;
+	NvU32 rect_width;
+	NvU32 rect_height;
+	NvU32 surface_width;
+	NvU32 surface_height;
+	NvU32 surface_size;
+	NvU32 surface_offset;
+	NvU32 surface_format;
+	NvU32 surface_kind;
+	NvU32 surface_pitch;
+	NvU32 surface_type;
+	NvU8  surface_block_height;
+	NvBool is_blanking_enabled;
+	NvBool is_flip_pending;
+	NvBool is_free_pending;
+	NvBool is_memory_blocklinear;
+} VGPU_CPU_GSP_DISPLAYLESS_SURFACE;
+
+typedef union {
+	NvU8 buf[VGPU_CPU_GSP_RESPONSE_BUFF_REGION_SIZE];
+	struct {
+		NvU32  message_seq_num_received;
+		NvU32  message_seq_num_processed;
+		NvU32  result_code;
+		NvU32  guest_rpc_version;
+		NvU32  migration_buf_gsp_access_offset;
+		NvU32  migration_state_save_complete;
+		VGPU_CPU_GSP_DISPLAYLESS_SURFACE surface[VMIOPD_MAX_HEADS];
+		NvU32  error_buff_gsp_put_idx;
+		NvU32  grid_license_state;
+		NvU32  guest_os_type;
+		NvU32  frl_config;
+	};
+} VGPU_CPU_GSP_RESPONSE_BUFF_REGION;
+
+#endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/rpc.c b/drivers/vfio/pci/nvidia-vgpu/rpc.c
new file mode 100644
index 000000000000..1d326e194872
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/rpc.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/delay.h>
+#include <linux/kernel.h>
+
+#include <nvrm/vgpu.h>
+
+#include "debug.h"
+#include "vgpu_mgr.h"
+
+#define NV_VIRTUAL_FUNCTION_PRIV_DOORBELL (0xb80000 + 0x2200)
+
+static void trigger_doorbell(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+
+	u32 v = vgpu->info.gfid * 32 + 17;
+
+	writel(v, vgpu_mgr->bar0_vaddr + NV_VIRTUAL_FUNCTION_PRIV_DOORBELL);
+	readl(vgpu_mgr->bar0_vaddr + NV_VIRTUAL_FUNCTION_PRIV_DOORBELL);
+}
+
+static void send_rpc_request(struct nvidia_vgpu *vgpu, u32 msg_type,
+			     void *data, u64 size)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	VGPU_CPU_GSP_CTRL_BUFF_REGION *ctrl_buf = rpc->ctrl_buf;
+
+	if (data && size)
+		memcpy_toio(rpc->msg_buf, data, size);
+
+	writel(msg_type, &ctrl_buf->message_type);
+
+	rpc->msg_seq_num++;
+	writel(rpc->msg_seq_num, &ctrl_buf->message_seq_num);
+
+	trigger_doorbell(vgpu);
+}
+
+static int wait_for_response(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	VGPU_CPU_GSP_RESPONSE_BUFF_REGION *resp_buf = rpc->resp_buf;
+
+	u64 timeout = 120 * 1000000; /* 120s */
+
+	do {
+		if (readl(&resp_buf->message_seq_num_processed) == rpc->msg_seq_num)
+			break;
+
+		usleep_range(1, 2);
+	} while (--timeout);
+
+	return timeout ? 0 : -ETIMEDOUT;
+}
+
+static int recv_rpc_response(struct nvidia_vgpu *vgpu, void *data,
+			     u64 size, u32 *result)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	VGPU_CPU_GSP_RESPONSE_BUFF_REGION *resp_buf = rpc->resp_buf;
+	int ret;
+
+	ret = wait_for_response(vgpu);
+	if (result)
+		*result = resp_buf->result_code;
+
+	if (ret)
+		return ret;
+
+	if (data && size)
+		memcpy_fromio(data, rpc->msg_buf, size);
+
+	return 0;
+}
+
+/**
+ * nvidia_vgpu_rpc_call - vGPU host RPC call.
+ * @vgpu: the vGPU instance.
+ * @msg_type: the RPC message.
+ * @data: the RPC data.
+ * @size: the RPC size.
+ *
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_rpc_call(struct nvidia_vgpu *vgpu, u32 msg_type,
+			 void *data, u64 size)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	u32 result;
+	int ret;
+
+	if (WARN_ON(msg_type >= NV_VGPU_CPU_RPC_MSG_MAX) ||
+	    size > VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE ||
+	    (size != 0 && !data))
+		return -EINVAL;
+
+	mutex_lock(&rpc->lock);
+
+	send_rpc_request(vgpu, msg_type, data, size);
+	ret = recv_rpc_response(vgpu, data, size, &result);
+
+	mutex_unlock(&rpc->lock);
+	if (ret || result) {
+		vgpu_error(vgpu, "fail to recv RPC: result %u\n",
+			   result);
+		return -EINVAL;
+	}
+	return ret;
+}
+
+/**
+ * nvidia_vgpu_clean_rpc - clean vGPU host RPCl
+ * @vgpu: the vGPU instance.
+ */
+void nvidia_vgpu_clean_rpc(struct nvidia_vgpu *vgpu)
+{
+}
+
+static void init_rpc_buf_pointers(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+
+	rpc->ctrl_buf = mgmt->ctrl_vaddr;
+	rpc->resp_buf = rpc->ctrl_buf + VGPU_CPU_GSP_CTRL_BUFF_REGION_SIZE;
+	rpc->msg_buf = rpc->resp_buf + VGPU_CPU_GSP_RESPONSE_BUFF_REGION_SIZE;
+	rpc->migration_buf = rpc->msg_buf + VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE;
+	rpc->error_buf = rpc->migration_buf + VGPU_CPU_GSP_MIGRATION_BUFF_REGION_SIZE;
+}
+
+static void init_ctrl_buf_offsets(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	VGPU_CPU_GSP_CTRL_BUFF_REGION *ctrl_buf;
+	u64 offset = 0;
+
+	ctrl_buf = rpc->ctrl_buf;
+
+	writel(VGPU_CPU_GSP_CTRL_BUFF_VERSION, &ctrl_buf->version);
+
+	offset = VGPU_CPU_GSP_CTRL_BUFF_REGION_SIZE;
+	writeq(offset, &ctrl_buf->response_buff_offset);
+
+	offset += VGPU_CPU_GSP_RESPONSE_BUFF_REGION_SIZE;
+	writeq(offset, &ctrl_buf->message_buff_offset);
+
+	offset += VGPU_CPU_GSP_MESSAGE_BUFF_REGION_SIZE;
+	writeq(offset, &ctrl_buf->migration_buff_offset);
+
+	offset += VGPU_CPU_GSP_MIGRATION_BUFF_REGION_SIZE;
+	writeq(offset, &ctrl_buf->error_buff_offset);
+}
+
+static int wait_vgpu_plugin_task_bootloaded(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	VGPU_CPU_GSP_CTRL_BUFF_REGION *ctrl_buf = rpc->ctrl_buf;
+
+	u64 timeout = 10 * 1000000; /* 10 s */
+
+	do {
+		if (readl(&ctrl_buf->message_seq_num) == GSP_PLUGIN_BOOTLOADED)
+			break;
+
+		usleep_range(1, 2);
+	} while (--timeout);
+
+	return timeout ? 0 : -ETIMEDOUT;
+}
+
+static int negotiate_rpc_version(struct nvidia_vgpu *vgpu)
+{
+	return nvidia_vgpu_rpc_call(vgpu, NV_VGPU_CPU_RPC_MSG_VERSION_NEGOTIATION,
+				    NULL, 0);
+}
+
+/* Initial snapshot of config params */
+static const unsigned char config_params[] = {
+	0x24, 0xef, 0x8f, 0xf7, 0x3e, 0xd5, 0x11, 0xef, 0xae, 0x36, 0x97, 0x58,
+	0xb1, 0xcb, 0x0c, 0x87, 0x04, 0xc1, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+	0x14, 0x00, 0xd0, 0xc1, 0x65, 0x03, 0x00, 0x00, 0xa1, 0x0e, 0x00, 0x00,
+	0xff, 0xff, 0xff, 0xff, 0x40, 0x00, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00,
+	0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x04, 0x00, 0x00, 0x00,
+	0x02, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+	0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00,
+	0x00, 0x00, 0x00, 0x00
+};
+
+static int send_config_params_and_init(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_info *info = &vgpu->info;
+	NV_VGPU_CPU_RPC_DATA_COPY_CONFIG_PARAMS params = {0};
+
+	memcpy(&params, config_params, sizeof(config_params));
+
+	params.dbdf = vgpu->info.dbdf;
+	params.vgpu_device_instance_id =
+		nvidia_vgpu_mgr_get_gsp_client_handle(vgpu_mgr, &vgpu->gsp_client);
+	params.vgpu_type = info->vgpu_type->vgpu_type;
+	params.vm_pid = vgpu->info.vm_pid;
+	params.swizz_id = 0;
+	params.num_channels = vgpu->chid.num_chid;
+	params.num_plugin_channels = vgpu->chid.num_plugin_channels;
+
+	return nvidia_vgpu_rpc_call(vgpu, NV_VGPU_CPU_RPC_MSG_SETUP_CONFIG_PARAMS_AND_INIT,
+				    &params, sizeof(params));
+}
+
+/**
+ * nvidia_vgpu_setup_rpc - setup the vGPU host RPC channel and send runtime
+ * configuration.
+ * @vgpu: the vGPU instance.
+ *
+ * Returns: zero on success, others on failure.
+ */
+int nvidia_vgpu_setup_rpc(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_rpc *rpc = &vgpu->rpc;
+	int ret;
+
+	mutex_init(&rpc->lock);
+
+	init_rpc_buf_pointers(vgpu);
+	init_ctrl_buf_offsets(vgpu);
+
+	ret = wait_vgpu_plugin_task_bootloaded(vgpu);
+	if (ret) {
+		vgpu_error(vgpu, "host_rpc: waiting bootload timeout\n");
+		return ret;
+	}
+
+	vgpu_debug(vgpu, "bootloaded\n");
+
+	ret = negotiate_rpc_version(vgpu);
+	if (ret) {
+		vgpu_error(vgpu, "host_rpc: fail to negotiate rpc version\n");
+		return ret;
+	}
+
+	ret = send_config_params_and_init(vgpu);
+	if (ret) {
+		vgpu_error(vgpu, "host_rpc: fail to init vgpu plugin task\n");
+		return ret;
+	}
+
+	vgpu_debug(vgpu, "active\n");
+
+	return 0;
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 5778365c051f..9e8ea77bbcc5 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -332,6 +332,7 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu)
 	if (!atomic_cmpxchg(&vgpu->status, 1, 0))
 		return -ENODEV;
 
+	nvidia_vgpu_clean_rpc(vgpu);
 	WARN_ON(shutdown_vgpu_plugin_task(vgpu));
 	WARN_ON(cleanup_vgpu_plugin_task(vgpu));
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu->gsp_client);
@@ -399,12 +400,19 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	if (ret)
 		goto err_bootload_vgpu_plugin_task;
 
+	ret = nvidia_vgpu_setup_rpc(vgpu);
+	if (ret)
+		goto err_setup_rpc;
+
 	atomic_set(&vgpu->status, 1);
 
 	vgpu_debug(vgpu, "created\n");
 
 	return 0;
 
+err_setup_rpc:
+	shutdown_vgpu_plugin_task(vgpu);
+	cleanup_vgpu_plugin_task(vgpu);
 err_bootload_vgpu_plugin_task:
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu->gsp_client);
 err_alloc_gsp_client:
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index 6338dd9c86b6..6f53bd7ca940 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -9,6 +9,29 @@
 #include <nvrm/vmmu.h>
 #include <nvrm/ecc.h>
 
+static void unmap_pf_mmio(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	iounmap(vgpu_mgr->bar0_vaddr);
+}
+
+static int map_pf_mmio(struct nvidia_vgpu_mgr *vgpu_mgr)
+{
+	struct pci_dev *pdev = vgpu_mgr->pdev;
+	resource_size_t start, size;
+	void *vaddr;
+
+	start = pci_resource_start(pdev, 0);
+	size = pci_resource_len(pdev, 0);
+
+	vaddr = ioremap(start, size);
+	if (!vaddr)
+		return -ENOMEM;
+
+	vgpu_mgr->bar0_vaddr = vaddr;
+
+	return 0;
+}
+
 static void clean_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr)
 {
 	if (vgpu_mgr->use_chid_alloc_bitmap) {
@@ -30,6 +53,7 @@ static void vgpu_mgr_release(struct kref *kref)
 	if (WARN_ON(atomic_read(&vgpu_mgr->num_vgpus)))
 		return;
 
+	unmap_pf_mmio(vgpu_mgr);
 	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
 	clean_vgpu_mgr(vgpu_mgr);
 	nvidia_vgpu_mgr_free_gsp_client(vgpu_mgr, &vgpu_mgr->gsp_client);
@@ -73,6 +97,7 @@ static struct nvidia_vgpu_mgr *alloc_vgpu_mgr(struct nvidia_vgpu_mgr_handle *han
 		return ERR_PTR(-ENOMEM);
 
 	vgpu_mgr->handle = *handle;
+	vgpu_mgr->pdev = handle->pf_pdev;
 
 	kref_init(&vgpu_mgr->refcount);
 	mutex_init(&vgpu_mgr->vgpu_list_lock);
@@ -295,6 +320,10 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 	if (ret)
 		goto fail_setup_metadata;
 
+	ret = map_pf_mmio(vgpu_mgr);
+	if (ret)
+		goto fail_map_pf_mmio;
+
 	attach_vgpu_mgr(vgpu_mgr, handle_data);
 
 	ret = attach_data->init_vfio_fn(vgpu_mgr, attach_data->init_vfio_fn_data);
@@ -307,6 +336,8 @@ static int pf_attach_handle_fn(void *handle, struct nvidia_vgpu_vfio_handle_data
 
 fail_init_fn:
 	detach_vgpu_mgr(handle_data);
+	unmap_pf_mmio(vgpu_mgr);
+fail_map_pf_mmio:
 	nvidia_vgpu_mgr_clean_metadata(vgpu_mgr);
 fail_setup_metadata:
 	clean_vgpu_mgr(vgpu_mgr);
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index 323acf52068e..fe475f8b2882 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -58,6 +58,17 @@ struct nvidia_vgpu_mgmt {
 	void __iomem *kernel_log_vaddr;
 };
 
+struct nvidia_vgpu_rpc {
+	/* RPC channel lock */
+	struct mutex lock;
+	u32 msg_seq_num;
+	void __iomem *ctrl_buf;
+	void __iomem *resp_buf;
+	void __iomem *msg_buf;
+	void __iomem *migration_buf;
+	void __iomem *error_buf;
+};
+
 /**
  * struct nvidia_vgpu - per-vGPU state
  *
@@ -71,6 +82,7 @@ struct nvidia_vgpu_mgmt {
  * @chid: vGPU channel IDs
  * @fbmem_heap: allocated FB memory for the vGPU
  * @mgmt: vGPU mgmt heap
+ * @rpc: vGPU host RPC
  */
 struct nvidia_vgpu {
 	/* Per-vGPU lock */
@@ -86,6 +98,7 @@ struct nvidia_vgpu {
 	struct nvidia_vgpu_chid chid;
 	struct nvidia_vgpu_mem *fbmem_heap;
 	struct nvidia_vgpu_mgmt mgmt;
+	struct nvidia_vgpu_rpc rpc;
 };
 
 /**
@@ -112,6 +125,8 @@ struct nvidia_vgpu {
  * @num_vgpu_types: number of installed vGPU types
  * @use_alloc_bitmap: use chid allocator for the PF driver doesn't support chid allocation
  * @chid_alloc_bitmap: chid allocator bitmap
+ * @pdev: the PCI device pointer
+ * @bar0_vaddr: the virtual address of BAR0
  */
 struct nvidia_vgpu_mgr {
 	struct kref refcount;
@@ -147,6 +162,9 @@ struct nvidia_vgpu_mgr {
 
 	bool use_chid_alloc_bitmap;
 	void *chid_alloc_bitmap;
+
+	struct pci_dev *pdev;
+	void __iomem *bar0_vaddr;
 };
 
 #define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
@@ -160,5 +178,9 @@ int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_setup_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
 void nvidia_vgpu_mgr_clean_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
+int nvidia_vgpu_rpc_call(struct nvidia_vgpu *vgpu, u32 msg_type,
+			 void *data, u64 size);
+void nvidia_vgpu_clean_rpc(struct nvidia_vgpu *vgpu);
+int nvidia_vgpu_setup_rpc(struct nvidia_vgpu *vgpu);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel
  2025-09-03 22:11 ` [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel Zhi Wang
@ 2025-09-03 22:36   ` Timur Tabi
  0 siblings, 0 replies; 23+ messages in thread
From: Timur Tabi @ 2025-09-03 22:36 UTC (permalink / raw)
  To: kvm@vger.kernel.org, Zhi Wang
  Cc: Jason Gunthorpe, Surath Mitra, alex.williamson@redhat.com,
	Andy Currid, Tarun Gupta (SW-GPU), airlied@gmail.com,
	zhiwang@kernel.org, dakr@kernel.org, Ankit Agrawal,
	kevin.tian@intel.com, daniel@ffwll.ch, Neo Jia, Aniket Agashe,
	Kirti Wankhede

On Wed, 2025-09-03 at 15:11 -0700, Zhi Wang wrote:
> +typedef struct {
> +	NvU8   vgpu_uuid[VM_UUID_SIZE];
> +	NvU32  dbdf;
> +	NvU32  driver_vm_vf_dbdf;
> +	NvU32  vgpu_device_instance_id;
> +	NvU32  vgpu_type;
> +	NvU32  vm_pid;
> +	NvU32  swizz_id;
> +	NvU32  num_channels;
> +	NvU32  num_plugin_channels;
> +	NvU32  vmm_cap;
> +	NvU32  migration_feature;
> +	NvU32  hypervisor_type;
> +	NvU32  host_cpu_arch;
> +	NvU64  host_page_size;

> +	NvBool rev1[3];
> +	NvBool enable_uvm;
> +	NvBool linux_interrupt_optimization;
> +	NvBool vmm_migration_supported;
> +	NvBool rev2;
> +	NvBool enable_console_vnc;
> +	NvBool use_non_stall_linux_events;
> +	NvBool rev3[3];

This is 12 bytes

> +	NvU16  placement_id;

This is 2 bytes, for a total of 14 so far ...


> +	NvU32  rev4;

This is misaligned.

> +	NvU32  channel_usage_threshold_percentage;
> +	NvBool rev5;
> +	NvU32  rev6;
> +	NvBool rev7;
> +} NV_VGPU_CPU_RPC_DATA_COPY_CONFIG_PARAMS;

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC v2 11/14] vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (9 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 12/14] vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU Zhi Wang
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

A VFIO variant driver module is designed to extend the capabilities of
the existing VFIO (Virtual Function I/O), offering device management
interfaces to the userspace and advanced feature support.

For the userspace to use the NVIDIA vGPU, a new vGPU VFIO variant driver
is introduced to provide vGPU management, like selecting/creating vGPU
instance, support advance features like live migration.

Introduce the NVIDIA vGPU VFIO variant driver to support vGPU lifecycle
management UABI and the future advancd features.

Cc: Aniket Agashe <aniketa@nvidia.com>
Cc: Ankit Agrawal <ankita@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 .../ABI/stable/sysfs-driver-nvidia-vgpu       |  11 +
 drivers/vfio/pci/nvidia-vgpu/Makefile         |   3 +-
 drivers/vfio/pci/nvidia-vgpu/debug.h          |  10 +
 drivers/vfio/pci/nvidia-vgpu/vfio.h           |  49 ++
 drivers/vfio/pci/nvidia-vgpu/vfio_access.c    | 313 ++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_main.c      | 688 ++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c     | 209 ++++++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c           |  53 +-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c       |  68 +-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h       |  29 +
 10 files changed, 1427 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/ABI/stable/sysfs-driver-nvidia-vgpu
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio.h
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_access.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_main.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c

diff --git a/Documentation/ABI/stable/sysfs-driver-nvidia-vgpu b/Documentation/ABI/stable/sysfs-driver-nvidia-vgpu
new file mode 100644
index 000000000000..1fc3ac8e234d
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-driver-nvidia-vgpu
@@ -0,0 +1,11 @@
+What:           /sys/devices/pciXXXX:XX/0000:XX:XX.X/nvidia/creatable_vgpu_types
+Date:		June 2, 2025
+KernelVersion:	6.17
+Contact:	kvm@vger.kernel.org
+Description:	Query the creatble vGPU types on a virtual function.
+
+What:           /sys/devices/pciXXXX:XX/0000:XX:XX.X/nvidia/current_vgpu_type
+Date:		June 2, 2025
+KernelVersion:	6.17
+Contact:	kvm@vger.kernel.org
+Description:	Set the vGPU type for the virtual function.
diff --git a/drivers/vfio/pci/nvidia-vgpu/Makefile b/drivers/vfio/pci/nvidia-vgpu/Makefile
index 91e57c65ca27..2aba9b4868aa 100644
--- a/drivers/vfio/pci/nvidia-vgpu/Makefile
+++ b/drivers/vfio/pci/nvidia-vgpu/Makefile
@@ -2,4 +2,5 @@
 subdir-ccflags-y += -I$(src)/include
 
 obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia_vgpu_vfio_pci.o
-nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o rpc.o
+nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o rpc.o \
+			  vfio_main.o vfio_access.o vfio_sysfs.o
diff --git a/drivers/vfio/pci/nvidia-vgpu/debug.h b/drivers/vfio/pci/nvidia-vgpu/debug.h
index db9288752384..05cb2ea13543 100644
--- a/drivers/vfio/pci/nvidia-vgpu/debug.h
+++ b/drivers/vfio/pci/nvidia-vgpu/debug.h
@@ -22,4 +22,14 @@
 	pci_err(__v->pdev, "nvidia-vgpu %d: "f, __v->info.id, ##a); \
 })
 
+#define nvdev_debug(n, f, a...) ({ \
+	typeof(n) __n = (n); \
+	pci_dbg(__n->core_dev.pdev, "nvidia-vgpu-vfio: "f, ##a); \
+})
+
+#define nvdev_error(n, f, a...) ({ \
+	typeof(n) __n = (n); \
+	pci_err(__n->core_dev.pdev, "nvidia-vgpu-vfio: "f, ##a); \
+})
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio.h b/drivers/vfio/pci/nvidia-vgpu/vfio.h
new file mode 100644
index 000000000000..4c9bf9c80f5c
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#ifndef _NVIDIA_VGPU_VFIO_H__
+#define _NVIDIA_VGPU_VFIO_H__
+
+#include <linux/vfio_pci_core.h>
+
+#include "vgpu_mgr.h"
+
+#define PCI_CONFIG_SPACE_LENGTH 4096
+
+#define CAP_LIST_NEXT_PTR_MSIX 0x7c
+#define MSIX_CAP_SIZE   0xc
+
+struct nvidia_vgpu_vfio {
+	struct vfio_pci_core_device core_dev;
+	u8 vconfig[PCI_CONFIG_SPACE_LENGTH];
+	void __iomem *bar0_map;
+
+	struct nvidia_vgpu_mgr *vgpu_mgr;
+	struct nvidia_vgpu_type *vgpu_type;
+
+	/* lock to protect vgpu pointer and following members */
+	struct mutex vfio_vgpu_lock;
+	struct nvidia_vgpu *vgpu;
+	bool vdev_is_opened;
+	bool driver_is_unbound;
+	struct pid *task_pid;
+	struct completion vdev_closing_completion;
+
+	struct nvidia_vgpu_event_listener pf_driver_event_listener;
+};
+
+static inline struct nvidia_vgpu_vfio *core_dev_to_nvdev(struct vfio_pci_core_device *core_dev)
+{
+	return container_of(core_dev, struct nvidia_vgpu_vfio, core_dev);
+}
+
+void nvidia_vgpu_vfio_setup_config(struct nvidia_vgpu_vfio *nvdev);
+ssize_t nvidia_vgpu_vfio_access(struct nvidia_vgpu_vfio *nvdev, char __user *buf, size_t count,
+				loff_t ppos, bool iswrite);
+
+int nvidia_vgpu_vfio_setup_sysfs(struct nvidia_vgpu_vfio *nvdev);
+void nvidia_vgpu_vfio_clean_sysfs(struct nvidia_vgpu_vfio *nvdev);
+
+#endif /* _NVIDIA_VGPU_VFIO_H__ */
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio_access.c b/drivers/vfio/pci/nvidia-vgpu/vfio_access.c
new file mode 100644
index 000000000000..4a72575264ba
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio_access.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/string.h>
+#include <linux/pci.h>
+#include <linux/pci_regs.h>
+
+#include "vfio.h"
+
+#define vconfig_set8(offset, v) \
+	(*(u8 *)(nvdev->vconfig + (offset)) = v)
+
+#define vconfig_set16(offset, v) \
+	(*(u16 *)(nvdev->vconfig + (offset)) = v)
+
+#define vconfig_set32(offset, v) \
+	(*(u32 *)(nvdev->vconfig + (offset)) = v)
+
+void nvidia_vgpu_vfio_setup_config(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_type *vgpu_type;
+	u8 val8;
+
+	lockdep_assert_held(&nvdev->vfio_vgpu_lock);
+
+	if (WARN_ON(!nvdev->vgpu_type))
+		return;
+
+	vgpu_type = nvdev->vgpu_type;
+
+	memset(nvdev->vconfig, 0, sizeof(nvdev->vconfig));
+
+	/* Header type 0 (normal devices) */
+	vconfig_set16(PCI_VENDOR_ID, PCI_VENDOR_ID_NVIDIA);
+	vconfig_set16(PCI_DEVICE_ID, FIELD_GET(GENMASK(31, 16), vgpu_type->vdev_id));
+	vconfig_set16(PCI_COMMAND, 0x0000);
+	vconfig_set16(PCI_STATUS, 0x0010);
+
+	pci_read_config_byte(nvdev->core_dev.pdev, PCI_CLASS_REVISION, &val8);
+	vconfig_set8(PCI_CLASS_REVISION, val8);
+
+	vconfig_set8(PCI_CLASS_PROG, 0); /* VGA-compatible */
+	vconfig_set8(PCI_CLASS_DEVICE, 0); /* VGA controller */
+	vconfig_set8(PCI_CLASS_DEVICE + 1, 3); /* Display controller */
+
+	/* BAR0: 32-bit */
+	vconfig_set32(PCI_BASE_ADDRESS_0, 0x00000000);
+	/* BAR1: 64-bit, prefetchable */
+	vconfig_set32(PCI_BASE_ADDRESS_1, 0x0000000c);
+	/* BAR2: 64-bit, prefetchable */
+	vconfig_set32(PCI_BASE_ADDRESS_3, 0x0000000c);
+	/* Disable BAR3: I/O */
+	vconfig_set32(PCI_BASE_ADDRESS_5, 0x00000000);
+
+	vconfig_set16(PCI_SUBSYSTEM_VENDOR_ID, PCI_VENDOR_ID_NVIDIA);
+	vconfig_set16(PCI_SUBSYSTEM_ID, FIELD_GET(GENMASK(15, 0),
+		      nvdev->vgpu->info.vgpu_type->vdev_id));
+
+	vconfig_set8(PCI_CAPABILITY_LIST, CAP_LIST_NEXT_PTR_MSIX);
+	vconfig_set8(CAP_LIST_NEXT_PTR_MSIX + 1, 0);
+
+	/* INTx disabled */
+	vconfig_set8(0x3d, 0);
+}
+
+#define PCI_CONFIG_READ(pdev, off, buf, size) \
+	do { \
+		switch (size) { \
+		case 4: pci_read_config_dword((pdev), (off), (u32 *)(buf)); break; \
+		case 2: pci_read_config_word((pdev), (off), (u16 *)(buf)); break; \
+		case 1: pci_read_config_byte((pdev), (off), (u8 *)(buf));  break; \
+		} \
+	} while (0)
+
+#define PCI_CONFIG_WRITE(pdev, off, buf, size) \
+	do { \
+		switch (size) { \
+		case 4: pci_write_config_dword((pdev), (off), *(u32 *)(buf)); break; \
+		case 2: pci_write_config_word((pdev), (off), *(u16 *)(buf)); break; \
+		case 1: pci_write_config_byte((pdev), (off), *(u8 *)(buf));  break; \
+		} \
+	} while (0)
+
+#define MMIO_READ(map, off, buf, size) \
+	do { \
+		switch (size) { \
+		case 4: { u32 val = ioread32((map) + (off)); memcpy((buf), &val, 4); break; } \
+		case 2: { u16 val = ioread16((map) + (off)); memcpy((buf), &val, 2); break; } \
+		case 1: { u8  val = ioread8((map) + (off)); memcpy((buf), &val, 1); break; } \
+		} \
+	} while (0)
+
+#define MMIO_WRITE(map, off, buf, size) \
+	do { \
+		switch (size) { \
+		case 4: iowrite32(*(u32 *)(buf), (map) + (off)); break; \
+		case 2: iowrite16(*(u16 *)(buf), (map) + (off)); break; \
+		case 1: iowrite8 (*(u8  *)(buf), (map) + (off)); break; \
+		} \
+	} while (0)
+
+static ssize_t bar0_rw(struct nvidia_vgpu_vfio *nvdev, char *buf, size_t count, loff_t ppos,
+		       bool iswrite)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	int index = VFIO_PCI_OFFSET_TO_INDEX(ppos);
+	loff_t offset = ppos;
+	void __iomem *map;
+	int ret;
+
+	if (WARN_ON(index != VFIO_PCI_BAR0_REGION_INDEX))
+		return -EINVAL;
+
+	offset &= VFIO_PCI_OFFSET_MASK;
+
+	if (!nvdev->bar0_map) {
+		ret = pci_request_selected_regions(pdev, 1 << index, "nvidia-vgpu-vfio");
+		if (ret)
+			return ret;
+
+		if (!(pci_resource_flags(pdev, index) & IORESOURCE_MEM)) {
+			pci_release_selected_regions(pdev, 1 << index);
+			return -EIO;
+		}
+
+		map = ioremap(pci_resource_start(pdev, index), pci_resource_len(pdev, index));
+		if (!map) {
+			pci_err(pdev, "Can't map BAR0 MMIO space\n");
+			pci_release_selected_regions(pdev, 1 << index);
+			return -ENOMEM;
+		}
+		nvdev->bar0_map = map;
+	} else {
+		map = nvdev->bar0_map;
+	}
+
+	if (iswrite)
+		MMIO_WRITE(map, offset, buf, count);
+	else
+		MMIO_READ(map, offset, buf, count);
+
+	return count;
+}
+
+/* Generate mask for 32-bit or 64-bit PCI BAR address range */
+#define GEN_BARMASK(size)        ((u32)((~(size) + 1) & ~0xFUL))
+#define GEN_BARMASK_HI(size)     ((u32)(((~(size) + 1) & ~0xFULL) >> 32))
+
+static u32 emulate_pci_base_reg_write(struct nvidia_vgpu_vfio *nvdev, loff_t offset, u32 cfg_addr)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	struct nvidia_vgpu_type *vgpu_type = nvdev->vgpu->info.vgpu_type;
+	u32 bar_mask;
+
+	switch (offset) {
+	case PCI_BASE_ADDRESS_0:
+		bar_mask = GEN_BARMASK(pci_resource_len(pdev, VFIO_PCI_BAR0_REGION_INDEX));
+		cfg_addr = (cfg_addr & bar_mask) | (nvdev->vconfig[offset] & 0xFUL);
+		break;
+
+	case PCI_BASE_ADDRESS_1:
+		bar_mask = GEN_BARMASK(vgpu_type->bar1_length * SZ_1M);
+		cfg_addr = (cfg_addr & bar_mask) | (nvdev->vconfig[offset] & 0xFUL);
+		break;
+
+	case PCI_BASE_ADDRESS_2:
+		bar_mask = GEN_BARMASK_HI(vgpu_type->bar1_length * SZ_1M);
+		cfg_addr &= bar_mask;
+		break;
+
+	case PCI_BASE_ADDRESS_3:
+		bar_mask = GEN_BARMASK(pci_resource_len(pdev, VFIO_PCI_BAR3_REGION_INDEX));
+		cfg_addr = (cfg_addr & bar_mask) | (nvdev->vconfig[offset] & 0xFUL);
+		break;
+
+	case PCI_BASE_ADDRESS_4:
+		bar_mask = GEN_BARMASK_HI(pci_resource_len(pdev, VFIO_PCI_BAR3_REGION_INDEX));
+		cfg_addr &= bar_mask;
+		break;
+
+	default:
+		WARN_ONCE(1, "Unsupported PCI BAR offset: %llx\n", offset);
+		return 0;
+	}
+
+	return cfg_addr;
+}
+
+static void handle_pci_config_read(struct nvidia_vgpu_vfio *nvdev, char *buf,
+				   size_t count, loff_t offset)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	u32 val = 0;
+
+	memcpy(buf, (u8 *)&nvdev->vconfig[offset], count);
+
+	switch (offset) {
+	case PCI_COMMAND:
+		PCI_CONFIG_READ(pdev, offset, (char *)&val, count);
+
+		switch (count) {
+		case 4:
+			val = (u32)(val & 0xFFFF0000) | (val &
+					(PCI_COMMAND_PARITY | PCI_COMMAND_SERR));
+			break;
+		case 2:
+			val = (val & (PCI_COMMAND_PARITY | PCI_COMMAND_SERR));
+			break;
+		default:
+			WARN_ONCE(1, "Not supported access len\n");
+			break;
+		}
+		break;
+	case PCI_STATUS:
+		PCI_CONFIG_READ(pdev, offset, (char *)&val, count);
+		break;
+	default:
+		break;
+	}
+	*(u32 *)buf = *(u32 *)buf | val;
+}
+
+static void handle_pci_config_write(struct nvidia_vgpu_vfio *nvdev, char *buf,
+				    size_t count, loff_t offset)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	u32 val = 0;
+	u32 cfg_addr;
+
+	switch (offset) {
+	case PCI_VENDOR_ID:
+	case PCI_DEVICE_ID:
+	case PCI_CAPABILITY_LIST:
+		break;
+
+	case PCI_STATUS:
+		PCI_CONFIG_WRITE(pdev, offset, buf, count);
+		break;
+	case PCI_COMMAND:
+		if (count == 4) {
+			val = (u32)((*(u32 *)buf & 0xFFFF0000) >> 16);
+			PCI_CONFIG_WRITE(pdev, PCI_STATUS, (char *)&val, 2);
+
+			val = (u32)(*(u32 *)buf & 0x0000FFFF);
+			*(u32 *)buf = val;
+		}
+
+		memcpy((u8 *)&nvdev->vconfig[offset], buf, count);
+		break;
+	case PCI_BASE_ADDRESS_0:
+	case PCI_BASE_ADDRESS_1:
+	case PCI_BASE_ADDRESS_2:
+	case PCI_BASE_ADDRESS_3:
+	case PCI_BASE_ADDRESS_4:
+		cfg_addr = *(u32 *)buf;
+		cfg_addr = emulate_pci_base_reg_write(nvdev, offset, cfg_addr);
+		*(u32 *)&nvdev->vconfig[offset] = cfg_addr;
+		break;
+	default:
+		break;
+	}
+}
+
+static ssize_t pci_config_rw(struct nvidia_vgpu_vfio *nvdev, char *buf, size_t count,
+			     loff_t ppos, bool iswrite)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	int index = VFIO_PCI_OFFSET_TO_INDEX(ppos);
+	loff_t offset = ppos;
+
+	if (WARN_ON(index != VFIO_PCI_CONFIG_REGION_INDEX))
+		return -EINVAL;
+
+	offset &= VFIO_PCI_OFFSET_MASK;
+
+	if (offset >= CAP_LIST_NEXT_PTR_MSIX &&
+	    offset < CAP_LIST_NEXT_PTR_MSIX + MSIX_CAP_SIZE) {
+		if (!iswrite)
+			PCI_CONFIG_READ(pdev, offset, buf, count);
+		else
+			PCI_CONFIG_WRITE(pdev, offset, buf, count);
+		return count;
+	}
+
+	if (!iswrite)
+		handle_pci_config_read(nvdev, buf, count, offset);
+	else
+		handle_pci_config_write(nvdev, buf, count, offset);
+
+	return count;
+}
+
+ssize_t nvidia_vgpu_vfio_access(struct nvidia_vgpu_vfio *nvdev, char *buf,
+				size_t count, loff_t ppos, bool iswrite)
+{
+	int index = VFIO_PCI_OFFSET_TO_INDEX(ppos);
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return pci_config_rw(nvdev, buf, count, ppos,
+				     iswrite);
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		return bar0_rw(nvdev, buf, count, ppos, iswrite);
+	default:
+		return -EINVAL;
+	}
+	return count;
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio_main.c b/drivers/vfio/pci/nvidia-vgpu/vfio_main.c
new file mode 100644
index 000000000000..b557062a4ac2
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio_main.c
@@ -0,0 +1,688 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/delay.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/types.h>
+
+#include "debug.h"
+#include "vfio.h"
+
+static inline struct vfio_pci_core_device *vdev_to_core_dev(struct vfio_device *vdev)
+{
+	return container_of(vdev, struct vfio_pci_core_device, vdev);
+}
+
+static int pdev_to_gfid(struct pci_dev *pdev)
+{
+	return pci_iov_vf_id(pdev) + 1;
+}
+
+static int destroy_vgpu(struct nvidia_vgpu_vfio *nvdev)
+{
+	int ret;
+
+	ret = nvidia_vgpu_mgr_destroy_vgpu(nvdev->vgpu);
+	if (ret)
+		return ret;
+
+	kfree(nvdev->vgpu);
+	nvdev->vgpu = NULL;
+	return 0;
+}
+
+static int create_vgpu(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+	struct nvidia_vgpu_type *type = nvdev->vgpu_type;
+	struct nvidia_vgpu *vgpu;
+	int ret;
+
+	if (WARN_ON(!type || !nvdev->task_pid))
+		return -ENODEV;
+
+	vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
+	if (!vgpu)
+		return -ENOMEM;
+
+	vgpu->info.id = pci_iov_vf_id(pdev);
+	vgpu->info.dbdf = (0 << 16) | pci_dev_id(pdev);
+	vgpu->info.gfid = pdev_to_gfid(pdev);
+	vgpu->info.vgpu_type = type;
+	vgpu->info.vm_pid = pid_nr(nvdev->task_pid);
+
+	vgpu->vgpu_mgr = vgpu_mgr;
+	vgpu->pdev = pdev;
+
+	ret = nvidia_vgpu_mgr_create_vgpu(vgpu);
+	if (ret) {
+		kfree(vgpu);
+		return ret;
+	}
+
+	nvdev->vgpu = vgpu;
+	return 0;
+}
+
+static inline bool pdev_is_present(struct pci_dev *pdev)
+{
+	struct pci_dev *physfn = (pdev->is_virtfn) ? pdev->physfn : pdev;
+
+	if (pdev->is_virtfn)
+		return (pci_device_is_present(physfn) &&
+				pdev->error_state != pci_channel_io_perm_failure);
+	else
+		return pci_device_is_present(physfn);
+}
+
+/* Wait till 1000 ms for HW that returns CRS completion status */
+#define MIN_FLR_WAIT_TIME 100
+#define MAX_FLR_WAIT_TIME 1000
+
+static int do_vf_flr(struct vfio_device *vdev)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct pci_dev *pdev = core_dev->pdev;
+	u32 data, elapsed_time = 0;
+
+	if (!pdev->is_virtfn)
+		return 0;
+
+	if (!pdev_is_present(pdev))
+		return -ENOTTY;
+
+	pcie_capability_read_dword(pdev, PCI_EXP_DEVCAP, &data);
+	if (!(data & PCI_EXP_DEVCAP_FLR)) {
+		nvdev_error(nvdev, "FLR capability not present on the VF.\n");
+		return -EINVAL;
+	}
+
+	device_lock(&pdev->dev);
+	pci_set_power_state(pdev, PCI_D0);
+	pci_save_state(pdev);
+
+	if (!pci_wait_for_pending_transaction(pdev))
+		nvdev_error(nvdev, "Timed out waiting for transaction pending to go to 0.\n");
+
+	pcie_capability_set_word(pdev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_BCR_FLR);
+
+	/*
+	 * If CRS-SV is supported and enabled, then the root-port returns '0001h'
+	 * for a PCI config read of the 16-byte vendor_id field. This indicates CRS
+	 * completion status.
+	 *
+	 * If CRS-SV is not supported/enabled, then the root-port will generally
+	 * synthesise ~0 data for any PCI config read.
+	 */
+	do {
+		msleep(MIN_FLR_WAIT_TIME);
+		elapsed_time += MIN_FLR_WAIT_TIME;
+
+		pci_read_config_dword(pdev, PCI_VENDOR_ID, &data);
+	} while (((data & 0xffff) == 0x0001) && (elapsed_time < MAX_FLR_WAIT_TIME));
+
+	if (elapsed_time < MAX_FLR_WAIT_TIME) {
+		/*
+		 * Device is back from the CRS-SV, continue checking
+		 * if device is ready by reading PCI_COMMAND.
+		 */
+		do {
+			pci_read_config_dword(pdev, PCI_COMMAND, &data);
+			if (data != ~0)
+				goto flr_done;
+
+			msleep(MIN_FLR_WAIT_TIME);
+			elapsed_time += MIN_FLR_WAIT_TIME;
+		} while (elapsed_time < MAX_FLR_WAIT_TIME);
+
+		nvdev_error(nvdev, "FLR failed non-CRS case, waited for %d ms\n", elapsed_time);
+	} else {
+		nvdev_error(nvdev, "FLR failed CRS case, waited for %d ms\n", elapsed_time);
+	}
+
+	/* Device is not usable. */
+	xchg(&pdev->error_state, pci_channel_io_perm_failure);
+	device_unlock(&pdev->dev);
+	return -ENOTTY;
+
+flr_done:
+	pci_restore_state(pdev);
+	device_unlock(&pdev->dev);
+
+	return 0;
+}
+
+static int nvidia_vgpu_vfio_open_device(struct vfio_device *vdev)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct pci_dev *pdev = core_dev->pdev;
+	u64 pf_dma_mask;
+	int ret;
+
+	nvdev_debug(nvdev, "open device\n");
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+	if (!nvdev->vgpu_type) {
+		nvdev_error(nvdev, "a vGPU type must be chosen before opening VFIO device\n");
+		ret = -ENODEV;
+		goto err_unlock;
+	}
+
+	if (nvdev->driver_is_unbound) {
+		nvdev_error(nvdev, "the driver has been torn down because PF driver is unbound "
+				   "or the admin is disabling the VF\n");
+		ret = -ENODEV;
+		goto err_unlock;
+	}
+
+	if (nvdev->vdev_is_opened) {
+		ret = -EBUSY;
+		goto err_unlock;
+	}
+
+	ret = pci_enable_device(pdev);
+	if (ret)
+		goto err_unlock;
+
+	pci_set_master(pdev);
+
+	pf_dma_mask = dma_get_mask(&pdev->physfn->dev);
+	dma_set_mask(&pdev->dev, pf_dma_mask);
+	dma_set_coherent_mask(&pdev->dev, pf_dma_mask);
+
+	ret = do_vf_flr(vdev);
+	if (ret)
+		goto err_reset_function;
+
+	nvdev->task_pid = get_task_pid(current, PIDTYPE_PID);
+
+	ret = create_vgpu(nvdev);
+	if (ret)
+		goto err_create_vgpu;
+
+	ret = nvidia_vgpu_mgr_set_bme(nvdev->vgpu, true);
+	if (ret)
+		goto err_enable_bme;
+
+	nvidia_vgpu_vfio_setup_config(nvdev);
+
+	nvdev->vdev_is_opened = true;
+	reinit_completion(&nvdev->vdev_closing_completion);
+
+	nvdev_debug(nvdev, "VFIO device is opened, client pid: %u\n", pid_nr(nvdev->task_pid));
+
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+	return 0;
+
+err_enable_bme:
+	destroy_vgpu(nvdev);
+err_create_vgpu:
+	put_pid(nvdev->task_pid);
+err_reset_function:
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+err_unlock:
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+	return ret;
+}
+
+static void nvidia_vgpu_vfio_close_device(struct vfio_device *vdev)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct pci_dev *pdev = core_dev->pdev;
+
+	nvdev_debug(nvdev, "VFIO device is closing, client pid: %u\n", pid_nr(nvdev->task_pid));
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+
+	if (nvdev->bar0_map) {
+		iounmap(nvdev->bar0_map);
+		pci_release_selected_regions(pdev, 1 << 0);
+		nvdev->bar0_map = NULL;
+	}
+
+	destroy_vgpu(nvdev);
+
+	put_pid(nvdev->task_pid);
+	nvdev->task_pid = NULL;
+
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+
+	nvdev->vdev_is_opened = false;
+	complete(&nvdev->vdev_closing_completion);
+
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+
+	nvdev_debug(nvdev, "VFIO device is closed\n");
+}
+
+static int get_region_info(struct vfio_pci_core_device *core_dev, unsigned long arg)
+{
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct pci_dev *pdev = core_dev->pdev;
+	struct vfio_region_info info;
+	unsigned long minsz;
+	int ret = 0;
+
+	minsz = offsetofend(struct vfio_region_info, offset);
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EINVAL;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	switch (info.index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+		info.size = PCI_CONFIG_SPACE_LENGTH;
+		info.flags = VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE;
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR4_REGION_INDEX:
+		info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+		info.size = pci_resource_len(pdev, info.index);
+
+		if (info.index == VFIO_PCI_BAR1_REGION_INDEX)
+			info.size = nvdev->vgpu->info.vgpu_type->bar1_length * SZ_1M;
+
+		if (!info.size) {
+			info.flags = 0;
+			break;
+		}
+		info.flags = VFIO_REGION_INFO_FLAG_READ |
+			VFIO_REGION_INFO_FLAG_WRITE |
+			VFIO_REGION_INFO_FLAG_MMAP;
+		break;
+	case VFIO_PCI_BAR5_REGION_INDEX:
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		info.size = 0;
+		break;
+	default:
+		if (info.index >= VFIO_PCI_NUM_REGIONS)
+			ret = -EINVAL;
+		break;
+	}
+
+	if (!ret)
+		ret = copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+
+	return ret;
+}
+
+static long nvidia_vgpu_vfio_ioctl(struct vfio_device *vdev, unsigned int cmd, unsigned long arg)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	int ret = 0;
+
+	if (WARN_ON(!nvdev->vgpu || !nvdev->vdev_is_opened))
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_REGION_INFO:
+		ret = get_region_info(core_dev, arg);
+		break;
+	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
+	case VFIO_DEVICE_PCI_HOT_RESET:
+		break;
+	case VFIO_DEVICE_RESET:
+		ret = nvidia_vgpu_mgr_reset_vgpu(nvdev->vgpu);
+		break;
+	default:
+		ret = vfio_pci_core_ioctl(vdev, cmd, arg);
+		break;
+	}
+	return ret;
+}
+
+static ssize_t nvidia_vgpu_vfio_read(struct vfio_device *vdev, char __user *buf, size_t count,
+				     loff_t *ppos)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	u64 val;
+	size_t done = 0;
+	int ret = 0, size;
+
+	if (WARN_ON(!nvdev->vgpu || !nvdev->vdev_is_opened))
+		return -ENODEV;
+
+	while (count) {
+		if (count >= 4 && !(*ppos % 4))
+			size = 4;
+		else if (count >= 2 && !(*ppos % 2))
+			size = 2;
+		else
+			size = 1;
+
+		ret = nvidia_vgpu_vfio_access(nvdev, (char *)&val, size, *ppos, false);
+
+		if (ret <= 0)
+			return ret;
+
+		if (copy_to_user(buf, &val, size) != 0)
+			return -EFAULT;
+
+		*ppos += size;
+		buf += size;
+		count -= size;
+		done += size;
+	}
+	return done;
+}
+
+static ssize_t nvidia_vgpu_vfio_write(struct vfio_device *vdev,
+				      const char __user *buf, size_t count,
+				      loff_t *ppos)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	u64 val;
+	size_t done = 0;
+	int ret = 0, size;
+
+	if (WARN_ON(!nvdev->vgpu || !nvdev->vdev_is_opened))
+		return -ENODEV;
+
+	while (count) {
+		if (count >= 4 && !(*ppos % 4))
+			size = 4;
+		else if (count >= 2 && !(*ppos % 2))
+			size = 2;
+		else
+			size = 1;
+
+		if (copy_from_user(&val, buf, size) != 0)
+			return -EFAULT;
+
+		ret = nvidia_vgpu_vfio_access(nvdev, (char *)&val, size, *ppos, true);
+
+		if (ret <= 0)
+			return ret;
+
+		*ppos += size;
+		buf += size;
+		count -= size;
+		done += size;
+	}
+	return done;
+}
+
+static int nvidia_vgpu_vfio_mmap(struct vfio_device *vdev,
+				 struct vm_area_struct *vma)
+{
+	struct vfio_pci_core_device *core_dev = vdev_to_core_dev(vdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct pci_dev *pdev = core_dev->pdev;
+	u64 phys_len, req_len, pgoff, req_start;
+	unsigned int index;
+
+	if (WARN_ON(!nvdev->vgpu || !nvdev->vdev_is_opened))
+		return -ENODEV;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (index >= VFIO_PCI_BAR5_REGION_INDEX)
+		return -EINVAL;
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_len == 0)
+		return -EINVAL;
+
+	if ((req_start + req_len > phys_len) || phys_len == 0)
+		return -EINVAL;
+
+	vma->vm_private_data = vdev;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+
+	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, req_len, vma->vm_page_prot);
+}
+
+static const struct vfio_device_ops nvidia_vgpu_vfio_ops = {
+	.name           = "nvidia-vgpu-vfio-pci",
+	.init		= vfio_pci_core_init_dev,
+	.release	= vfio_pci_core_release_dev,
+	.open_device    = nvidia_vgpu_vfio_open_device,
+	.close_device   = nvidia_vgpu_vfio_close_device,
+	.ioctl          = nvidia_vgpu_vfio_ioctl,
+	.device_feature = vfio_pci_core_ioctl_feature,
+	.read           = nvidia_vgpu_vfio_read,
+	.write          = nvidia_vgpu_vfio_write,
+	.mmap           = nvidia_vgpu_vfio_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+	.bind_iommufd	= vfio_iommufd_physical_bind,
+	.unbind_iommufd	= vfio_iommufd_physical_unbind,
+	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
+};
+
+static void clean_nvdev_unbound(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+
+	/* driver unbound path is called from the event chain. */
+	lockdep_assert_held(&vgpu_mgr->pf_driver_event_chain.lock);
+	list_del_init(&nvdev->pf_driver_event_listener.list);
+
+	nvidia_vgpu_vfio_clean_sysfs(nvdev);
+
+	nvidia_vgpu_mgr_release(nvdev->vgpu_mgr);
+	nvdev->vgpu_mgr = NULL;
+	nvdev->vgpu_type = NULL;
+}
+
+static void handle_driver_unbound(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct task_struct *task;
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+
+	if (nvdev->driver_is_unbound) {
+		mutex_unlock(&nvdev->vfio_vgpu_lock);
+		return;
+	}
+
+	nvdev->driver_is_unbound = true;
+
+	if (nvdev->vdev_is_opened) {
+		task = get_pid_task(nvdev->task_pid, PIDTYPE_PID);
+		if (!task) {
+			mutex_unlock(&nvdev->vfio_vgpu_lock);
+			return;
+		}
+
+		nvdev_debug(nvdev, "Killing client pid: %u\n", pid_nr(nvdev->task_pid));
+
+		send_sig(SIGTERM, task, 1);
+		put_task_struct(task);
+
+		mutex_unlock(&nvdev->vfio_vgpu_lock);
+
+		wait_for_completion(&nvdev->vdev_closing_completion);
+	} else {
+		mutex_unlock(&nvdev->vfio_vgpu_lock);
+	}
+
+	clean_nvdev_unbound(nvdev);
+}
+
+static int handle_pf_driver_event(struct nvidia_vgpu_event_listener *self, unsigned int event,
+				  void *p)
+{
+	struct nvidia_vgpu_vfio *nvdev = container_of(self, struct nvidia_vgpu_vfio,
+			pf_driver_event_listener);
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+
+	switch (event) {
+	case NVIDIA_VGPU_PF_DRIVER_EVENT_DRIVER_UNBIND:
+		nvdev_debug(nvdev, "handle PF event driver unbind\n");
+
+		handle_driver_unbound(nvdev);
+		break;
+	case NVIDIA_VGPU_PF_DRIVER_EVENT_SRIOV_CONFIGURE:
+		int num_vfs = *(int *)p;
+
+		nvdev_debug(nvdev, "handle PF event SRIOV configure\n");
+
+		if (!num_vfs) {
+			handle_driver_unbound(nvdev);
+		} else {
+			/* convert num_vfs to max VF ID */
+			num_vfs--;
+			if (pci_iov_vf_id(pdev) > num_vfs)
+				handle_driver_unbound(nvdev);
+		}
+		break;
+	}
+	return 0;
+}
+
+static void register_pf_driver_event_listener(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+
+	nvdev->pf_driver_event_listener.func = handle_pf_driver_event;
+	INIT_LIST_HEAD(&nvdev->pf_driver_event_listener.list);
+
+	nvidia_vgpu_event_register_listener(&vgpu_mgr->pf_driver_event_chain,
+					    &nvdev->pf_driver_event_listener);
+}
+
+static void unregister_pf_driver_event_listener(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+
+	nvidia_vgpu_event_unregister_listener(&vgpu_mgr->pf_driver_event_chain,
+					      &nvdev->pf_driver_event_listener);
+}
+
+static void clean_nvdev(struct nvidia_vgpu_vfio *nvdev)
+{
+	if (nvdev->driver_is_unbound)
+		return;
+
+	unregister_pf_driver_event_listener(nvdev);
+	nvidia_vgpu_vfio_clean_sysfs(nvdev);
+
+	nvidia_vgpu_mgr_release(nvdev->vgpu_mgr);
+	nvdev->vgpu_mgr = NULL;
+	nvdev->vgpu_type = NULL;
+}
+
+static int setup_nvdev(void *priv, void *data)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = priv;
+	struct nvidia_vgpu_vfio *nvdev = data;
+	int ret;
+
+	mutex_init(&nvdev->vfio_vgpu_lock);
+	init_completion(&nvdev->vdev_closing_completion);
+
+	nvdev->vgpu_mgr = vgpu_mgr;
+
+	ret = nvidia_vgpu_vfio_setup_sysfs(nvdev);
+	if (ret)
+		return ret;
+
+	register_pf_driver_event_listener(nvdev);
+	return 0;
+}
+
+static int nvidia_vgpu_vfio_probe(struct pci_dev *pdev,
+				  const struct pci_device_id *id_table)
+{
+	struct nvidia_vgpu_vfio *nvdev;
+	int ret;
+
+	if (!pdev->is_virtfn)
+		return -EINVAL;
+
+	nvdev = vfio_alloc_device(nvidia_vgpu_vfio, core_dev.vdev,
+				  &pdev->dev, &nvidia_vgpu_vfio_ops);
+	if (IS_ERR(nvdev))
+		return PTR_ERR(nvdev);
+
+	ret = nvidia_vgpu_mgr_setup(pdev, setup_nvdev, nvdev);
+	if (ret)
+		goto err_setup_vgpu_mgr;
+
+	dev_set_drvdata(&pdev->dev, &nvdev->core_dev);
+
+	ret = vfio_pci_core_register_device(&nvdev->core_dev);
+	if (ret)
+		goto err_register_core_device;
+
+	return 0;
+
+err_register_core_device:
+	clean_nvdev(nvdev);
+err_setup_vgpu_mgr:
+	vfio_put_device(&nvdev->core_dev.vdev);
+	pci_err(pdev, "VF probe failed with ret: %d\n", ret);
+	return ret;
+}
+
+static void nvidia_vgpu_vfio_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_dev = dev_get_drvdata(&pdev->dev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+
+	WARN_ON(nvdev->vgpu || nvdev->vdev_is_opened);
+
+	vfio_pci_core_unregister_device(core_dev);
+	clean_nvdev(nvdev);
+	vfio_put_device(&core_dev->vdev);
+}
+
+struct pci_device_id nvidia_vgpu_vfio_table[] = {
+	{
+		.vendor      = PCI_VENDOR_ID_NVIDIA,
+		.device      = PCI_ANY_ID,
+		.subvendor   = PCI_ANY_ID,
+		.subdevice   = PCI_ANY_ID,
+		.class       = (PCI_CLASS_DISPLAY_3D << 8),
+		.class_mask  = ~0,
+	},
+	{ }
+};
+MODULE_DEVICE_TABLE(pci, nvidia_vgpu_vfio_table);
+
+struct pci_driver nvidia_vgpu_vfio_driver = {
+	.name               = "nvidia-vgpu-vfio",
+	.id_table           = nvidia_vgpu_vfio_table,
+	.probe              = nvidia_vgpu_vfio_probe,
+	.remove             = nvidia_vgpu_vfio_remove,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(nvidia_vgpu_vfio_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Vinay Kabra <vkabra@nvidia.com>");
+MODULE_AUTHOR("Kirti Wankhede <kwankhede@nvidia.com>");
+MODULE_AUTHOR("Zhi Wang <zhiw@nvidia.com>");
+MODULE_DESCRIPTION("NVIDIA vGPU VFIO Variant Driver - User Level driver for NVIDIA vGPU");
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c b/drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c
new file mode 100644
index 000000000000..271b330f15b1
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio_sysfs.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+#include <linux/types.h>
+
+#include "vfio.h"
+
+static struct nvidia_vgpu_type *find_vgpu_type(struct nvidia_vgpu_vfio *nvdev, u64 type_id)
+{
+	struct nvidia_vgpu_type *vgpu_type;
+	unsigned int i;
+
+	for (i = 0; i < nvdev->vgpu_mgr->num_vgpu_types; i++) {
+		vgpu_type = nvdev->vgpu_mgr->vgpu_types + i;
+		if (vgpu_type->vgpu_type == type_id)
+			return vgpu_type;
+	}
+	return NULL;
+}
+
+static ssize_t creatable_homogeneous_vgpu_types_show(struct nvidia_vgpu_vfio *nvdev, char *buf)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+	ssize_t ret = 0;
+	u64 i;
+
+	mutex_lock(&vgpu_mgr->curr_vgpu_type_lock);
+	/* No vGPU has been created. */
+	if (!vgpu_mgr->curr_vgpu_type) {
+		ret += sprintf(buf, "ID    : vGPU Name\n");
+
+		for (i = 0; i < vgpu_mgr->num_vgpu_types; i++) {
+			struct nvidia_vgpu_type *type = vgpu_mgr->vgpu_types + i;
+
+			ret += sprintf(buf + ret, "%-5d : %s\n", type->vgpu_type,
+				       type->vgpu_type_name);
+		}
+	} else {
+		struct nvidia_vgpu_type *type = vgpu_mgr->curr_vgpu_type;
+
+		/* There has been created vGPU(s). */
+		if (vgpu_mgr->num_instances < type->max_instance)
+			ret = sprintf(buf + ret, "%-5d : %s\n", type->vgpu_type,
+				      type->vgpu_type_name);
+	}
+	mutex_unlock(&vgpu_mgr->curr_vgpu_type_lock);
+	return ret;
+}
+
+static int create_homogeneous_instance(struct nvidia_vgpu_vfio *nvdev,
+				       struct nvidia_vgpu_type *type)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+	int ret = 0;
+
+	mutex_lock(&vgpu_mgr->curr_vgpu_type_lock);
+	if (!vgpu_mgr->curr_vgpu_type) {
+		vgpu_mgr->curr_vgpu_type = type;
+		vgpu_mgr->num_instances++;
+		nvdev->vgpu_type = type;
+	} else {
+		if (type != vgpu_mgr->curr_vgpu_type) {
+			ret = -EINVAL;
+		} else if (vgpu_mgr->num_instances >= vgpu_mgr->curr_vgpu_type->max_instance) {
+			ret = -ENOSPC;
+		} else {
+			vgpu_mgr->num_instances++;
+			nvdev->vgpu_type = type;
+		}
+	}
+	mutex_unlock(&vgpu_mgr->curr_vgpu_type_lock);
+	return ret;
+}
+
+static void destroy_homogeneous_instance(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+
+	if (!nvdev->vgpu_type)
+		return;
+
+	mutex_lock(&vgpu_mgr->curr_vgpu_type_lock);
+	if (vgpu_mgr->curr_vgpu_type) {
+		if (!--vgpu_mgr->num_instances)
+			vgpu_mgr->curr_vgpu_type = NULL;
+	}
+	nvdev->vgpu_type = NULL;
+	mutex_unlock(&vgpu_mgr->curr_vgpu_type_lock);
+}
+
+static ssize_t creatable_vgpu_types_show(struct device *dev, struct device_attribute *attr,
+					 char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *core_dev = pci_get_drvdata(pdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	ssize_t ret;
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+	if (nvdev->vgpu_type) {
+		mutex_unlock(&nvdev->vfio_vgpu_lock);
+		return 0;
+	}
+
+	ret = creatable_homogeneous_vgpu_types_show(nvdev, buf);
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+	return ret;
+}
+
+static DEVICE_ATTR_RO(creatable_vgpu_types);
+
+static ssize_t current_vgpu_type_store(struct device *dev, struct device_attribute *attr,
+				       const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *core_dev = pci_get_drvdata(pdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	struct nvidia_vgpu_type *type;
+	unsigned long vgpu_type_id = ~0;
+	int ret = 0;
+
+	ret = kstrtoul(buf, 10, &vgpu_type_id);
+	if (ret)
+		return ret;
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+
+	if (nvdev->vdev_is_opened) {
+		mutex_unlock(&nvdev->vfio_vgpu_lock);
+		return -EBUSY;
+	}
+
+	if (vgpu_type_id) {
+		type = find_vgpu_type(nvdev, vgpu_type_id);
+		if (!type) {
+			ret = -ENODEV;
+			goto out_unlock;
+		}
+		ret = create_homogeneous_instance(nvdev, type);
+	} else {
+		destroy_homogeneous_instance(nvdev);
+	}
+
+out_unlock:
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+	return ret ? ret : count;
+}
+
+static ssize_t current_vgpu_type_show(struct device *dev, struct device_attribute *attr,
+				      char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *core_dev = pci_get_drvdata(pdev);
+	struct nvidia_vgpu_vfio *nvdev = core_dev_to_nvdev(core_dev);
+	unsigned long type_id;
+
+	mutex_lock(&nvdev->vfio_vgpu_lock);
+
+	type_id = nvdev->vgpu_type ? nvdev->vgpu_type->vgpu_type : 0;
+
+	mutex_unlock(&nvdev->vfio_vgpu_lock);
+
+	return sprintf(buf, "%lu\n", type_id);
+}
+
+static DEVICE_ATTR_RW(current_vgpu_type);
+
+static struct attribute *vf_dev_attrs[] = {
+	&dev_attr_creatable_vgpu_types.attr,
+	&dev_attr_current_vgpu_type.attr,
+	NULL,
+};
+
+static const struct attribute_group vf_dev_group = {
+	.name  = "nvidia",
+	.attrs = vf_dev_attrs,
+};
+
+const struct attribute_group *vf_dev_groups[] = {
+	&vf_dev_group,
+	NULL,
+};
+
+int nvidia_vgpu_vfio_setup_sysfs(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+
+	if (WARN_ON(!pdev))
+		return -EINVAL;
+
+	return sysfs_create_groups(&pdev->dev.kobj, vf_dev_groups);
+}
+
+void nvidia_vgpu_vfio_clean_sysfs(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct pci_dev *pdev = nvdev->core_dev.pdev;
+
+	if (WARN_ON(!pdev))
+		return;
+
+	sysfs_remove_groups(&pdev->dev.kobj, vf_dev_groups);
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 9e8ea77bbcc5..72083d300b8a 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -9,6 +9,7 @@
 #include "vgpu_mgr.h"
 
 #include <nvrm/bootload.h>
+#include <nvrm/vgpu.h>
 
 static void unregister_vgpu(struct nvidia_vgpu *vgpu)
 {
@@ -361,7 +362,7 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	struct nvidia_vgpu_info *info = &vgpu->info;
 	int ret;
 
-	if (WARN_ON(!info->gfid || !info->dbdf || !info->vgpu_type))
+	if (WARN_ON(!info->gfid || !info->dbdf || !info->vgpu_type || !info->vm_pid))
 		return -EINVAL;
 
 	if (WARN_ON(!vgpu->vgpu_mgr || !vgpu->pdev))
@@ -372,8 +373,8 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 
 	vgpu->info = *info;
 
-	vgpu_debug(vgpu, "create vgpu %s on vgpu_mgr %px\n",
-		   info->vgpu_type->vgpu_type_name, vgpu->vgpu_mgr);
+	vgpu_debug(vgpu, "create vgpu %s on vgpu_mgr %px vm pid %u\n",
+		   info->vgpu_type->vgpu_type_name, vgpu->vgpu_mgr, info->vm_pid);
 
 	ret = register_vgpu(vgpu);
 	if (ret)
@@ -427,3 +428,49 @@ int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_create_vgpu);
+
+/**
+ * nvidia_vgpu_mgr_reset_vgpu - reset a vGPU instance
+ * @vgpu: the vGPU instance going to be reset.
+ *
+ * Returns: 0 on success, others on failure.
+ */
+int nvidia_vgpu_mgr_reset_vgpu(struct nvidia_vgpu *vgpu)
+{
+	int ret;
+
+	ret = nvidia_vgpu_rpc_call(vgpu, NV_VGPU_CPU_RPC_MSG_RESET, NULL, 0);
+	if (ret) {
+		vgpu_error(vgpu, "fail to reset vgpu ret %d\n", ret);
+		return ret;
+	}
+
+	vgpu_debug(vgpu, "reset done\n");
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_reset_vgpu);
+
+static int update_bme_state(struct nvidia_vgpu *vgpu, bool enable)
+{
+	NV_VGPU_CPU_RPC_DATA_UPDATE_BME_STATE params = {0};
+
+	params.enable = enable;
+
+	return nvidia_vgpu_rpc_call(vgpu, NV_VGPU_CPU_RPC_MSG_UPDATE_BME_STATE,
+				    &params, sizeof(params));
+}
+
+/**
+ * nvidia_vgpu_set_bme - handle BME sequence
+ * @vgpu: the vGPU instance
+ * @enable: BME enable/disable
+ *
+ * Returns: 0 on success, others on failure.
+ */
+int nvidia_vgpu_mgr_set_bme(struct nvidia_vgpu *vgpu, bool enable)
+{
+	vgpu_debug(vgpu, "set bme, enable %d\n", enable);
+
+	return update_bme_state(vgpu, enable);
+}
+EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_set_bme);
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index 6f53bd7ca940..e502a37468e3 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -103,10 +103,32 @@ static struct nvidia_vgpu_mgr *alloc_vgpu_mgr(struct nvidia_vgpu_mgr_handle *han
 	mutex_init(&vgpu_mgr->vgpu_list_lock);
 	INIT_LIST_HEAD(&vgpu_mgr->vgpu_list_head);
 	atomic_set(&vgpu_mgr->num_vgpus, 0);
+	mutex_init(&vgpu_mgr->curr_vgpu_type_lock);
+	nvidia_vgpu_event_init_chain(&vgpu_mgr->pf_driver_event_chain);
 
 	return vgpu_mgr;
 }
 
+static int call_chain(struct nvidia_vgpu_event_chain *chain, unsigned int event, void *data)
+{
+	struct nvidia_vgpu_event_listener *l;
+	struct list_head *pos, *temp;
+	int ret = 0;
+
+	mutex_lock(&chain->lock);
+
+	list_for_each_safe(pos, temp, &chain->head) {
+		l = container_of(pos, struct nvidia_vgpu_event_listener, list);
+		ret = l->func(l, event, data);
+		if (ret)
+			goto out_unlock;
+	}
+
+out_unlock:
+	mutex_unlock(&chain->lock);
+	return ret;
+}
+
 static const char *pf_events_string[NVIDIA_VGPU_PF_EVENT_MAX] = {
 	[NVIDIA_VGPU_PF_DRIVER_EVENT_SRIOV_CONFIGURE] = "SRIOV configure",
 	[NVIDIA_VGPU_PF_DRIVER_EVENT_DRIVER_UNBIND] = "driver unbind",
@@ -115,14 +137,20 @@ static const char *pf_events_string[NVIDIA_VGPU_PF_EVENT_MAX] = {
 static int pf_event_notify_fn(void *priv, unsigned int event, void *data)
 {
 	struct nvidia_vgpu_mgr *vgpu_mgr = priv;
+	int ret = 0;
 
 	if (WARN_ON(event >= NVIDIA_VGPU_PF_EVENT_MAX))
 		return -EINVAL;
 
 	vgpu_mgr_debug(vgpu_mgr, "handle PF event %s\n", pf_events_string[event]);
 
-	/* more to come. */
-	return 0;
+	switch (event) {
+	case NVIDIA_VGPU_PF_DRIVER_EVENT_START...NVIDIA_VGPU_PF_DRIVER_EVENT_END:
+		ret = call_chain(&vgpu_mgr->pf_driver_event_chain, event, data);
+		break;
+	}
+
+	return ret;
 }
 
 static void attach_vgpu_mgr(struct nvidia_vgpu_mgr *vgpu_mgr,
@@ -378,3 +406,39 @@ int nvidia_vgpu_mgr_setup(struct pci_dev *dev, int (*init_vfio_fn)(void *priv, v
 	return nvidia_vgpu_mgr_attach_handle(&handle, &attach_handle_data);
 }
 EXPORT_SYMBOL(nvidia_vgpu_mgr_setup);
+
+/**
+ * nvidia_vgpu_event_init_chain - initialize an event chain
+ * @chain: the even chain.
+ */
+void nvidia_vgpu_event_init_chain(struct nvidia_vgpu_event_chain *chain)
+{
+	mutex_init(&chain->lock);
+	INIT_LIST_HEAD(&chain->head);
+}
+
+/**
+ * nvidia_vgpu_event_register_listener - register an event listener
+ * @chain: the event chain.
+ * @l: the listener.
+ */
+void nvidia_vgpu_event_register_listener(struct nvidia_vgpu_event_chain *chain,
+					 struct nvidia_vgpu_event_listener *l)
+{
+	mutex_lock(&chain->lock);
+	list_add_tail(&l->list, &chain->head);
+	mutex_unlock(&chain->lock);
+}
+
+/**
+ * nvidia_vgpu_event_unregister_listener - unregister an event listener
+ * @chain: the event chain.
+ * @l: the listener.
+ */
+void nvidia_vgpu_event_unregister_listener(struct nvidia_vgpu_event_chain *chain,
+					   struct nvidia_vgpu_event_listener *l)
+{
+	mutex_lock(&chain->lock);
+	list_del_init(&l->list);
+	mutex_unlock(&chain->lock);
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index fe475f8b2882..dc782f825f2b 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -101,6 +101,17 @@ struct nvidia_vgpu {
 	struct nvidia_vgpu_rpc rpc;
 };
 
+struct nvidia_vgpu_event_listener {
+	int (*func)(struct nvidia_vgpu_event_listener *self, unsigned int event, void *data);
+	struct list_head list;
+};
+
+struct nvidia_vgpu_event_chain {
+	/* lock for PF event listener list */
+	struct mutex lock;
+	struct list_head head;
+};
+
 /**
  * struct nvidia_vgpu_mgr - the vGPU manager
  *
@@ -125,6 +136,10 @@ struct nvidia_vgpu {
  * @num_vgpu_types: number of installed vGPU types
  * @use_alloc_bitmap: use chid allocator for the PF driver doesn't support chid allocation
  * @chid_alloc_bitmap: chid allocator bitmap
+ * @curr_vgpu_lock: lock to protect curr_vgpu_type
+ * @curr_vgpu_type: type of current created vgpu in homogeneous mode
+ * @num_instances: number of created vGPU with curr_vgpu_type in homogeneous mode
+ * @pf_driver_event_chain: PF driver event chain
  * @pdev: the PCI device pointer
  * @bar0_vaddr: the virtual address of BAR0
  */
@@ -163,6 +178,13 @@ struct nvidia_vgpu_mgr {
 	bool use_chid_alloc_bitmap;
 	void *chid_alloc_bitmap;
 
+	/* lock for current vGPU type */
+	struct mutex curr_vgpu_type_lock;
+	struct nvidia_vgpu_type *curr_vgpu_type;
+	unsigned int num_instances;
+
+	struct nvidia_vgpu_event_chain pf_driver_event_chain;
+
 	struct pci_dev *pdev;
 	void __iomem *bar0_vaddr;
 };
@@ -173,14 +195,21 @@ struct nvidia_vgpu_mgr {
 int nvidia_vgpu_mgr_setup(struct pci_dev *dev, int (*init_vfio_fn)(void *priv, void *data),
 			  void *init_vfio_fn_data);
 void nvidia_vgpu_mgr_release(struct nvidia_vgpu_mgr *vgpu_mgr);
+void nvidia_vgpu_event_init_chain(struct nvidia_vgpu_event_chain *chain);
+void nvidia_vgpu_event_register_listener(struct nvidia_vgpu_event_chain *chain,
+					 struct nvidia_vgpu_event_listener *l);
+void nvidia_vgpu_event_unregister_listener(struct nvidia_vgpu_event_chain *chain,
+					   struct nvidia_vgpu_event_listener *l);
 
 int nvidia_vgpu_mgr_destroy_vgpu(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_create_vgpu(struct nvidia_vgpu *vgpu);
+int nvidia_vgpu_mgr_reset_vgpu(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_setup_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
 void nvidia_vgpu_mgr_clean_metadata(struct nvidia_vgpu_mgr *vgpu_mgr);
 int nvidia_vgpu_rpc_call(struct nvidia_vgpu *vgpu, u32 msg_type,
 			 void *data, u64 size);
 void nvidia_vgpu_clean_rpc(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_setup_rpc(struct nvidia_vgpu *vgpu);
+int nvidia_vgpu_mgr_set_bme(struct nvidia_vgpu *vgpu, bool enable);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 12/14] vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (10 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 11/14] vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 13/14] vfio/nvidia-vgpu: introduce vGPU logging Zhi Wang
  2025-09-03 22:11 ` [RFC v2 14/14] vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU Zhi Wang
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

Before reassigning a vGPU to a new guest, its associated FB memory must be
scrubbed to prevent potential information leakage across users.

Residual data left in the FB memory could be visible to the subsequent
guest, posing a significant security risk without the scrubbing.

Scrub the FB memory by issusing copy engine workloads when the user opening
and closing the VFIO device.

Cc: Aniket Agashe <aniketa@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/pf.h       |  23 +++
 drivers/vfio/pci/nvidia-vgpu/vgpu.c     | 218 +++++++++++++++++++++++-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c |   6 +
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h |  24 ++-
 4 files changed, 264 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/nvidia-vgpu/pf.h b/drivers/vfio/pci/nvidia-vgpu/pf.h
index d081d8e718e1..d9daaace7d31 100644
--- a/drivers/vfio/pci/nvidia-vgpu/pf.h
+++ b/drivers/vfio/pci/nvidia-vgpu/pf.h
@@ -119,4 +119,27 @@ static inline int nvidia_vgpu_mgr_init_handle(struct pci_dev *pdev,
 	__m->handle.ops->get_engine_bitmap(__m->handle.pf_drvdata, bitmap); \
 })
 
+#define nvidia_vgpu_mgr_channel_map_mem(m, chan, mem, info) \
+	((m)->handle.ops->channel_map_mem(chan, mem, info))
+
+#define nvidia_vgpu_mgr_channel_unmap_mem(m, mem) \
+	((m)->handle.ops->channel_unmap_mem(mem))
+
+#define nvidia_vgpu_mgr_alloc_ce_channel(m, chid) ({ \
+	typeof(m) __m = (m); \
+	__m->handle.ops->alloc_ce_channel(__m->handle.pf_drvdata, chid); \
+})
+
+#define nvidia_vgpu_mgr_free_ce_channel(m, chan) \
+	((m)->handle.ops->free_ce_channel(chan))
+
+#define nvidia_vgpu_mgr_begin_pushbuf(m, chan, num_dwords) \
+	((m)->handle.ops->begin_pushbuf(chan, num_dwords))
+
+#define nvidia_vgpu_mgr_emit_pushbuf(m, chan, dwords) \
+	((m)->handle.ops->emit_pushbuf(chan, dwords))
+
+#define nvidia_vgpu_mgr_submit_pushbuf(m, chan) \
+	((m)->handle.ops->submit_pushbuf(chan))
+
 #endif
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu.c b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
index 72083d300b8a..52b01efdf133 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu.c
@@ -2,7 +2,6 @@
 /*
  * Copyright © 2025 NVIDIA Corporation
  */
-
 #include <linux/log2.h>
 
 #include "debug.h"
@@ -111,6 +110,209 @@ static int setup_chids(struct nvidia_vgpu *vgpu)
 	return 0;
 }
 
+static void clean_ce_channel(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_ce_channel *channel = &vgpu->ce_channel;
+
+	nvidia_vgpu_event_unregister_listener(&vgpu_mgr->pf_channel_event_chain,
+					      &channel->listener);
+
+	nvidia_vgpu_mgr_channel_unmap_mem(vgpu_mgr, channel->sema_mem);
+	nvidia_vgpu_mgr_bar1_unmap_mem(vgpu_mgr, channel->sema_mem);
+	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, channel->sema_mem);
+	nvidia_vgpu_mgr_free_ce_channel(vgpu_mgr, channel->chan);
+	channel->chan = NULL;
+	channel->sema_mem = NULL;
+}
+
+static int handle_channel_events(struct nvidia_vgpu_event_listener *self, unsigned int event,
+				 void *data)
+{
+	struct nvidia_vgpu_ce_channel *channel = container_of(self, typeof(*channel), listener);
+	struct nvidia_vgpu *vgpu = container_of(channel, typeof(*vgpu), ce_channel);
+
+	if (data != channel->chan)
+		return 0;
+
+	switch (event) {
+	case NVIDIA_VGPU_PF_CHANNEL_EVENT_FIFO_NONSTALL:
+		vgpu_debug(vgpu, "handle channel event fifo nonstall\n");
+
+		wake_up(&channel->wq);
+		break;
+	}
+	return 0;
+}
+
+static int setup_ce_channel(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_ce_channel *channel = &vgpu->ce_channel;
+	struct nvidia_vgpu_chid *chid = &vgpu->chid;
+	struct nvidia_vgpu_alloc_fbmem_info alloc_info = {0};
+	struct nvidia_vgpu_map_mem_info map_info = {0};
+	struct nvidia_vgpu_chan *chan;
+	struct nvidia_vgpu_mem *mem;
+	int ret;
+
+	chan = nvidia_vgpu_mgr_alloc_ce_channel(vgpu_mgr, chid->chid_offset + chid->num_chid - 1);
+	if (IS_ERR(chan))
+		return PTR_ERR(chan);
+
+	/* Allocate a page for semaphore */
+	alloc_info.size = SZ_4K;
+
+	mem = nvidia_vgpu_mgr_alloc_fbmem(vgpu_mgr, &alloc_info);
+	if (IS_ERR(mem))
+		goto err_alloc_fbmem;
+
+	map_info.map_size = SZ_4K;
+
+	ret = nvidia_vgpu_mgr_channel_map_mem(vgpu_mgr, chan, mem, &map_info);
+	if (ret)
+		goto err_chan_map_mem;
+
+	ret = nvidia_vgpu_mgr_bar1_map_mem(vgpu_mgr, mem, &map_info);
+	if (ret)
+		goto err_bar1_map_mem;
+
+	channel->chan = chan;
+	channel->sema_mem = mem;
+
+	init_waitqueue_head(&channel->wq);
+
+	INIT_LIST_HEAD(&channel->listener.list);
+	channel->listener.func = handle_channel_events;
+
+	nvidia_vgpu_event_register_listener(&vgpu_mgr->pf_channel_event_chain,
+					    &channel->listener);
+
+	return 0;
+
+err_bar1_map_mem:
+	nvidia_vgpu_mgr_channel_unmap_mem(vgpu_mgr, mem);
+err_chan_map_mem:
+	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, mem);
+err_alloc_fbmem:
+	nvidia_vgpu_mgr_free_ce_channel(vgpu_mgr, chan);
+	return ret;
+}
+
+static bool ce_workload_complete(struct nvidia_vgpu_ce_channel *channel)
+{
+	return !!READ_ONCE(*(u32 *)(channel->sema_mem->bar1_vaddr));
+}
+
+#define VGPU_SCRUBBER_LINE_LENGTH_MAX 0x80000000
+#define FBMEM_SCRUB_TIMEOUT_MS (4000)
+
+static int scrub_fbmem_heap(struct nvidia_vgpu *vgpu)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
+	struct nvidia_vgpu_ce_channel *channel;
+	struct nvidia_vgpu_chan *chan;
+	struct nvidia_vgpu_mem *mem = vgpu->fbmem_heap;
+	struct nvidia_vgpu_map_mem_info map_info = {0};
+	u64 line_length = mem->size;
+	u32 line_count = 1;
+	int ret;
+	int i;
+
+	if (WARN_ON(!vgpu_mgr->use_ce_scrub_fbmem))
+		return 0;
+
+	ret = setup_ce_channel(vgpu);
+	if (ret)
+		return ret;
+
+	channel = &vgpu->ce_channel;
+	chan = channel->chan;
+
+	map_info.compressible_disable_plc = true;
+	map_info.huge_page = true;
+	map_info.map_size = mem->size;
+
+	ret = nvidia_vgpu_mgr_channel_map_mem(vgpu_mgr, chan, mem, &map_info);
+	if (ret)
+		goto err_chan_map_mem;
+
+	vgpu_debug(vgpu, "guest FB memory chan vma 0x%llx\n", mem->chan_vma_addr);
+
+	while (line_length > VGPU_SCRUBBER_LINE_LENGTH_MAX) {
+		line_count = line_count << 1;
+		line_length = line_length >> 1;
+	}
+
+	*(u32 *)(channel->sema_mem->bar1_vaddr) = 0;
+
+	vgpu_debug(vgpu, "semaphore seqno before scrubbing 0x%x\n",
+		   *(u32 *)(channel->sema_mem->bar1_vaddr));
+
+	nvidia_vgpu_mgr_begin_pushbuf(vgpu_mgr, chan, 150);
+
+#define EMIT_DWORD(x) \
+	nvidia_vgpu_mgr_emit_pushbuf(vgpu_mgr, chan, x)
+
+	for (i = 0; i < 128; i += 4)
+		EMIT_DWORD(0x0);
+
+	EMIT_DWORD(0x20010000);
+	EMIT_DWORD(chan->ce_object_handle);
+	EMIT_DWORD(0x200181c2);
+	EMIT_DWORD(0x30004);
+
+	EMIT_DWORD(0x200181c0);
+	EMIT_DWORD(0x0);
+
+	EMIT_DWORD(0x20048104);
+	EMIT_DWORD(lower_32_bits(line_length));
+	EMIT_DWORD(lower_32_bits(line_length));
+	EMIT_DWORD(lower_32_bits(line_length >> 2));
+	EMIT_DWORD(line_count);
+
+	EMIT_DWORD(0x20028102);
+	EMIT_DWORD(upper_32_bits(vgpu->fbmem_heap->chan_vma_addr));
+	EMIT_DWORD(lower_32_bits(vgpu->fbmem_heap->chan_vma_addr));
+
+	EMIT_DWORD(0x200180c0);
+	EMIT_DWORD(0x785);
+
+	EMIT_DWORD(0x20038090);
+	EMIT_DWORD(upper_32_bits(vgpu->ce_channel.sema_mem->chan_vma_addr));
+	EMIT_DWORD(lower_32_bits(vgpu->ce_channel.sema_mem->chan_vma_addr));
+	EMIT_DWORD(0xdeadbeef);
+
+	EMIT_DWORD(0x200180c0);
+	EMIT_DWORD(0x5cc);
+
+#undef EMIT_DWORD
+
+	nvidia_vgpu_mgr_submit_pushbuf(vgpu_mgr, chan);
+
+	if (!wait_event_timeout(channel->wq, ce_workload_complete(channel),
+				msecs_to_jiffies(FBMEM_SCRUB_TIMEOUT_MS))) {
+		vgpu_debug(vgpu, "fail to wait for CE workload complete\n");
+
+		ret = -ETIMEDOUT;
+		goto err_pushbuf;
+	}
+
+	vgpu_debug(vgpu, "semaphore seqno after scrubbing 0x%x\n",
+		   *(u32 *)(channel->sema_mem->bar1_vaddr));
+
+	nvidia_vgpu_mgr_channel_unmap_mem(vgpu_mgr, mem);
+	clean_ce_channel(vgpu);
+
+	return 0;
+
+err_pushbuf:
+	nvidia_vgpu_mgr_channel_unmap_mem(vgpu_mgr, mem);
+err_chan_map_mem:
+	clean_ce_channel(vgpu);
+	return ret;
+}
+
 static void clean_fbmem_heap(struct nvidia_vgpu *vgpu)
 {
 	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
@@ -118,6 +320,8 @@ static void clean_fbmem_heap(struct nvidia_vgpu *vgpu)
 	vgpu_debug(vgpu, "free guest FB memory, offset 0x%llx size 0x%llx\n",
 		   vgpu->fbmem_heap->addr, vgpu->fbmem_heap->size);
 
+	if (vgpu_mgr->use_ce_scrub_fbmem)
+		WARN_ON(scrub_fbmem_heap(vgpu));
 	nvidia_vgpu_mgr_free_fbmem(vgpu_mgr, vgpu->fbmem_heap);
 	vgpu->fbmem_heap = NULL;
 }
@@ -172,7 +376,8 @@ static int setup_fbmem_heap(struct nvidia_vgpu *vgpu)
 	vgpu_debug(vgpu, "guest FB memory offset 0x%llx size 0x%llx\n", mem->addr, mem->size);
 
 	vgpu->fbmem_heap = mem;
-	return 0;
+
+	return vgpu_mgr->use_ce_scrub_fbmem ? scrub_fbmem_heap(vgpu) : 0;
 }
 
 static void clean_mgmt_heap(struct nvidia_vgpu *vgpu)
@@ -437,6 +642,7 @@ EXPORT_SYMBOL_GPL(nvidia_vgpu_mgr_create_vgpu);
  */
 int nvidia_vgpu_mgr_reset_vgpu(struct nvidia_vgpu *vgpu)
 {
+	struct nvidia_vgpu_mgr *vgpu_mgr = vgpu->vgpu_mgr;
 	int ret;
 
 	ret = nvidia_vgpu_rpc_call(vgpu, NV_VGPU_CPU_RPC_MSG_RESET, NULL, 0);
@@ -445,6 +651,14 @@ int nvidia_vgpu_mgr_reset_vgpu(struct nvidia_vgpu *vgpu)
 		return ret;
 	}
 
+	if (vgpu_mgr->use_ce_scrub_fbmem) {
+		ret = scrub_fbmem_heap(vgpu);
+		if (ret) {
+			vgpu_error(vgpu, "fail to scrub the fbmem %d\n", ret);
+			return ret;
+		}
+	}
+
 	vgpu_debug(vgpu, "reset done\n");
 	return 0;
 }
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
index e502a37468e3..79b8d4b917f7 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.c
@@ -105,6 +105,7 @@ static struct nvidia_vgpu_mgr *alloc_vgpu_mgr(struct nvidia_vgpu_mgr_handle *han
 	atomic_set(&vgpu_mgr->num_vgpus, 0);
 	mutex_init(&vgpu_mgr->curr_vgpu_type_lock);
 	nvidia_vgpu_event_init_chain(&vgpu_mgr->pf_driver_event_chain);
+	nvidia_vgpu_event_init_chain(&vgpu_mgr->pf_channel_event_chain);
 
 	return vgpu_mgr;
 }
@@ -132,6 +133,7 @@ static int call_chain(struct nvidia_vgpu_event_chain *chain, unsigned int event,
 static const char *pf_events_string[NVIDIA_VGPU_PF_EVENT_MAX] = {
 	[NVIDIA_VGPU_PF_DRIVER_EVENT_SRIOV_CONFIGURE] = "SRIOV configure",
 	[NVIDIA_VGPU_PF_DRIVER_EVENT_DRIVER_UNBIND] = "driver unbind",
+	[NVIDIA_VGPU_PF_CHANNEL_EVENT_FIFO_NONSTALL] = "FIFO nonstall",
 };
 
 static int pf_event_notify_fn(void *priv, unsigned int event, void *data)
@@ -148,6 +150,9 @@ static int pf_event_notify_fn(void *priv, unsigned int event, void *data)
 	case NVIDIA_VGPU_PF_DRIVER_EVENT_START...NVIDIA_VGPU_PF_DRIVER_EVENT_END:
 		ret = call_chain(&vgpu_mgr->pf_driver_event_chain, event, data);
 		break;
+	case NVIDIA_VGPU_PF_CHANNEL_EVENT_START...NVIDIA_VGPU_PF_CHANNEL_EVENT_END:
+		ret = call_chain(&vgpu_mgr->pf_channel_event_chain, event, data);
+		break;
 	}
 
 	return ret;
@@ -300,6 +305,7 @@ static int setup_pf_driver_caps(struct nvidia_vgpu_mgr *vgpu_mgr, unsigned long
 	test_bit(NVIDIA_VGPU_PF_DRIVER_CAP_HAS_##cap, caps)
 
 	vgpu_mgr->use_chid_alloc_bitmap = !HAS_CAP(CHID_ALLOC);
+	vgpu_mgr->use_ce_scrub_fbmem = HAS_CAP(CE_CHAN_ALLOC) | HAS_CAP(PUSHBUF_SUBMIT);
 
 #undef HAS_CAP
 	return 0;
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index dc782f825f2b..b5bcde555a5d 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -69,6 +69,18 @@ struct nvidia_vgpu_rpc {
 	void __iomem *error_buf;
 };
 
+struct nvidia_vgpu_event_listener {
+	int (*func)(struct nvidia_vgpu_event_listener *self, unsigned int event, void *data);
+	struct list_head list;
+};
+
+struct nvidia_vgpu_ce_channel {
+	struct nvidia_vgpu_chan *chan;
+	struct nvidia_vgpu_mem *sema_mem;
+	struct nvidia_vgpu_event_listener listener;
+	struct wait_queue_head wq;
+};
+
 /**
  * struct nvidia_vgpu - per-vGPU state
  *
@@ -83,6 +95,7 @@ struct nvidia_vgpu_rpc {
  * @fbmem_heap: allocated FB memory for the vGPU
  * @mgmt: vGPU mgmt heap
  * @rpc: vGPU host RPC
+ * @ce_channel: copy engine channel
  */
 struct nvidia_vgpu {
 	/* Per-vGPU lock */
@@ -99,11 +112,7 @@ struct nvidia_vgpu {
 	struct nvidia_vgpu_mem *fbmem_heap;
 	struct nvidia_vgpu_mgmt mgmt;
 	struct nvidia_vgpu_rpc rpc;
-};
-
-struct nvidia_vgpu_event_listener {
-	int (*func)(struct nvidia_vgpu_event_listener *self, unsigned int event, void *data);
-	struct list_head list;
+	struct nvidia_vgpu_ce_channel ce_channel;
 };
 
 struct nvidia_vgpu_event_chain {
@@ -140,8 +149,10 @@ struct nvidia_vgpu_event_chain {
  * @curr_vgpu_type: type of current created vgpu in homogeneous mode
  * @num_instances: number of created vGPU with curr_vgpu_type in homogeneous mode
  * @pf_driver_event_chain: PF driver event chain
+ * @pf_channel_event_chain: PF channel event chain
  * @pdev: the PCI device pointer
  * @bar0_vaddr: the virtual address of BAR0
+ * @use_ce_scrub_fbmem: scrub the FB memory if the PF driver supports.
  */
 struct nvidia_vgpu_mgr {
 	struct kref refcount;
@@ -184,9 +195,12 @@ struct nvidia_vgpu_mgr {
 	unsigned int num_instances;
 
 	struct nvidia_vgpu_event_chain pf_driver_event_chain;
+	struct nvidia_vgpu_event_chain pf_channel_event_chain;
 
 	struct pci_dev *pdev;
 	void __iomem *bar0_vaddr;
+
+	bool use_ce_scrub_fbmem;
 };
 
 #define nvidia_vgpu_mgr_for_each_vgpu(vgpu, vgpu_mgr) \
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 13/14] vfio/nvidia-vgpu: introduce vGPU logging
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (11 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 12/14] vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  2025-09-03 22:11 ` [RFC v2 14/14] vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU Zhi Wang
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

The GSP firmware provides several per-vGPU logging buffers to help on
debugging bugs.

Export those buffers to userspace. Thus, the user can attach them when
reporting bugs.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/nvidia-vgpu/Makefile       |   4 +-
 drivers/vfio/pci/nvidia-vgpu/debugfs.c      |  65 +++++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio.h         |  16 +++
 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c | 117 ++++++++++++++++++++
 drivers/vfio/pci/nvidia-vgpu/vfio_main.c    |  44 +++++++-
 drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h     |   2 +
 6 files changed, 245 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/debugfs.c
 create mode 100644 drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c

diff --git a/drivers/vfio/pci/nvidia-vgpu/Makefile b/drivers/vfio/pci/nvidia-vgpu/Makefile
index 2aba9b4868aa..615712b40128 100644
--- a/drivers/vfio/pci/nvidia-vgpu/Makefile
+++ b/drivers/vfio/pci/nvidia-vgpu/Makefile
@@ -2,5 +2,5 @@
 subdir-ccflags-y += -I$(src)/include
 
 obj-$(CONFIG_NVIDIA_VGPU_VFIO_PCI) += nvidia_vgpu_vfio_pci.o
-nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o rpc.o \
-			  vfio_main.o vfio_access.o vfio_sysfs.o
+nvidia_vgpu_vfio_pci-y := vgpu_mgr.o vgpu.o metadata.o metadata_vgpu_type.o rpc.o debugfs.o\
+			  vfio_main.o vfio_access.o vfio_sysfs.o vfio_debugfs.o
diff --git a/drivers/vfio/pci/nvidia-vgpu/debugfs.c b/drivers/vfio/pci/nvidia-vgpu/debugfs.c
new file mode 100644
index 000000000000..e6cdf44cd846
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/debugfs.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/debugfs.h>
+
+#include "vgpu_mgr.h"
+
+struct debugfs_root {
+	/* mutex to protect the debugfs_root */
+	struct mutex mutex;
+	struct kref refcount;
+	struct dentry *root;
+};
+
+struct debugfs_root debugfs_root = {
+	.mutex = __MUTEX_INITIALIZER(debugfs_root.mutex),
+};
+
+struct dentry *nvidia_vgpu_get_debugfs_root(void)
+{
+	struct debugfs_root *root = &debugfs_root;
+	struct dentry *dentry;
+
+	mutex_lock(&root->mutex);
+	if (root->root) {
+		kref_get(&root->refcount);
+		dentry = root->root;
+		goto out_unlock;
+	}
+
+	dentry = debugfs_create_dir("nvidia-vgpu", NULL);
+	if (IS_ERR(dentry))
+		goto out_unlock;
+
+	kref_init(&root->refcount);
+	root->root = dentry;
+
+out_unlock:
+	mutex_unlock(&root->mutex);
+	return dentry;
+}
+
+static void debugfs_root_release(struct kref *kref)
+{
+	struct debugfs_root *root = container_of(kref, struct debugfs_root, refcount);
+
+	debugfs_remove(root->root);
+	root->root = NULL;
+}
+
+void nvidia_vgpu_put_debugfs_root(void)
+{
+	struct debugfs_root *root = &debugfs_root;
+
+	mutex_lock(&root->mutex);
+	if (WARN_ON(!root->root))
+		goto out_unlock;
+
+	kref_put(&root->refcount, debugfs_root_release);
+
+out_unlock:
+	mutex_unlock(&root->mutex);
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio.h b/drivers/vfio/pci/nvidia-vgpu/vfio.h
index 4c9bf9c80f5c..8edc8cd6c6dc 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vfio.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio.h
@@ -6,6 +6,7 @@
 #ifndef _NVIDIA_VGPU_VFIO_H__
 #define _NVIDIA_VGPU_VFIO_H__
 
+#include <linux/debugfs.h>
 #include <linux/vfio_pci_core.h>
 
 #include "vgpu_mgr.h"
@@ -15,6 +16,12 @@
 #define CAP_LIST_NEXT_PTR_MSIX 0x7c
 #define MSIX_CAP_SIZE   0xc
 
+struct nvidia_vgpu_vfio_log {
+	struct debugfs_blob_wrapper blob;
+	void *mem;
+	struct dentry *dentry;
+};
+
 struct nvidia_vgpu_vfio {
 	struct vfio_pci_core_device core_dev;
 	u8 vconfig[PCI_CONFIG_SPACE_LENGTH];
@@ -32,6 +39,12 @@ struct nvidia_vgpu_vfio {
 	struct completion vdev_closing_completion;
 
 	struct nvidia_vgpu_event_listener pf_driver_event_listener;
+	struct nvidia_vgpu_event_listener pf_event_listener;
+
+	/* Logs */
+	struct nvidia_vgpu_vfio_log log_init_task;
+	struct nvidia_vgpu_vfio_log log_vgpu_task;
+	struct nvidia_vgpu_vfio_log log_kernel;
 };
 
 static inline struct nvidia_vgpu_vfio *core_dev_to_nvdev(struct vfio_pci_core_device *core_dev)
@@ -45,5 +58,8 @@ ssize_t nvidia_vgpu_vfio_access(struct nvidia_vgpu_vfio *nvdev, char __user *buf
 
 int nvidia_vgpu_vfio_setup_sysfs(struct nvidia_vgpu_vfio *nvdev);
 void nvidia_vgpu_vfio_clean_sysfs(struct nvidia_vgpu_vfio *nvdev);
+int nvidia_vgpu_vfio_setup_debugfs(struct nvidia_vgpu_vfio *nvdev);
+void nvidia_vgpu_vfio_clean_debugfs(struct nvidia_vgpu_vfio *nvdev);
+void nvidia_vgpu_vfio_update_logs(struct nvidia_vgpu_vfio *nvdev);
 
 #endif /* _NVIDIA_VGPU_VFIO_H__ */
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c b/drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c
new file mode 100644
index 000000000000..52a80928f74f
--- /dev/null
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio_debugfs.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2025 NVIDIA Corporation
+ */
+
+#include <linux/debugfs.h>
+
+#include "vfio.h"
+
+static void free_vgpu_log(struct nvidia_vgpu_vfio_log *log)
+{
+	debugfs_remove(log->dentry);
+	kvfree(log->mem);
+	log->mem = NULL;
+}
+
+static void clean_vgpu_logs(struct nvidia_vgpu_vfio *nvdev)
+{
+	free_vgpu_log(&nvdev->log_init_task);
+	free_vgpu_log(&nvdev->log_vgpu_task);
+	free_vgpu_log(&nvdev->log_kernel);
+}
+
+static int alloc_vgpu_log(struct nvidia_vgpu_vfio_log *log, struct device *dev,
+			  struct dentry *root, const char *name, u64 size)
+{
+	void *path = NULL;
+
+	path = kzalloc(PATH_MAX, GFP_KERNEL);
+	if (!path)
+		return -ENOMEM;
+
+	log->mem = kvzalloc(size, GFP_KERNEL);
+	if (!log->mem) {
+		kfree(log->mem);
+		return -ENOMEM;
+	}
+
+	log->blob.size = size;
+	log->blob.data = log->mem;
+
+	snprintf(path, PATH_MAX, "%s-%s", dev_name(dev), name);
+	log->dentry = debugfs_create_blob(path, 0400, root, &log->blob);
+
+	kfree(path);
+	path = NULL;
+
+	if (IS_ERR(log->dentry)) {
+		kfree(log->mem);
+		return PTR_ERR(log->dentry);
+	}
+	return 0;
+}
+
+static int setup_vgpu_logs(struct nvidia_vgpu_vfio *nvdev, struct dentry *root)
+{
+	struct nvidia_vgpu_mgr *vgpu_mgr = nvdev->vgpu_mgr;
+	struct device *dev = &nvdev->core_dev.pdev->dev;
+	int ret;
+
+	ret = alloc_vgpu_log(&nvdev->log_init_task, dev, root, "init_task_log",
+			     vgpu_mgr->init_task_log_size);
+	if (ret)
+		return ret;
+
+	ret = alloc_vgpu_log(&nvdev->log_vgpu_task, dev, root, "vgpu_task_log",
+			     vgpu_mgr->vgpu_task_log_size);
+	if (ret) {
+		free_vgpu_log(&nvdev->log_init_task);
+		return ret;
+	}
+
+	ret = alloc_vgpu_log(&nvdev->log_kernel, dev, root, "kernel_log",
+			     vgpu_mgr->kernel_log_size);
+	if (ret) {
+		free_vgpu_log(&nvdev->log_init_task);
+		free_vgpu_log(&nvdev->log_vgpu_task);
+		return ret;
+	}
+	return 0;
+}
+
+int nvidia_vgpu_vfio_setup_debugfs(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct dentry *root = nvidia_vgpu_get_debugfs_root();
+	int ret;
+
+	if (IS_ERR(root))
+		return PTR_ERR(root);
+
+	ret = setup_vgpu_logs(nvdev, root);
+	if (ret) {
+		nvidia_vgpu_put_debugfs_root();
+		return ret;
+	}
+
+	return 0;
+}
+
+void nvidia_vgpu_vfio_clean_debugfs(struct nvidia_vgpu_vfio *nvdev)
+{
+	clean_vgpu_logs(nvdev);
+	nvidia_vgpu_put_debugfs_root();
+}
+
+void nvidia_vgpu_vfio_update_logs(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu_vfio_log *logs[] = {
+		&nvdev->log_init_task,
+		&nvdev->log_vgpu_task,
+		&nvdev->log_kernel,
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(logs); i++)
+		memcpy(logs[i]->mem, logs[i]->blob.data, logs[i]->blob.size);
+}
diff --git a/drivers/vfio/pci/nvidia-vgpu/vfio_main.c b/drivers/vfio/pci/nvidia-vgpu/vfio_main.c
index b557062a4ac2..4a6d939046e0 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vfio_main.c
+++ b/drivers/vfio/pci/nvidia-vgpu/vfio_main.c
@@ -24,10 +24,41 @@ static int pdev_to_gfid(struct pci_dev *pdev)
 	return pci_iov_vf_id(pdev) + 1;
 }
 
+static void disable_vgpu_logs(struct nvidia_vgpu_vfio *nvdev)
+{
+	if (WARN_ON(!nvdev->vgpu))
+		return;
+
+	/* save the latest vGPU logs before disabling */
+	nvidia_vgpu_vfio_update_logs(nvdev);
+
+	nvdev->log_init_task.blob.data = nvdev->log_init_task.mem;
+	nvdev->log_vgpu_task.blob.data = nvdev->log_vgpu_task.mem;
+	nvdev->log_kernel.blob.data = nvdev->log_kernel.mem;
+}
+
+static void enable_vgpu_logs(struct nvidia_vgpu_vfio *nvdev)
+{
+	struct nvidia_vgpu *vgpu = nvdev->vgpu;
+	struct nvidia_vgpu_mgmt *mgmt = &vgpu->mgmt;
+
+	if (WARN_ON(!vgpu))
+		return;
+
+	nvdev->log_init_task.blob.data = mgmt->init_task_log_vaddr;
+	nvdev->log_vgpu_task.blob.data = mgmt->vgpu_task_log_vaddr;
+	nvdev->log_kernel.blob.data = mgmt->kernel_log_vaddr;
+
+	/* get the latest vGPU logs after enabling */
+	nvidia_vgpu_vfio_update_logs(nvdev);
+}
+
 static int destroy_vgpu(struct nvidia_vgpu_vfio *nvdev)
 {
 	int ret;
 
+	disable_vgpu_logs(nvdev);
+
 	ret = nvidia_vgpu_mgr_destroy_vgpu(nvdev->vgpu);
 	if (ret)
 		return ret;
@@ -68,6 +99,8 @@ static int create_vgpu(struct nvidia_vgpu_vfio *nvdev)
 	}
 
 	nvdev->vgpu = vgpu;
+
+	enable_vgpu_logs(nvdev);
 	return 0;
 }
 
@@ -582,11 +615,14 @@ static void unregister_pf_driver_event_listener(struct nvidia_vgpu_vfio *nvdev)
 
 static void clean_nvdev(struct nvidia_vgpu_vfio *nvdev)
 {
-	if (nvdev->driver_is_unbound)
+	if (nvdev->driver_is_unbound) {
+		nvidia_vgpu_vfio_clean_debugfs(nvdev);
 		return;
+	}
 
 	unregister_pf_driver_event_listener(nvdev);
 	nvidia_vgpu_vfio_clean_sysfs(nvdev);
+	nvidia_vgpu_vfio_clean_debugfs(nvdev);
 
 	nvidia_vgpu_mgr_release(nvdev->vgpu_mgr);
 	nvdev->vgpu_mgr = NULL;
@@ -608,6 +644,12 @@ static int setup_nvdev(void *priv, void *data)
 	if (ret)
 		return ret;
 
+	ret = nvidia_vgpu_vfio_setup_debugfs(nvdev);
+	if (ret) {
+		nvidia_vgpu_vfio_clean_sysfs(nvdev);
+		return ret;
+	}
+
 	register_pf_driver_event_listener(nvdev);
 	return 0;
 }
diff --git a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
index b5bcde555a5d..04fef4f69793 100644
--- a/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
+++ b/drivers/vfio/pci/nvidia-vgpu/vgpu_mgr.h
@@ -225,5 +225,7 @@ int nvidia_vgpu_rpc_call(struct nvidia_vgpu *vgpu, u32 msg_type,
 void nvidia_vgpu_clean_rpc(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_setup_rpc(struct nvidia_vgpu *vgpu);
 int nvidia_vgpu_mgr_set_bme(struct nvidia_vgpu *vgpu, bool enable);
+struct dentry *nvidia_vgpu_get_debugfs_root(void);
+void nvidia_vgpu_put_debugfs_root(void);
 
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC v2 14/14] vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU
  2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
                   ` (12 preceding siblings ...)
  2025-09-03 22:11 ` [RFC v2 13/14] vfio/nvidia-vgpu: introduce vGPU logging Zhi Wang
@ 2025-09-03 22:11 ` Zhi Wang
  13 siblings, 0 replies; 23+ messages in thread
From: Zhi Wang @ 2025-09-03 22:11 UTC (permalink / raw)
  To: kvm
  Cc: alex.williamson, kevin.tian, jgg, airlied, daniel, dakr, acurrid,
	cjia, smitra, ankita, aniketa, kwankhede, targupta, zhiw, zhiwang

In order to introduce NVIDIA vGPU and the requirements to a core driver,
a kernel doc is introduced to explain the architecture and the
requirements.

Add a kernel doc to introduce NVIDIA vGPU.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 Documentation/gpu/drivers.rst     |   1 +
 Documentation/gpu/nvidia-vgpu.rst | 264 ++++++++++++++++++++++++++++++
 2 files changed, 265 insertions(+)
 create mode 100644 Documentation/gpu/nvidia-vgpu.rst

diff --git a/Documentation/gpu/drivers.rst b/Documentation/gpu/drivers.rst
index 78b80be17f21..abdca636d3ef 100644
--- a/Documentation/gpu/drivers.rst
+++ b/Documentation/gpu/drivers.rst
@@ -11,6 +11,7 @@ GPU Driver Documentation
    mcde
    meson
    nouveau
+   nvidia-vgpu
    pl111
    tegra
    tve200
diff --git a/Documentation/gpu/nvidia-vgpu.rst b/Documentation/gpu/nvidia-vgpu.rst
new file mode 100644
index 000000000000..fb48572c7af2
--- /dev/null
+++ b/Documentation/gpu/nvidia-vgpu.rst
@@ -0,0 +1,264 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. include:: <isonum.txt>
+
+=======================
+NVIDIA vGPU VFIO Driver
+=======================
+
+:Copyright: |copy| 2025, NVIDIA CORPORATION. All rights reserved.
+:Author: Zhi Wang <zhiw@nvidia.com>
+
+
+
+Overview
+========
+
+NVIDIA vGPU[1] software enables powerful GPU performance for workloads
+ranging from graphics-rich virtual workstations to data science and AI,
+enabling IT to leverage the management and security benefits of
+virtualization as well as the performance of NVIDIA GPUs required for
+modern workloads. Installed on a physical GPU in a cloud or enterprise
+data center server, NVIDIA vGPU software creates virtual GPUs that can
+be shared across multiple virtual machines.
+
+The vGPU architecture[2] can be illustrated as follow::
+
+         +--------------------+    +--------------------+ +--------------------+ +--------------------+
+         | Hypervisor         |    | Guest VM           | | Guest VM           | | Guest VM           |
+         |                    |    | +----------------+ | | +----------------+ | | +----------------+ |
+         | +----------------+ |    | |Applications... | | | |Applications... | | | |Applications... | |
+         | |  NVIDIA        | |    | +----------------+ | | +----------------+ | | +----------------+ |
+         | |  Virtual GPU   | |    | +----------------+ | | +----------------+ | | +----------------+ |
+         | |  Manager       | |    | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | |
+         | +------^---------+ |    | +----------------+ | | +----------------+ | | +----------------+ |
+         |        |           |    +---------^----------+ +----------^---------+ +----------^---------+
+         |        |           |              |                       |                      |
+         |        |           +--------------+-----------------------+----------------------+---------+
+         |        |                          |                       |                      |         |
+         |        |                          |                       |                      |         |
+         +--------+--------------------------+-----------------------+----------------------+---------+
+        +---------v--------------------------+-----------------------+----------------------+----------+
+        | NVIDIA                  +----------v---------+ +-----------v--------+ +-----------v--------+ |
+        | Physical GPU            |   Virtual GPU      | |   Virtual GPU      | |   Virtual GPU      | |
+        |                         +--------------------+ +--------------------+ +--------------------+ |
+        +----------------------------------------------------------------------------------------------+
+
+Each NVIDIA vGPU is analogous to a conventional GPU, having a fixed amount
+of GPU framebuffer, and one or more virtual display outputs or "heads".
+The vGPU’s framebuffer is allocated out of the physical GPU’s framebuffer
+at the time the vGPU is created, and the vGPU retains exclusive use of
+that framebuffer until it is destroyed.
+
+The number of physical GPUs that a board has depends on the board. Each
+physical GPU can support several different types of virtual GPU (vGPU).
+vGPU types have a fixed amount of frame buffer, number of supported
+display heads, and maximum resolutions. They are grouped into different
+series according to the different classes of workload for which they are
+optimized. Each series is identified by the last letter of the vGPU type
+name.
+
+NVIDIA vGPU supports Windows and Linux guest VM operating systems. The
+supported vGPU types depend on the guest VM OS.
+
+Architecture
+============
+::
+
+                                    +--------------------+ +--------------------+ +--------------------+
+                                    | Linux VM           | | Windows VM         | | Guest VM           |
+                                    | +----------------+ | | +----------------+ | | +----------------+ |
+                                    | |Applications... | | | |Applications... | | | |Applications... | |
+                                    | +----------------+ | | +----------------+ | | +----------------+ | ...
+                                    | +----------------+ | | +----------------+ | | +----------------+ |
+                                    | |  Guest Driver  | | | |  Guest Driver  | | | |  Guest Driver  | |
+                                    | +----------------+ | | +----------------+ | | +----------------+ |
+                                    +---------^----------+ +----------^---------+ +----------^---------+
+                                              |                       |                      |
+                                   +--------------------------------------------------------------------+
+                                   |+--------------------+ +--------------------+ +--------------------+|
+                                   ||       QEMU         | |       QEMU         | |       QEMU         ||
+                                   ||                    | |                    | |                    ||
+                                   |+--------------------+ +--------------------+ +--------------------+|
+                                   +--------------------------------------------------------------------+
+                                              |                       |                      |
+        +-----------------------------------------------------------------------------------------------+
+        |                           +----------------------------------------------------------------+  |
+        |                           |                                VFIO                            |  |
+        |                           |                                                                |  |
+        | +-----------------------+ | +-------------------------------------------------------------+|  |
+        | |                       | | |                                                             ||  |
+        | |     nova_core        <--->|                                                             ||  |
+        | +    (core driver)      | | |                  NVIDIA vGPU VFIO Driver                    ||  |
+        | |                       | | |                                                             ||  |
+        | |                       | | +-------------------------------------------------------------+|  |
+        | +--------^--------------+ +----------------------------------------------------------------+  |
+        |          |                          |                       |                      |          |
+        +-----------------------------------------------------------------------------------------------+
+                   |                          |                       |                      |
+        +----------|--------------------------|-----------------------|----------------------|----------+
+        |          v               +----------v---------+ +-----------v--------+ +-----------v--------+ |
+        |  NVIDIA                  |       PCI VF       | |       PCI VF       | |       PCI VF       | |
+        |  Physical GPU            |                    | |                    | |                    | |
+        |                          |   (Virtual GPU)    | |   (Virtual GPU)    | |    (Virtual GPU)   | |
+        |                          +--------------------+ +--------------------+ +--------------------+ |
+        +-----------------------------------------------------------------------------------------------+
+
+Each virtual GPU (vGPU) instance is implemented atop a PCIe Virtual
+Function (VF). The NVIDIA vGPU VFIO driver, in coordination with the
+VFIO framework, operates directly on these VFs to enable key
+functionalities including vGPU type selection, dynamic instantiation and
+destruction of vGPU instances, support for live migration, and warm
+update...
+
+At the low level, the NVIDIA vGPU VFIO driver interfaces with a core
+driver aka nova_core, which provides the necessary abstractions and
+mechanisms to access and manipulate the underlying GPU hardware resources.
+
+Core Driver
+===========
+
+The primary deployment model for cloud service providers (CSPs) and
+enterprise environments is to have a standalone, minimal driver stack
+with the vGPU support and other essential components. Thus, a minimal
+core driver is required to support the NVIDIA vGPU VFIO driver.
+
+Requirements To A Core Driver
+=============================
+
+The NVIDIA vGPU VFIO driver searches the supported core drivers by driver
+names when loading. Once a supported core driver is found, the VFIO driver
+generates a core driver handle for the following interactions with the core
+driver.
+
+With the handle, the VFIO driver first check if the vGPU support is
+enabled in the core driver.
+
+The core driver returns vGPU is supported on this PF if:
+
+- The device advertise SRIOV caps.
+- The device is in the supported device list in the core driver.
+- The GSP microcode loaded by the core driver supports vGPU. Some core
+  drivers, e.g. NVKM, can support multiple version of GSP microcode.
+- The required initialization for vGPU support succeeds.
+
+The core driver handle data is per-PF and shared among VFs. It contains the
+two parts: the core driver part and the VFIO driver part. The core driver
+part contains core driver status, capabilities for the VFIO driver to
+validate. The VFIO driver part contains the data registered to the core
+driver. E.g. event handlers, private data.
+
+If the VFIO driver hasn't been attached with the core driver, the VFIO
+driver attaches the handle data with the core driver. The core driver
+functions are available to the VFIO driver after the attachment.
+
+The core driver is responsible for the locking to protect the handle
+data in attachment/detachment as it can be accessed in multiple paths
+of VFIO driver probing/remove.
+
+Beside the core driver attachment and handle management, the core driver
+is required to provide the following functions to support the VFIO driver:
+
+Enumeration:
+
+- The total FB memory size of the current GPU.
+- The available channel amount.
+- The complete engine bitmap.
+
+GSP RPC manipulation:
+
+- Allocate/de-allocate a GSP client.
+- Get the handle of a GSP client.
+- Allocate/de-allocate RM control.
+- Issue RM controls.
+
+Channel ID Management:
+
+The NVIDIA vGPU VFIO driver expects the core driver manages a reserved
+channel pool that is only meant for vGPU. Other coponents in the core
+driver should have the knowledge about the channel ID from the reserved
+pool is for vGPUs. E.g. reporting channel fault and events to the VFIO
+driver. It requires the following functions:
+
+- Allocate channel IDs from the reserved pool.
+- Free the channel IDs.
+
+FB Memory Management:
+
+- Allocate the FB memory by an allocation info.
+  The allocation info contains the requirements from the VFIO driver
+  besides size. E.g. fixed offset allocation, alignment requirements.
+- Free the FB memory.
+
+FB Memory Mapping:
+
+- Map the FB memory to BAR1 and a channel VMM by a mapping info
+  The mapping info contains the requirements from the VFIO driver. E.g.
+  start offset to map the allocated FB memory, map size, huge page,
+  special memory kind.
+- Unmap the FB memory.
+
+CE Workload Submission:
+
+- CE channel allocation/deallocation.
+- Pushbuf manipulation.
+
+Event forwarding:
+- Nonstall event.
+- SRIOV configuration event.
+- Core driver unbinding event.
+
+vGPU types
+==========
+
+Each type of vGPU is designed to meet specific requirements, from
+supporting multiple users with demanding graphics applications to
+powering AI workloads in virtualized environments.
+
+To create a vGPU associated with a vGPU type, the vGPU type blobs are
+required to be uploaded to GSP firmware. A vGPU metadata file is
+introduced to host the vGPU type blobs and will be loaded by the VFIO
+driver from the userspace when loading.
+
+The vGPU metafile can be found at::
+
+        https://github.com/zhiwang-nvidia/vgpu-tools/tree/metadata
+
+
+Create vGPUs
+============
+
+The VFs can be enabled via (for example 2 VFs)::
+
+        echo 2 > /sys/bus/pci/devices/0000\:c1\:00.0/sriov_numvfs
+
+After the VFIO driver is loaded. A sysfs interface is exposed to select
+the vGPU types::
+
+        cat /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/0000:3c:10.0/0000:3e:00.5/nvidia/creatable_vgpu_types
+        ID    : vGPU Name
+        941   : NVIDIA RTX6000-Ada-1Q
+        942   : NVIDIA RTX6000-Ada-2Q
+        943   : NVIDIA RTX6000-Ada-3Q
+        944   : NVIDIA RTX6000-Ada-4Q
+        945   : NVIDIA RTX6000-Ada-6Q
+        946   : NVIDIA RTX6000-Ada-8Q
+        947   : NVIDIA RTX6000-Ada-12Q
+        948   : NVIDIA RTX6000-Ada-16Q
+        949   : NVIDIA RTX6000-Ada-24Q
+        950   : NVIDIA RTX6000-Ada-48Q
+
+A valid vGPU type must be chosen for the VF before using the VFIO device::
+
+        $ echo 941 > /sys/bus/pci/devices/0000\:c1\:00.4/nvidia/current_vgpu_type
+
+To de-select the vGPU type::
+
+        $ echo 0 > /sys/bus/pci/devices/0000\:c1\:00.4/nvidia/current_vgpu_type
+
+Once the vGPU is select, the VFIO device is ready to be used by QEMU. The
+VFIO device must be closed before the user can de-select the vGPU type.
+
+References
+==========
+
+1. See Documentation/driver-api/vfio.rst for more information on VFIO.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-09-06 10:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-03 22:10 [RFC v2 00/14] Introduce NVIDIA GPU Virtualization (vGPU) Support Zhi Wang
2025-09-03 22:10 ` [RFC v2 01/14] vfio/nvidia-vgpu: introduce vGPU lifecycle management prelude Zhi Wang
2025-09-03 22:10 ` [RFC v2 02/14] vfio/nvidia-vgpu: allocate GSP RM client for NVIDIA vGPU manager Zhi Wang
2025-09-03 22:11 ` [RFC v2 03/14] vfio/nvidia-vgpu: introduce vGPU type uploading Zhi Wang
2025-09-04  9:37   ` Danilo Krummrich
2025-09-04  9:41     ` Danilo Krummrich
2025-09-04 12:15       ` Jason Gunthorpe
2025-09-04 12:45         ` Danilo Krummrich
2025-09-04 13:58           ` Jason Gunthorpe
2025-09-04 15:43       ` Zhi Wang
2025-09-06 10:34         ` Danilo Krummrich
2025-09-03 22:11 ` [RFC v2 04/14] vfio/nvidia-vgpu: allocate vGPU channels when creating vGPUs Zhi Wang
2025-09-03 22:11 ` [RFC v2 05/14] vfio/nvidia-vgpu: allocate vGPU FB memory " Zhi Wang
2025-09-03 22:11 ` [RFC v2 06/14] vfio/nvidia-vgpu: allocate mgmt heap " Zhi Wang
2025-09-03 22:11 ` [RFC v2 07/14] vfio/nvidia-vgpu: map mgmt heap when creating a vGPU Zhi Wang
2025-09-03 22:11 ` [RFC v2 08/14] vfio/nvidia-vgpu: allocate GSP RM client when creating vGPUs Zhi Wang
2025-09-03 22:11 ` [RFC v2 09/14] vfio/nvidia-vgpu: bootload the new vGPU Zhi Wang
2025-09-03 22:11 ` [RFC v2 10/14] vfio/nvidia-vgpu: introduce vGPU host RPC channel Zhi Wang
2025-09-03 22:36   ` Timur Tabi
2025-09-03 22:11 ` [RFC v2 11/14] vfio/nvidia-vgpu: introduce NVIDIA vGPU VFIO variant driver Zhi Wang
2025-09-03 22:11 ` [RFC v2 12/14] vfio/nvidia-vgpu: scrub the guest FB memory of a vGPU Zhi Wang
2025-09-03 22:11 ` [RFC v2 13/14] vfio/nvidia-vgpu: introduce vGPU logging Zhi Wang
2025-09-03 22:11 ` [RFC v2 14/14] vfio/nvidia-vgpu: add a kernel doc to introduce NVIDIA vGPU Zhi Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox