public inbox for kvm@vger.kernel.org
From: David Matlack <dmatlack@google.com>
To: Rubin Du <rubind@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>, Shuah Khan <shuah@kernel.org>,
	kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10 2/2] selftests/vfio: Add NVIDIA Falcon driver for DMA testing
Date: Fri, 27 Mar 2026 22:38:00 +0000	[thread overview]
Message-ID: <accGyPMCOM_Y5VxK@google.com> (raw)
In-Reply-To: <20260325214329.464727-3-rubind@nvidia.com>

On 2026-03-25 02:43 PM, Rubin Du wrote:
> Add a new VFIO PCI driver for NVIDIA GPUs that enables DMA testing
> via the Falcon (Fast Logic Controller) microcontrollers. This driver
> extracts and adapts the DMA test functionality from NVIDIA's
> gpu-admin-tools project and integrates it into the existing VFIO
> selftest framework.
> 
> Falcons are general-purpose microcontrollers present on NVIDIA GPUs
> that can perform DMA operations between system memory and device
> memory. By leveraging Falcon DMA, this driver allows NVIDIA GPUs to
> be tested alongside Intel IOAT and DSA devices using the same
> selftest infrastructure.

> The driver is named 'nv_falcon' to reflect that it specifically
> controls the Falcon microcontrollers for DMA operations, rather
> than exposing general GPU functionality.

Do you foresee this driver ever expanding to use functionality on the
GPU other than the Falcon microcontroller, e.g. to add send_msi()
support in the future and/or support asynchronous memcpy?

If so, the name "nv_falcon" could become outdated, and maybe "nv_gpu"
would be a better name for this driver?

> Reference implementation:
> https://github.com/NVIDIA/gpu-admin-tools
> 
> Signed-off-by: Rubin Du <rubind@nvidia.com>
> ---
>  .../vfio/lib/drivers/nv_falcons/hw.h          | 365 +++++++++
>  .../vfio/lib/drivers/nv_falcons/nv_falcons.c  | 739 ++++++++++++++++++

Please make the directory and file names match the driver. i.e.
s/nv_falcons/nv_falcon/.

> diff --git a/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h
> new file mode 100644
> index 000000000000..a92cdcfec63f
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h
> @@ -0,0 +1,365 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> + */
> +#ifndef _NV_FALCONS_HW_H_
> +#define _NV_FALCONS_HW_H_
> +
> +/* PMC (Power Management Controller) Registers */
> +#define NV_PMC_BOOT_0							0x00000000
> +#define NV_PMC_ENABLE							0x00000200
> +#define NV_PMC_ENABLE_PWR						0x00002000
> +#define NV_PMC_ENABLE_HUB						0x20000000

Please make the alignment of macro values consistent here and below.

> +
> +/* Falcon Base Pages for Different Engines */
> +#define NV_PPWR_FALCON_BASE						0x10a000
> +#define NV_PGSP_FALCON_BASE						0x110000
> +
> +/* Falcon Common Register Offsets (relative to base_page) */
> +#define NV_FALCON_DMACTL_OFFSET					0x010c
> +#define NV_FALCON_CPUCTL_OFFSET					0x0100
> +#define NV_FALCON_ENGINE_RESET_OFFSET			0x03c0
> +
> +/* DMEM Control Register Flags */
> +#define NV_PPWR_FALCON_DMEMC_AINCR_TRUE			0x01000000
> +#define NV_PPWR_FALCON_DMEMC_AINCW_TRUE			0x02000000
> +
> +/* Falcon DMEM port offsets (for port 0) */
> +#define NV_FALCON_DMEMC_OFFSET					0x1c0
> +#define NV_FALCON_DMEMD_OFFSET					0x1c4
> +
> +/* DMA Register Offsets (relative to base_page) */
> +#define NV_FALCON_DMA_ADDR_LOW_OFFSET			0x110
> +#define NV_FALCON_DMA_MEM_OFFSET				0x114
> +#define NV_FALCON_DMA_CMD_OFFSET				0x118
> +#define NV_FALCON_DMA_BLOCK_OFFSET				0x11c
> +#define NV_FALCON_DMA_ADDR_HIGH_OFFSET			0x128

> +struct gpu_device {
> +	enum gpu_arch arch;
> +	void *bar0;
> +	bool is_memory_clear_supported;
> +	const struct falcon *falcon;
> +	u32 pmc_enable_mask;
> +	bool fsp_dma_enabled;
> +
> +	/* Pending memcpy parameters, set by memcpy_start() */
> +	u64 memcpy_src;
> +	u64 memcpy_dst;
> +	u64 memcpy_size;
> +	u64 memcpy_count;
> +};

nit: I would move this to the driver .c file since that's where it's
used and this isn't a hardware definition. Having the struct definition
in the same file will make the driver easier to read.

> diff --git a/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c
> new file mode 100644
> index 000000000000..4b62793570a2
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c
> @@ -0,0 +1,739 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> + */
> +#include <stdint.h>
> +#include <strings.h>
> +#include <unistd.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#include <linux/errno.h>
> +#include <linux/io.h>
> +#include <linux/pci_ids.h>
> +
> +#include <libvfio.h>
> +
> +#include "hw.h"
> +
> +static inline struct gpu_device *to_nv_gpu(struct vfio_pci_device *device)

nit: Please align the naming here. i.e. rename this function to
to_gpu_device() or rename struct gpu_device to struct nv_gpu.

> +{
> +	return device->driver.region.vaddr;
> +}

> +static int gpu_poll_register(struct vfio_pci_device *device,
> +			     const char *name, u32 offset,
> +			     u32 expected, u32 mask, u32 timeout_ms)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	u32 value;
> +	struct timespec start, now;
> +	u64 elapsed_ms;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +
> +	for (;;) {
> +		value = gpu_read32(gpu, offset);
> +		if ((value & mask) == expected)
> +			return 0;
> +
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
> +			     + (now.tv_nsec - start.tv_nsec) / 1000000;
> +
> +		if (elapsed_ms >= timeout_ms)
> +			break;
> +
> +		usleep(1000);
> +	}
> +
> +	dev_err(device,
> +		"Timeout polling %s (0x%x): value=0x%x expected=0x%x mask=0x%x after %llu ms\n",
> +		name, offset, value, expected, mask,
> +		(unsigned long long)elapsed_ms);
> +	return -ETIMEDOUT;

Please fix all the callers of this function that ignore the return
value. Presumably you should be using VFIO_ASSERT_EQ(ret, 0) to bail the
test if something goes wrong (other than during memcpy_wait(), which has
a way to return an error to the caller).

> +static void gpu_enable_bus_master(struct vfio_pci_device *device)
> +{
> +	u16 cmd;
> +
> +	cmd = vfio_pci_config_readw(device, PCI_COMMAND);
> +	vfio_pci_config_writew(device, PCI_COMMAND, cmd | PCI_COMMAND_MASTER);
> +}
> +
> +static void gpu_disable_bus_master(struct vfio_pci_device *device)
> +{
> +	u16 cmd;
> +
> +	cmd = vfio_pci_config_readw(device, PCI_COMMAND);
> +	vfio_pci_config_writew(device, PCI_COMMAND, cmd & ~PCI_COMMAND_MASTER);
> +}

These can be generic to any PCI device so please add them as static
inline helpers in vfio_pci_device.h for other tests to use in the future
(and rename them to something more generic like
vfio_pci_disable_bus_master()).

> +static int nv_gpu_falcon_dma(struct vfio_pci_device *device,
> +			     u64 address,
> +			     u32 size_encoding,
> +			     bool write)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	const struct falcon *falcon = gpu->falcon;
> +	u32 dma_cmd;
> +	int ret;
> +
> +	gpu_write32(gpu, NV_GPU_DMA_ADDR_TOP_BITS_REG,
> +		    (address >> 47) & 0x1ffff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_ADDR_HIGH_OFFSET,
> +		    (address >> 40) & 0x7f);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_ADDR_LOW_OFFSET,
> +		    (address >> 8) & 0xffffffff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_BLOCK_OFFSET,
> +		    address & 0xff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_MEM_OFFSET, 0);
> +
> +	dma_cmd = (size_encoding << NV_FALCON_DMA_CMD_SIZE_SHIFT);
> +
> +	/* Set direction: write (DMEM->mem) or read (mem->DMEM) */
> +	if (write)
> +		dma_cmd |= NV_FALCON_DMA_CMD_WRITE_BIT;
> +
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_CMD_OFFSET, dma_cmd);
> +
> +	ret = gpu_poll_register(device, "dma_done",
> +		falcon->base_page + NV_FALCON_DMA_CMD_OFFSET,
> +		NV_FALCON_DMA_CMD_DONE_BIT, NV_FALCON_DMA_CMD_DONE_BIT,
> +		1000);
> +	if (ret)
> +		return ret;
> +
> +	return 0;


nit: This can just be "return ret;" or "return gpu_poll_register(...);".

> +static void nv_gpu_init(struct vfio_pci_device *device)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	const struct gpu_properties *props;
> +	enum gpu_arch gpu_arch;
> +	u32 pmc_boot_0;
> +	int ret;
> +
> +	VFIO_ASSERT_GE(device->driver.region.size, sizeof(*gpu));
> +
> +	/* Read PMC_BOOT_0 register from BAR0 to identify GPU */
> +	pmc_boot_0 = readl(device->bars[0].vaddr + NV_PMC_BOOT_0);
> +
> +	/* Look up GPU architecture */
> +	gpu_arch = nv_gpu_arch_lookup(pmc_boot_0);
> +	if (gpu_arch == GPU_ARCH_UNKNOWN) {
> +		dev_err(device, "Unsupported GPU architecture\n");
> +		return;
> +	}
> +
> +	props = &gpu_properties_map[gpu_arch];
> +
> +	/* Populate GPU structure */
> +	gpu->arch = gpu_arch;
> +	gpu->bar0 = device->bars[0].vaddr;
> +	gpu->is_memory_clear_supported = props->memory_clear_supported;
> +	gpu->falcon = &falcon_map[props->falcon_type];
> +	gpu->pmc_enable_mask = props->pmc_enable_mask;
> +
> +	falcon_enable(device);
> +
> +	/* Initialize falcon for DMA */
> +	ret = nv_gpu_falcon_dma_init(device);
> +	VFIO_ASSERT_EQ(ret, 0, "Failed to initialize falcon DMA: %d\n", ret);
> +
> +	device->driver.features |= VFIO_PCI_DRIVER_F_NO_SEND_MSI;
> +	device->driver.max_memcpy_size = NV_FALCON_DMA_MAX_TRANSFER_SIZE;
> +	device->driver.max_memcpy_count = NV_FALCON_DMA_MAX_TRANSFER_COUNT;

The small memcpy size is going to be a problem for tests that want to
DMA to larger regions. They will have to chunk up the memcpy in a loop.

One option would be to have this driver expose a larger max_memcpy_size
and implement the loop in the memcpy_start()/wait() callback, which is
actually what you have below.

But I think a better approach would be to just make the loop common
across all drivers so that tests never have to care about
max_memcpy_size when using vfio_pci_driver_memcpy().

Can you add a precursor commit to this series that adds a loop to
vfio_pci_driver_memcpy() so that it performs whatever sized memcpy the
user asks for by chunking it up in to max_memcpy_size or smaller
chunks? In that same commit, please update the vfio_pci_driver_test so
we get coverage of the new chunking logic (get rid of the code that
rounds down self->size to max_memcpy_size, except for the memcpy_storm
test).

In other words:

 - vfio_pci_driver_memcpy() should allow any size memcpy and will chunk
   it up as necessary depending on what the driver supports.

 - vfio_pci_driver_memcpy_start() semantics don't change. This still
   gives tests direct access to the driver memcpy_start() op and thus
   must obey max_memcpy_size.

This should also allow you to greatly simplify the implementation of
memcpy_wait() down below.

> +static int nv_gpu_memcpy_wait(struct vfio_pci_device *device)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	u64 iteration;
> +	u64 offset;
> +	int ret;
> +
> +	VFIO_ASSERT_NE(gpu->memcpy_count, 0);
> +
> +	for (iteration = 0; iteration < gpu->memcpy_count; iteration++) {

This loop is unnecessary. You set max_memcpy_count to
NV_FALCON_DMA_MAX_TRANSFER_COUNT which is 1. So
vfio_pci_driver_memcpy_start() will guarantee that memcpy_count is at
most 1.

> +		offset = 0;
> +
> +		while (offset < gpu->memcpy_size) {
> +			int chunk_encoding;
> +
> +			chunk_encoding = size_to_dma_encoding(
> +						gpu->memcpy_size - offset);

This loop is also unnecessary. You set max_memcpy_size to
NV_FALCON_DMA_MAX_TRANSFER_SIZE, so vfio_pci_driver_memcpy_start() will
guarantee memcpy_size is at most that.

> +			if (chunk_encoding < 0) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +
> +			ret = nv_gpu_memcpy_chunk(device,
> +						  gpu->memcpy_src + offset,
> +						  gpu->memcpy_dst + offset,
> +						  chunk_encoding);
> +			if (ret)
> +				goto out;
> +
> +			offset += 0x4 << chunk_encoding;
> +		}
> +	}
> +
> +	ret = 0;
> +out:
> +	gpu->memcpy_count = 0;
> +
> +	return ret;
> +}

  reply	other threads:[~2026-03-27 22:38 UTC|newest]

Thread overview: 7+ messages
2026-03-25 21:43 [PATCH v10 0/2] selftests/vfio: Add NVIDIA GPU Falcon DMA test driver Rubin Du
2026-03-25 21:43 ` [PATCH v10 1/2] selftests/vfio: Add NO_SEND_MSI feature flag and MSI helper macros Rubin Du
2026-03-27 22:04   ` David Matlack
2026-03-30 20:15     ` Rubin Du
2026-03-25 21:43 ` [PATCH v10 2/2] selftests/vfio: Add NVIDIA Falcon driver for DMA testing Rubin Du
2026-03-27 22:38   ` David Matlack [this message]
2026-03-30 20:15     ` Rubin Du
