Re: [PATCH v10 2/2] selftests/vfio: Add NVIDIA Falcon driver for DMA testing

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Matlack <dmatlack@google.com>
To: Rubin Du <rubind@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>, Shuah Khan <shuah@kernel.org>,
	kvm@vger.kernel.org, linux-kselftest@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10 2/2] selftests/vfio: Add NVIDIA Falcon driver for DMA testing
Date: Fri, 27 Mar 2026 22:38:00 +0000	[thread overview]
Message-ID: <accGyPMCOM_Y5VxK@google.com> (raw)
In-Reply-To: <20260325214329.464727-3-rubind@nvidia.com>

On 2026-03-25 02:43 PM, Rubin Du wrote:
> Add a new VFIO PCI driver for NVIDIA GPUs that enables DMA testing
> via the Falcon (Fast Logic Controller) microcontrollers. This driver
> extracts and adapts the DMA test functionality from NVIDIA's
> gpu-admin-tools project and integrates it into the existing VFIO
> selftest framework.
> 
> Falcons are general-purpose microcontrollers present on NVIDIA GPUs
> that can perform DMA operations between system memory and device
> memory. By leveraging Falcon DMA, this driver allows NVIDIA GPUs to
> be tested alongside Intel IOAT and DSA devices using the same
> selftest infrastructure.

> The driver is named 'nv_falcon' to reflect that it specifically
> controls the Falcon microcontrollers for DMA operations, rather
> than exposing general GPU functionality.

Do you forsee this driver ever expanding to utilize other functionality
on the GPU  other than the Falcon microcontroller, e.g. to add
send_msi() support in the future and/or support asynchronous memcpy?

If so, the name "nv_falcon" could get out of date and maybe "nv_gpu"
would be a better name for this driver?

> Reference implementation:
> https://github.com/NVIDIA/gpu-admin-tools
> 
> Signed-off-by: Rubin Du <rubind@nvidia.com>
> ---
>  .../vfio/lib/drivers/nv_falcons/hw.h          | 365 +++++++++
>  .../vfio/lib/drivers/nv_falcons/nv_falcons.c  | 739 ++++++++++++++++++

Please make the directory and file names match the driver. i.e.
s/nv_falcons/nv_falcon/.

> diff --git a/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h
> new file mode 100644
> index 000000000000..a92cdcfec63f
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/hw.h
> @@ -0,0 +1,365 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> + */
> +#ifndef _NV_FALCONS_HW_H_
> +#define _NV_FALCONS_HW_H_
> +
> +/* PMC (Power Management Controller) Registers */
> +#define NV_PMC_BOOT_0							0x00000000
> +#define NV_PMC_ENABLE							0x00000200
> +#define NV_PMC_ENABLE_PWR						0x00002000
> +#define NV_PMC_ENABLE_HUB						0x20000000

Please make the alignment of macro values consistent here and below.

> +
> +/* Falcon Base Pages for Different Engines */
> +#define NV_PPWR_FALCON_BASE						0x10a000
> +#define NV_PGSP_FALCON_BASE						0x110000
> +
> +/* Falcon Common Register Offsets (relative to base_page) */
> +#define NV_FALCON_DMACTL_OFFSET					0x010c
> +#define NV_FALCON_CPUCTL_OFFSET					0x0100
> +#define NV_FALCON_ENGINE_RESET_OFFSET			0x03c0
> +
> +/* DMEM Control Register Flags */
> +#define NV_PPWR_FALCON_DMEMC_AINCR_TRUE			0x01000000
> +#define NV_PPWR_FALCON_DMEMC_AINCW_TRUE			0x02000000
> +
> +/* Falcon DMEM port offsets (for port 0) */
> +#define NV_FALCON_DMEMC_OFFSET					0x1c0
> +#define NV_FALCON_DMEMD_OFFSET					0x1c4
> +
> +/* DMA Register Offsets (relative to base_page) */
> +#define NV_FALCON_DMA_ADDR_LOW_OFFSET			0x110
> +#define NV_FALCON_DMA_MEM_OFFSET				0x114
> +#define NV_FALCON_DMA_CMD_OFFSET				0x118
> +#define NV_FALCON_DMA_BLOCK_OFFSET				0x11c
> +#define NV_FALCON_DMA_ADDR_HIGH_OFFSET			0x128

> +struct gpu_device {
> +	enum gpu_arch arch;
> +	void *bar0;
> +	bool is_memory_clear_supported;
> +	const struct falcon *falcon;
> +	u32 pmc_enable_mask;
> +	bool fsp_dma_enabled;
> +
> +	/* Pending memcpy parameters, set by memcpy_start() */
> +	u64 memcpy_src;
> +	u64 memcpy_dst;
> +	u64 memcpy_size;
> +	u64 memcpy_count;
> +};

nit: I would move this to the driver .c file since that's where its used
and this isn't a hardware definition. It will make the driver easier to
read to have this struct definition in the same file.

> diff --git a/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c
> new file mode 100644
> index 000000000000..4b62793570a2
> --- /dev/null
> +++ b/tools/testing/selftests/vfio/lib/drivers/nv_falcons/nv_falcons.c
> @@ -0,0 +1,739 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
> + */
> +#include <stdint.h>
> +#include <strings.h>
> +#include <unistd.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <time.h>
> +
> +#include <linux/errno.h>
> +#include <linux/io.h>
> +#include <linux/pci_ids.h>
> +
> +#include <libvfio.h>
> +
> +#include "hw.h"
> +
> +static inline struct gpu_device *to_nv_gpu(struct vfio_pci_device *device)

nit: Please align the naming here. i.e. rename this function to
to_gpu_device() or rename struct gpu_device to struct nv_gpu.

> +{
> +	return device->driver.region.vaddr;
> +}

> +static int gpu_poll_register(struct vfio_pci_device *device,
> +			     const char *name, u32 offset,
> +			     u32 expected, u32 mask, u32 timeout_ms)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	u32 value;
> +	struct timespec start, now;
> +	u64 elapsed_ms;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &start);
> +
> +	for (;;) {
> +		value = gpu_read32(gpu, offset);
> +		if ((value & mask) == expected)
> +			return 0;
> +
> +		clock_gettime(CLOCK_MONOTONIC, &now);
> +		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
> +			     + (now.tv_nsec - start.tv_nsec) / 1000000;
> +
> +		if (elapsed_ms >= timeout_ms)
> +			break;
> +
> +		usleep(1000);
> +	}
> +
> +	dev_err(device,
> +		"Timeout polling %s (0x%x): value=0x%x expected=0x%x mask=0x%x after %llu ms\n",
> +		name, offset, value, expected, mask,
> +		(unsigned long long)elapsed_ms);
> +	return -ETIMEDOUT;

Please fix all the callers of this function that ignore the return
value. Presumably you should be using VFIO_ASSERT_EQ(ret, 0) to bail the
test if something goes wrong (other than during memcpy_wait(), which has
a way to return an error to the caller).

> +static void gpu_enable_bus_master(struct vfio_pci_device *device)
> +{
> +	u16 cmd;
> +
> +	cmd = vfio_pci_config_readw(device, PCI_COMMAND);
> +	vfio_pci_config_writew(device, PCI_COMMAND, cmd | PCI_COMMAND_MASTER);
> +}
> +
> +static void gpu_disable_bus_master(struct vfio_pci_device *device)
> +{
> +	u16 cmd;
> +
> +	cmd = vfio_pci_config_readw(device, PCI_COMMAND);
> +	vfio_pci_config_writew(device, PCI_COMMAND, cmd & ~PCI_COMMAND_MASTER);
> +}

These can be generic to any PCI device so please add them as static
inline helpers in vfio_pci_device.h for other tests to use in the future
(and rename them to something more generic like
vfio_pci_disable_bus_master()).

> +static int nv_gpu_falcon_dma(struct vfio_pci_device *device,
> +			     u64 address,
> +			     u32 size_encoding,
> +			     bool write)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	const struct falcon *falcon = gpu->falcon;
> +	u32 dma_cmd;
> +	int ret;
> +
> +	gpu_write32(gpu, NV_GPU_DMA_ADDR_TOP_BITS_REG,
> +		    (address >> 47) & 0x1ffff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_ADDR_HIGH_OFFSET,
> +		    (address >> 40) & 0x7f);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_ADDR_LOW_OFFSET,
> +		    (address >> 8) & 0xffffffff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_BLOCK_OFFSET,
> +		    address & 0xff);
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_MEM_OFFSET, 0);
> +
> +	dma_cmd = (size_encoding << NV_FALCON_DMA_CMD_SIZE_SHIFT);
> +
> +	/* Set direction: write (DMEM->mem) or read (mem->DMEM) */
> +	if (write)
> +		dma_cmd |= NV_FALCON_DMA_CMD_WRITE_BIT;
> +
> +	gpu_write32(gpu, falcon->base_page + NV_FALCON_DMA_CMD_OFFSET, dma_cmd);
> +
> +	ret = gpu_poll_register(device, "dma_done",
> +		falcon->base_page + NV_FALCON_DMA_CMD_OFFSET,
> +		NV_FALCON_DMA_CMD_DONE_BIT, NV_FALCON_DMA_CMD_DONE_BIT,
> +		1000);
> +	if (ret)
> +		return ret;
> +
> +	return 0;


nit: This can just be "return ret;" or "return gpu_poll_register(...);".

> +static void nv_gpu_init(struct vfio_pci_device *device)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	const struct gpu_properties *props;
> +	enum gpu_arch gpu_arch;
> +	u32 pmc_boot_0;
> +	int ret;
> +
> +	VFIO_ASSERT_GE(device->driver.region.size, sizeof(*gpu));
> +
> +	/* Read PMC_BOOT_0 register from BAR0 to identify GPU */
> +	pmc_boot_0 = readl(device->bars[0].vaddr + NV_PMC_BOOT_0);
> +
> +	/* Look up GPU architecture */
> +	gpu_arch = nv_gpu_arch_lookup(pmc_boot_0);
> +	if (gpu_arch == GPU_ARCH_UNKNOWN) {
> +		dev_err(device, "Unsupported GPU architecture\n");
> +		return;
> +	}
> +
> +	props = &gpu_properties_map[gpu_arch];
> +
> +	/* Populate GPU structure */
> +	gpu->arch = gpu_arch;
> +	gpu->bar0 = device->bars[0].vaddr;
> +	gpu->is_memory_clear_supported = props->memory_clear_supported;
> +	gpu->falcon = &falcon_map[props->falcon_type];
> +	gpu->pmc_enable_mask = props->pmc_enable_mask;
> +
> +	falcon_enable(device);
> +
> +	/* Initialize falcon for DMA */
> +	ret = nv_gpu_falcon_dma_init(device);
> +	VFIO_ASSERT_EQ(ret, 0, "Failed to initialize falcon DMA: %d\n", ret);
> +
> +	device->driver.features |= VFIO_PCI_DRIVER_F_NO_SEND_MSI;
> +	device->driver.max_memcpy_size = NV_FALCON_DMA_MAX_TRANSFER_SIZE;
> +	device->driver.max_memcpy_count = NV_FALCON_DMA_MAX_TRANSFER_COUNT;

The small memcpy size is going to be a problem tests that want to DMA to
larger regions. They will have to do chunk up the memcpy in a loop.

One option would be to have this driver expose a larger max_memcpy_size
and implement the loop in the memcpy_start()/wait() callback, which is
actually what you have below.

But I think a better approach would be to just make the loop common
across all drivers so that tests never have to care about
max_memcpy_size when using vfio_pci_driver_memcpy().

Can you add a precursor commit to this series that adds a loop to
vfio_pci_driver_memcpy() so that it performs whatever sized memcpy the
user asks for by chunking it up in to max_memcpy_size or smaller
chunks? In that same commit, please update the vfio_pci_driver_test
so we get coverage of the new chunking logic (get rid of the code that
rounds down self->size fo max_memcpy_size, except for the memcpy_storm
test).

In other words:

 - vfio_pci_driver_memcpy() should allow any size memcpy and will chunk
   it up as necessary depending on what the driver supports.

 - vfio_pci_driver_memcpy_start() semantics don't change. This still
   gives tests direct access to the driver memcpy_start() op and thus
   must obey max_memcpy_size.

This should also allow you to greatly simplify the implementation of
memcpy_wait() down below.

> +static int nv_gpu_memcpy_wait(struct vfio_pci_device *device)
> +{
> +	struct gpu_device *gpu = to_nv_gpu(device);
> +	u64 iteration;
> +	u64 offset;
> +	int ret;
> +
> +	VFIO_ASSERT_NE(gpu->memcpy_count, 0);
> +
> +	for (iteration = 0; iteration < gpu->memcpy_count; iteration++) {

This loop is unnecessary. You set max_memcpy_count to
NV_FALCON_DMA_MAX_TRANSFER_COUNT which is 1. So
vfio_pci_driver_memcpy_start() will guarantee that memcpy_count is at
most 1.

> +		offset = 0;
> +
> +		while (offset < gpu->memcpy_size) {
> +			int chunk_encoding;
> +
> +			chunk_encoding = size_to_dma_encoding(
> +						gpu->memcpy_size - offset);

This loop is also unnecessary. You set max_memcpy_size to
NV_FALCON_DMA_MAX_TRANSFER_SIZE, so vfio_pci_driver_memcpy_start() will
guarantee memcpy_size is at most that.

> +			if (chunk_encoding < 0) {
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +
> +			ret = nv_gpu_memcpy_chunk(device,
> +						  gpu->memcpy_src + offset,
> +						  gpu->memcpy_dst + offset,
> +						  chunk_encoding);
> +			if (ret)
> +				goto out;
> +
> +			offset += 0x4 << chunk_encoding;
> +		}
> +	}
> +
> +	ret = 0;
> +out:
> +	gpu->memcpy_count = 0;
> +
> +	return ret;
> +}

next prev parent reply	other threads:[~2026-03-27 22:38 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-25 21:43 [PATCH v10 0/2] selftests/vfio: Add NVIDIA GPU Falcon DMA test driver Rubin Du
2026-03-25 21:43 ` [PATCH v10 1/2] selftests/vfio: Add NO_SEND_MSI feature flag and MSI helper macros Rubin Du
2026-03-27 22:04   ` David Matlack
2026-03-30 20:15     ` Rubin Du
2026-03-25 21:43 ` [PATCH v10 2/2] selftests/vfio: Add NVIDIA Falcon driver for DMA testing Rubin Du
2026-03-27 22:38   ` David Matlack [this message]
2026-03-30 20:15     ` Rubin Du

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=accGyPMCOM_Y5VxK@google.com \
    --to=dmatlack@google.com \
    --cc=alex@shazbot.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=rubind@nvidia.com \
    --cc=shuah@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.