DPDK-dev Archive on lore.kernel.org
From: Stephen Hemminger <stephen@networkplumber.org>
To: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: dev@dpdk.org
Subject: Re: [PATCH v8 00/18] Support VFIO cdev API in DPDK
Date: Thu, 11 Jun 2026 10:49:40 -0700
Message-ID: <20260611104940.41312f98@phoenix.local>
In-Reply-To: <cover.1781190151.git.anatoly.burakov@intel.com>

On Thu, 11 Jun 2026 16:08:52 +0100
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> This patchset introduces a major refactor of the VFIO subsystem in DPDK to
> support character device (cdev) interface introduced in Linux kernel, as well as
> make the API more streamlined and useful. The goal is to simplify device
> management, improve compatibility, and clarify API responsibilities.
> 
> The following sections outline the key issues addressed by this patchset and the
> corresponding changes introduced.
> 
> 1. Only group mode is supported
> ===============================
> 
> Since kernel version 4.14.327 (LTS), VFIO supports the new character device
> (cdev)-based way of working with VFIO devices (otherwise known as IOMMUFD). This
> is a device-centric mode and does away with all the complexity regarding groups
> and IOMMU types, delegating it all to the kernel, and exposes a much simpler
> interface to userspace.
> 
> The old group interface is still around, and will need to be kept in DPDK both
> for compatibility reasons, as well as supporting special cases (FSLMC bus, NBL
> driver, no-IOMMU mode etc.).
> 
> To enable this, VFIO is heavily refactored, so that the code can support both
> modes while relying on (mostly) common infrastructure.
> 
> Note that the existing `rte_vfio_device_setup/release` model is fundamentally
> incompatible with cdev mode, because for custom container cases, the expected
> flow is that the user binds the IOMMU group (and thus, implicitly, the device
> itself) to a specific container using `rte_vfio_container_group_bind`, whereas
> this step is not needed for cdev as the device fd is assigned to the container
> straight away.
> 
> Therefore, what we do instead is introduce a new API for container device
> assignment which, semantically, will assign a device to specified container, so
> that when it is mapped using `rte_pci_map_device`, the appropriate container is
> selected. Under the hood though, we essentially transition to getting device fd
> straight away at assign stage, so that by the time the PCI bus attempts to map
> the device, it is already mapped and we just return an fd. There is no
> "unassign" API because `release_device` already performs that function.
> 
> Additionally, a new `rte_vfio_get_mode` API is added for those cases that need
> some introspection into VFIO's internals, with three new modes: group
> (old-style), no-iommu (old-style but without IOMMU), and cdev (the new mode).
> Although no-IOMMU is technically a variant of group mode, the distinction is
> largely irrelevant to the user, as all usages of noiommu checks in our codebase
> are for deciding whether to use IOVA or PA, not anything to do with managing
> groups. The current plan for kernel community is to *not* introduce no-IOMMU
> cdev implementation, and IOMMUFD's own group API compatibility layer also does
> not implement no-IOMMU mode, which is why this will be kept for compatibility
> for these use cases.
> 
> There were other users of VFIO which relied on group API but only for convenience
> purposes; no actual VFIO functionality depended on those API's. Therefore, group
> API's are removed and, where appropriate, replaced with the new API's.
> 
> List of removed API's:
> 
> * `rte_vfio_get_group_fd`
> * `rte_vfio_clear_group`
> * `rte_vfio_container_group_bind` (replaced by container assign API)
> * `rte_vfio_container_group_unbind`
> * `rte_vfio_noiommu_is_enabled` (replaced by new mode API)
> 
> 2. The API responsibilities aren't clear and bleed into each other
> ==================================================================
> 
> Some API's do multiple things at once. In particular:
> 
> * `rte_vfio_get_device_info` will setup the device
> * `rte_vfio_setup_device` will get device info
> 
> These API's have been adjusted to do one thing only.
> 
> v8:
> - Rebase
> - Fixed build errors due to variable shadowing
> - Removed duplicate fd check as kernel does not provide a way to distinguish
>   between device fd's
> 
> v7:
> - Rebase
> - Added removal of deprecation notices
> - Fixed implicit numeric comparison in patch 12
> 
> v6:
> - Fixed missing header include in vfio cdev file
> 
> v5:
> - Added back missing uapi patch
> 
> v4:
> - Fixed issues with documenting rte_vfio_mode enum
> - Separated deprecation notices into a separate patchset
> 
> v3:
> - Make API removal cleaner
> - Fix `get_group_num` usages to align with new API
> - Fix issues with function exports
> - Fix issues with `setup_device` returning old-style values in some cases
> 
> v2:
> - Make the entire API internal
> - More aggressive API pruning, complete removal of group API
> - Fixed a bug in group mode where device could not be used
> - Better documentation and deprecation notice patches
> - Moved doc patches to beginning of patchset
> 
> Anatoly Burakov (18):
>   uapi: update to v6.17 and add iommufd.h
>   vfio: make all functions internal
>   vfio: split get device info from setup
>   vfio: add container device assignment API
>   net/nbl: do not use VFIO group bind API
>   net/ntnic: use container device assignment API
>   vdpa/ifc: use container device assignment API
>   vdpa/nfp: use container device assignment API
>   vdpa/sfc: use container device assignment API
>   vhost: remove group-related API from drivers
>   vfio: remove group-based API
>   vfio: cleanup and refactor
>   bus/pci: use the new VFIO mode API
>   bus/fslmc: use the new VFIO mode API
>   net/hinic3: use the new VFIO mode API
>   net/ntnic: use the new VFIO mode API
>   vfio: remove no-IOMMU check API
>   vfio: introduce cdev mode
> 
>  config/arm/meson.build                    |    1 +
>  config/meson.build                        |    1 +
>  doc/guides/prog_guide/vhost_lib.rst       |    4 -
>  doc/guides/rel_notes/deprecation.rst      |   10 -
>  drivers/bus/cdx/cdx_vfio.c                |   25 +-
>  drivers/bus/fslmc/fslmc_bus.c             |   10 +-
>  drivers/bus/fslmc/fslmc_vfio.c            |    6 +-
>  drivers/bus/pci/linux/pci.c               |    2 +-
>  drivers/bus/pci/linux/pci_vfio.c          |   33 +-
>  drivers/bus/platform/platform.c           |    9 +-
>  drivers/crypto/bcmfs/bcmfs_vfio.c         |   14 +-
>  drivers/net/hinic3/base/hinic3_hwdev.c    |    3 +-
>  drivers/net/nbl/nbl_common/nbl_userdev.c  |   20 +-
>  drivers/net/nbl/nbl_include/nbl_include.h |    1 +
>  drivers/net/ntnic/ntnic_ethdev.c          |    2 +-
>  drivers/net/ntnic/ntnic_vfio.c            |   30 +-
>  drivers/vdpa/ifc/ifcvf_vdpa.c             |   34 +-
>  drivers/vdpa/mlx5/mlx5_vdpa.c             |    1 -
>  drivers/vdpa/nfp/nfp_vdpa.c               |   37 +-
>  drivers/vdpa/sfc/sfc_vdpa.c               |   39 +-
>  drivers/vdpa/sfc/sfc_vdpa.h               |    2 -
>  kernel/linux/uapi/linux/iommufd.h         | 1292 +++++++++++
>  kernel/linux/uapi/linux/vduse.h           |    2 +-
>  kernel/linux/uapi/linux/vfio.h            |   12 +-
>  kernel/linux/uapi/version                 |    2 +-
>  lib/eal/freebsd/eal.c                     |   98 +-
>  lib/eal/include/rte_vfio.h                |  387 ++--
>  lib/eal/linux/eal_vfio.c                  | 2437 ++++++++-------------
>  lib/eal/linux/eal_vfio.h                  |  167 +-
>  lib/eal/linux/eal_vfio_cdev.c             |  390 ++++
>  lib/eal/linux/eal_vfio_group.c            |  984 +++++++++
>  lib/eal/linux/eal_vfio_mp_sync.c          |   80 +-
>  lib/eal/linux/meson.build                 |    2 +
>  lib/eal/windows/eal.c                     |    4 +-
>  lib/vhost/vdpa_driver.h                   |    3 -
>  35 files changed, 4248 insertions(+), 1896 deletions(-)
>  create mode 100644 kernel/linux/uapi/linux/iommufd.h
>  create mode 100644 lib/eal/linux/eal_vfio_cdev.c
>  create mode 100644 lib/eal/linux/eal_vfio_group.c
> 

This is a big patchset, so I ran the big AI model over it...

Patch 4 (vfio: add container device assignment API)

Warning: header doc for rte_vfio_container_assign_device() says "<0 on
failure, rte_errno is set", but neither rte_vfio_get_group_num() nor
rte_vfio_container_group_bind() sets rte_errno on the Linux failure
paths at this point in the series. The rte_errno contract only becomes
true after the patch 12 rewrite. Either set rte_errno here or defer the
doc claim to patch 12.
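
As a rough illustration of the first option, here is a standalone sketch of a
wrapper that records an error code itself when a helper fails without setting
one. Everything in it is hypothetical: `sketch_errno` is a plain int standing
in for rte_errno, the two functions are not the DPDK ones, and ENODEV is an
arbitrary choice of code.

```c
#include <errno.h>

/* Hypothetical stand-in for rte_errno (a plain int in this sketch). */
int sketch_errno;

/* Hypothetical helper that fails without recording any error code,
 * like the group lookup at this point in the series. */
static int
lookup_group_sketch(int dev_id, int *group_num)
{
	if (dev_id < 0)
		return -1;
	*group_num = dev_id / 4;	/* arbitrary mapping for the sketch */
	return 0;
}

/* Wrapper whose documented contract is "<0 on failure, error code set".
 * It sets the code itself because the helper does not. */
int
assign_device_sketch(int dev_id)
{
	int group_num;

	if (lookup_group_sketch(dev_id, &group_num) < 0) {
		sketch_errno = ENODEV;
		return -1;
	}
	return 0;
}
```

With this shape the documented contract holds at the patch that introduces it,
independent of any later rewrite of the helpers.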

Patch 5 (net/nbl: do not use VFIO group bind API)

Info: function definition does not follow DPDK style. It should have the
return type on its own line and a blank line between declarations and
statements:

	static int
	nbl_open_group_fd(int iommu_group_num)
	{
		char path[PATH_MAX];

		snprintf(path, sizeof(path), RTE_VFIO_GROUP_FMT, iommu_group_num);
		return open(path, O_RDWR);
	}

Patch 7 (vdpa/ifc: use container device assignment API)

Warning: this patch removes both the "internal->vfio_group_fd = -1"
initialization and the only assignment, but ifcvf_get_vfio_group_fd()
still returns the field until patch 10. Between patches 7 and 10 the
vdpa op returns 0 (zeroed allocation), i.e. a "valid" fd value. Nothing
in lib/vhost calls the op anymore so it is not reachable in practice,
but for bisectability either keep the -1 initialization here or move
patch 10 ahead of patches 7-9.
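
A small standalone sketch of why the -1 initialization matters. The struct and
function names are made up for illustration; the only assumption taken from
the driver is that the private data comes from a zeroed allocation.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the driver's private data. */
struct ifcvf_internal_sketch {
	int vfio_group_fd;
};

/* Returns what a getter would report right after allocation. */
int
group_fd_after_alloc(bool init_to_invalid)
{
	struct ifcvf_internal_sketch *internal;
	int fd;

	internal = calloc(1, sizeof(*internal));	/* all fields zeroed */
	if (internal == NULL)
		abort();
	if (init_to_invalid)
		internal->vfio_group_fd = -1;	/* the initialization being removed */
	fd = internal->vfio_group_fd;
	free(internal);
	return fd;
}
```

Without the initialization the getter reports 0, which a caller checking
"fd >= 0" would treat as a usable descriptor (normally stdin).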

Patch 8 (vdpa/nfp: use container device assignment API)

Warning: same staging issue as patch 7, plus nfp_vdpa_vfio_teardown()
still calls rte_vfio_container_group_unbind(fd, device->iommu_group)
with device->iommu_group now never assigned (always 0 from calloc), so
every teardown between patches 8 and 10 issues an unbind for group 0
that fails silently. The teardown unbind removal currently in patch 10
belongs in this patch (patch 9 does this correctly for sfc, removing
the fields and all uses in one patch).

Patch 12 (vfio: cleanup and refactor) -- partial review

Warning: missing release notes. This patch (together with patches 2, 11,
17, 18) removes the public rte_vfio API, removes the group-bind API, and
changes rte_vfio_setup_device()/rte_vfio_get_group_num() return
semantics. No patch in the series touches the current release notes file;
the entire VFIO API removal and the new cdev mode need entries in
"Removed Items" / "New Features".

Info: rte_errno convention comment at top of eal_vfio.c says "ENOXIO";
the errno is ENXIO (code uses the correct one).

Patch 18 (vfio: introduce cdev mode)

Error: ioas_id is corrupted in secondary processes. struct container
puts vfio_group_config and vfio_cdev_config in a union, and both place
their first member at offset 0 (bool dma_setup_done / uint32_t ioas_id).
In vfio_select_mode(), the secondary path does:

	if (mode == RTE_VFIO_MODE_CDEV && vfio_cdev_sync_ioas(cfg) < 0)
		goto err;

	/* primary handles DMA setup for default containers */
	group_cfg->dma_setup_done = true;

In cdev mode the unconditional dma_setup_done store overwrites the low
byte of the ioas_id just received from the primary. The corrupted id is
then used by VFIO_DEVICE_ATTACH_IOMMUFD_PT and IOMMU_IOAS_MAP/UNMAP in
the secondary. It happens to work only when the low byte of the primary's
IOAS id is 1. The fix is to make the store mode-conditional:

	if (mode == RTE_VFIO_MODE_GROUP || mode == RTE_VFIO_MODE_NOIOMMU)
		group_cfg->dma_setup_done = true;
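
To illustrate the aliasing, here is a minimal standalone sketch. The type and
function names are made up and are not the ones in the patch. It assumes only
what is described above (both configs start at offset 0 of a union), plus a
one-byte bool and a compiler that leaves the other union bytes untouched on
the bool store, which is what common compilers do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-ins for the two per-mode configs. */
struct group_cfg_sketch {
	bool dma_setup_done;	/* first member, offset 0 */
};

struct cdev_cfg_sketch {
	uint32_t ioas_id;	/* first member, offset 0 */
};

union container_cfg_sketch {
	struct group_cfg_sketch group;
	struct cdev_cfg_sketch cdev;
};

enum mode_sketch { MODE_GROUP_SKETCH, MODE_NOIOMMU_SKETCH, MODE_CDEV_SKETCH };

static_assert(sizeof(bool) == 1, "sketch assumes a one-byte bool");

/* Secondary path with the unconditional store: returns the id seen afterwards. */
uint32_t
ioas_after_unconditional_store(uint32_t ioas_from_primary)
{
	union container_cfg_sketch cfg;

	cfg.cdev.ioas_id = ioas_from_primary;
	cfg.group.dma_setup_done = true;	/* overwrites one byte of ioas_id */
	return cfg.cdev.ioas_id;
}

/* Same path with the store made mode-conditional. In the group modes the
 * returned value is meaningless, since ioas_id is not used there. */
uint32_t
ioas_after_conditional_store(enum mode_sketch mode, uint32_t ioas_from_primary)
{
	union container_cfg_sketch cfg;

	cfg.cdev.ioas_id = ioas_from_primary;
	if (mode == MODE_GROUP_SKETCH || mode == MODE_NOIOMMU_SKETCH)
		cfg.group.dma_setup_done = true;
	return cfg.cdev.ioas_id;
}
```

On a little-endian machine, an id of 0x200 received from the primary comes
back as 0x201 after the unconditional store, while 0x101 survives only because
its low byte is already 1. With the conditional store the id is left alone in
cdev mode.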

