DPDK-dev Archive on lore.kernel.org


* Re: [PATCH v2 0/2] ethdev: fix out-of-bounds writes in rte_flow_conv()
From: Stephen Hemminger @ 2026-06-11 18:15 UTC (permalink / raw)
  To: James Raphael Tiovalen; +Cc: dev, orika, thomas, andrew.rybchenko, stable
In-Reply-To: <20260610113334.277895-1-jamestiotio@gmail.com>

On Wed, 10 Jun 2026 19:33:32 +0800
James Raphael Tiovalen <jamestiotio@gmail.com> wrote:

> rte_flow_conv() is documented to truncate output to the caller-supplied
> buffer size, but two paths handling variable-length trailing data
> ignored that contract and copied the full payload whenever the
> destination pointer was non-NULL. A caller passing a buffer just large
> enough for the fixed-size header had adjacent memory clobbered:
> 
> - GENEVE_OPT: up to option_len * 4 bytes
> - FLEX: up to 4 GiB, since src->length is a uint32_t and the API places
>   no bounds on it
> 
> Patch 1 aligns the GENEVE_OPT guard with the sibling RAW branch, which
> already gates its copy on the remaining buffer size.
> 
> Patch 2 plumbs the remaining buffer size into the flex-item desc_fn
> callback (which previously took no size argument at all) and gates the
> inner rte_memcpy() on it.
> 
> v2 fixes the merge conflict between patch 1 and the main branch.
> 
> James Raphael Tiovalen (2):
>   ethdev: fix out-of-bounds write in GENEVE option conversion
>   ethdev: fix out-of-bounds write in flex item conversion
> 
>  lib/ethdev/rte_flow.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 

Applied to next-net, and added you to .mailmap
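
The truncation contract described in the cover letter can be illustrated with a minimal sketch. This is not the rte_flow.c code: `blob_hdr` and `blob_conv()` are hypothetical names invented for illustration. The only point shown is that the trailing copy is gated on the remaining buffer size rather than on the destination pointer alone, while the return value still reports the size a full conversion would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size header followed by variable-length trailing data. */
struct blob_hdr {
	uint32_t length; /* number of payload bytes that follow the header */
};

/*
 * Convert header + payload into dst, writing at most size bytes.
 * Returns the number of bytes a complete conversion needs, so the caller
 * can detect truncation by comparing the result with size.
 */
static size_t
blob_conv(void *dst, size_t size, const struct blob_hdr *hdr,
	  const void *payload)
{
	size_t off = sizeof(*hdr);

	assert(hdr != NULL);

	/* Fixed-size part: copy only if it fits. */
	if (dst != NULL && size >= off)
		memcpy(dst, hdr, off);

	/* Trailing part: gate on the remaining space, not just on dst != NULL. */
	if (dst != NULL && payload != NULL && size >= off &&
	    size - off >= hdr->length)
		memcpy((uint8_t *)dst + off, payload, hdr->length);

	return off + hdr->length;
}
```

A caller that passes a buffer sized for the header alone gets the header back, sees a return value larger than the buffer, and finds the bytes after the header untouched.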


* Re: [PATCH v1 1/1] net/nbl: fix multicast reception in promiscuous mode
From: Stephen Hemminger @ 2026-06-11 18:04 UTC (permalink / raw)
  To: Dimon Zhao; +Cc: dev, stable, Leon Yu, Sam Chen
In-Reply-To: <20260609075143.32695-2-dimon.zhao@nebula-matrix.com>

On Tue,  9 Jun 2026 00:51:43 -0700
Dimon Zhao <dimon.zhao@nebula-matrix.com> wrote:

> When promiscuous mode is enabled on NBL PMD,
> the hardware does not forward multicast frames to the host,
> causing the driver to fail receiving multicast packets.
> This patch fixes the issue.
> 
> Fixes: 80bd3cad22c8 ("net/nbl: support promiscuous mode")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Dimon Zhao <dimon.zhao@nebula-matrix.com>
> ---
Applied to next-net


* Re: [PATCH v8 00/18] Support VFIO cdev API in DPDK
From: Stephen Hemminger @ 2026-06-11 17:49 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev
In-Reply-To: <cover.1781190151.git.anatoly.burakov@intel.com>

On Thu, 11 Jun 2026 16:08:52 +0100
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> This patchset introduces a major refactor of the VFIO subsystem in DPDK to
> support the character device (cdev) interface introduced in the Linux kernel,
> as well as to make the API more streamlined and useful. The goal is to simplify
> device management, improve compatibility, and clarify API responsibilities.
> 
> The following sections outline the key issues addressed by this patchset and the
> corresponding changes introduced.
> 
> 1. Only group mode is supported
> ===============================
> 
> Since kernel version 4.14.327 (LTS), VFIO supports the new character device
> (cdev)-based way of working with VFIO devices (otherwise known as IOMMUFD). This
> is a device-centric mode and does away with all the complexity regarding groups
> and IOMMU types, delegating it all to the kernel, and exposes a much simpler
> interface to userspace.
> 
> The old group interface is still around, and will need to be kept in DPDK both
> for compatibility reasons, as well as supporting special cases (FSLMC bus, NBL
> driver, no-IOMMU mode etc.).
> 
> To enable this, VFIO is heavily refactored, so that the code can support both
> modes while relying on (mostly) common infrastructure.
> 
> Note that the existing `rte_vfio_device_setup/release` model is fundamentally
> incompatible with cdev mode, because for custom container cases, the expected
> flow is that the user binds the IOMMU group (and thus, implicitly, the device
> itself) to a specific container using `rte_vfio_container_group_bind`, whereas
> this step is not needed for cdev as the device fd is assigned to the container
> straight away.
> 
> Therefore, what we do instead is introduce a new API for container device
> assignment which, semantically, will assign a device to specified container, so
> that when it is mapped using `rte_pci_map_device`, the appropriate container is
> selected. Under the hood though, we essentially transition to getting device fd
> straight away at assign stage, so that by the time the PCI bus attempts to map
> the device, it is already mapped and we just return an fd. There is no
> "unassign" API because `release_device` already performs that function.
> 
> Additionally, a new `rte_vfio_get_mode` API is added for those cases that need
> some introspection into VFIO's internals, with three new modes: group
> (old-style), no-iommu (old-style but without IOMMU), and cdev (the new mode).
> Although no-IOMMU is technically a variant of group mode, the distinction is
> largely irrelevant to the user, as all usages of noiommu checks in our codebase
> are for deciding whether to use IOVA or PA, not anything to do with managing
> groups. The current plan for kernel community is to *not* introduce no-IOMMU
> cdev implementation, and IOMMUFD's own group API compatibility layer also does
> not implement no-IOMMU mode, which is why this will be kept for compatibility
> for these use cases.
> 
> There were other users of VFIO which relied on group API but only for convenience
> purposes; no actual VFIO functionality depended on those API's. Therefore, group
> API's are removed and, where appropriate, replaced with the new API's.
> 
> List of removed API's:
> 
> * `rte_vfio_get_group_fd`
> * `rte_vfio_clear_group`
> * `rte_vfio_container_group_bind` (replaced by container assign API)
> * `rte_vfio_container_group_unbind`
> * `rte_vfio_noiommu_is_enabled` (replaced by new mode API)
> 
> 2. The API responsibilities aren't clear and bleed into each other
> ==================================================================
> 
> Some API's do multiple things at once. In particular:
> 
> * `rte_vfio_get_device_info` will setup the device
> * `rte_vfio_setup_device` will get device info
> 
> These API's have been adjusted to do one thing only.
> 
> v8:
> - Rebase
> - Fixed build errors due to variable shadowing
> - Removed duplicate fd check as kernel does not provide a way to distinguish
>   between device fd's
> 
> v7:
> - Rebase
> - Added removal of deprecation notices
> - Fixed implicit numeric comparison in patch 12
> 
> v6:
> - Fixed missing header include in vfio cdev file
> 
> v5:
> - Added back missing uapi patch
> 
> v4:
> - Fixed issues with documenting rte_vfio_mode enum
> - Separated deprecation notices into a separate patchset
> 
> v3:
> - Make API removal cleaner
> - Fix `get_group_num` usages to align with new API
> - Fix issues with function exports
> - Fix issues with `setup_device` returning old-style values in some cases
> 
> v2:
> - Make the entire API internal
> - More aggressive API pruning, complete removal of group API
> - Fixed a bug in group mode where device could not be used
> - Better documentation and deprecation notice patches
> - Moved doc patches to beginning of patchset
> 
> Anatoly Burakov (18):
>   uapi: update to v6.17 and add iommufd.h
>   vfio: make all functions internal
>   vfio: split get device info from setup
>   vfio: add container device assignment API
>   net/nbl: do not use VFIO group bind API
>   net/ntnic: use container device assignment API
>   vdpa/ifc: use container device assignment API
>   vdpa/nfp: use container device assignment API
>   vdpa/sfc: use container device assignment API
>   vhost: remove group-related API from drivers
>   vfio: remove group-based API
>   vfio: cleanup and refactor
>   bus/pci: use the new VFIO mode API
>   bus/fslmc: use the new VFIO mode API
>   net/hinic3: use the new VFIO mode API
>   net/ntnic: use the new VFIO mode API
>   vfio: remove no-IOMMU check API
>   vfio: introduce cdev mode
> 
>  config/arm/meson.build                    |    1 +
>  config/meson.build                        |    1 +
>  doc/guides/prog_guide/vhost_lib.rst       |    4 -
>  doc/guides/rel_notes/deprecation.rst      |   10 -
>  drivers/bus/cdx/cdx_vfio.c                |   25 +-
>  drivers/bus/fslmc/fslmc_bus.c             |   10 +-
>  drivers/bus/fslmc/fslmc_vfio.c            |    6 +-
>  drivers/bus/pci/linux/pci.c               |    2 +-
>  drivers/bus/pci/linux/pci_vfio.c          |   33 +-
>  drivers/bus/platform/platform.c           |    9 +-
>  drivers/crypto/bcmfs/bcmfs_vfio.c         |   14 +-
>  drivers/net/hinic3/base/hinic3_hwdev.c    |    3 +-
>  drivers/net/nbl/nbl_common/nbl_userdev.c  |   20 +-
>  drivers/net/nbl/nbl_include/nbl_include.h |    1 +
>  drivers/net/ntnic/ntnic_ethdev.c          |    2 +-
>  drivers/net/ntnic/ntnic_vfio.c            |   30 +-
>  drivers/vdpa/ifc/ifcvf_vdpa.c             |   34 +-
>  drivers/vdpa/mlx5/mlx5_vdpa.c             |    1 -
>  drivers/vdpa/nfp/nfp_vdpa.c               |   37 +-
>  drivers/vdpa/sfc/sfc_vdpa.c               |   39 +-
>  drivers/vdpa/sfc/sfc_vdpa.h               |    2 -
>  kernel/linux/uapi/linux/iommufd.h         | 1292 +++++++++++
>  kernel/linux/uapi/linux/vduse.h           |    2 +-
>  kernel/linux/uapi/linux/vfio.h            |   12 +-
>  kernel/linux/uapi/version                 |    2 +-
>  lib/eal/freebsd/eal.c                     |   98 +-
>  lib/eal/include/rte_vfio.h                |  387 ++--
>  lib/eal/linux/eal_vfio.c                  | 2437 ++++++++-------------
>  lib/eal/linux/eal_vfio.h                  |  167 +-
>  lib/eal/linux/eal_vfio_cdev.c             |  390 ++++
>  lib/eal/linux/eal_vfio_group.c            |  984 +++++++++
>  lib/eal/linux/eal_vfio_mp_sync.c          |   80 +-
>  lib/eal/linux/meson.build                 |    2 +
>  lib/eal/windows/eal.c                     |    4 +-
>  lib/vhost/vdpa_driver.h                   |    3 -
>  35 files changed, 4248 insertions(+), 1896 deletions(-)
>  create mode 100644 kernel/linux/uapi/linux/iommufd.h
>  create mode 100644 lib/eal/linux/eal_vfio_cdev.c
>  create mode 100644 lib/eal/linux/eal_vfio_group.c
> 

Big patchset, so I sent the big AI model at it...

Patch 4 (vfio: add container device assignment API)

Warning: header doc for rte_vfio_container_assign_device() says "<0 on
failure, rte_errno is set", but neither rte_vfio_get_group_num() nor
rte_vfio_container_group_bind() sets rte_errno on the Linux failure
paths at this point in the series. The rte_errno contract only becomes
true after the patch 12 rewrite. Either set rte_errno here or defer the
doc claim to patch 12.

Patch 5 (net/nbl: do not use VFIO group bind API)

Info: function definition does not follow DPDK style (return type on
its own line, blank line between declarations and statements). The
conforming form would be:

	static int
	nbl_open_group_fd(int iommu_group_num)
	{
		char path[PATH_MAX];

		snprintf(path, sizeof(path), RTE_VFIO_GROUP_FMT, iommu_group_num);
		return open(path, O_RDWR);
	}

Patch 7 (vdpa/ifc: use container device assignment API)

Warning: this patch removes both the "internal->vfio_group_fd = -1"
initialization and the only assignment, but ifcvf_get_vfio_group_fd()
still returns the field until patch 10. Between patches 7 and 10 the
vdpa op returns 0 (zeroed allocation), i.e. a "valid" fd value. Nothing
in lib/vhost calls the op anymore so it is not reachable in practice,
but for bisectability either keep the -1 initialization here or move
patch 10 ahead of patches 7-9.
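
The hazard behind this warning is general: a zeroed allocation leaves an int fd field at 0, which is a valid descriptor number, so any "fd >= 0" check passes. A minimal sketch, with a hypothetical `dev_state` structure standing in for the driver's private data (none of these names come from the ifc driver):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical driver-private state; only the fd field matters here. */
struct dev_state {
	int group_fd;
};

/* Zeroed allocation alone: group_fd reads back as 0, a valid fd number. */
static struct dev_state *
dev_state_alloc_zeroed(void)
{
	return calloc(1, sizeof(struct dev_state));
}

/* Same allocation with the explicit "no fd" sentinel kept in place. */
static struct dev_state *
dev_state_alloc(void)
{
	struct dev_state *st = calloc(1, sizeof(*st));

	if (st != NULL)
		st->group_fd = -1;
	return st;
}

/* The kind of check a consumer of the fd would typically make. */
static int
dev_state_has_fd(const struct dev_state *st)
{
	assert(st != NULL);
	return st->group_fd >= 0;
}
```

With the sentinel dropped, `dev_state_has_fd()` reports a usable descriptor even though nothing was ever opened, which is why keeping the -1 initialization matters for bisectability.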

Patch 8 (vdpa/nfp: use container device assignment API)

Warning: same staging issue as patch 7, plus nfp_vdpa_vfio_teardown()
still calls rte_vfio_container_group_unbind(fd, device->iommu_group)
with device->iommu_group now never assigned (always 0 from calloc), so
every teardown between patches 8 and 10 issues an unbind for group 0
that fails silently. The teardown unbind removal currently in patch 10
belongs in this patch (patch 9 does this correctly for sfc, removing
the fields and all uses in one patch).

Patch 12 (vfio: cleanup and refactor) -- partial review

Warning: missing release notes. This patch (together with patches 2, 11,
17, 18) removes the public rte_vfio API, removes the group-bind API, and
changes rte_vfio_setup_device()/rte_vfio_get_group_num() return
semantics. None of the series touches the current release notes file;
the entire VFIO API removal and the new cdev mode need entries in
"Removed Items" / "New Features".

Info: rte_errno convention comment at top of eal_vfio.c says "ENOXIO";
the errno is ENXIO (code uses the correct one).

Patch 18 (vfio: introduce cdev mode)

Error: ioas_id is corrupted in secondary processes. struct container
puts vfio_group_config and vfio_cdev_config in a union, and both place
their first member at offset 0 (bool dma_setup_done / uint32_t ioas_id).
In vfio_select_mode(), the secondary path does:

	if (mode == RTE_VFIO_MODE_CDEV && vfio_cdev_sync_ioas(cfg) < 0)
		goto err;

	/* primary handles DMA setup for default containers */
	group_cfg->dma_setup_done = true;

In cdev mode the unconditional dma_setup_done store overwrites the low
byte of the ioas_id just received from the primary. The corrupted id is
then used by VFIO_DEVICE_ATTACH_IOMMUFD_PT and IOMMU_IOAS_MAP/UNMAP in
the secondary. It happens to work only when the primary's IOAS id has
low byte 1. Fix is to make the store mode-conditional:

	if (mode == RTE_VFIO_MODE_GROUP || mode == RTE_VFIO_MODE_NOIOMMU)
		group_cfg->dma_setup_done = true;


* Re: [PATCH 9/9] net/dpaa2: drop the fake software VLAN strip offload
From: Stephen Hemminger @ 2026-06-11 17:30 UTC (permalink / raw)
  To: Maxime Leroy; +Cc: hemant.agrawal, sachin.saxena, dev
In-Reply-To: <20260611154926.392670-10-maxime@leroys.fr>

On Thu, 11 Jun 2026 17:49:24 +0200
Maxime Leroy <maxime@leroys.fr> wrote:

> It saves a forwarding application nothing: the datapath reads the L2
> header anyway to classify or strip. The offload does not remove that
> read, it relocates it into the driver Rx burst, where it is far more
> expensive.
> 
> The cost is a matter of timing. rte_vlan_strip() reaches the L2 header
> through rte_pktmbuf_mtod(), which dereferences mbuf->buf_addr. On a
> freshly recycled buffer that mbuf cacheline is cold. eth_fd_to_mbuf()
> has just written other fields of it (data_off, ol_flags), but buf_addr
> is a persistent field it does not rewrite. A write does not stall: it
> posts to the store buffer while the line fills in the background, and
> the rewritten fields are forwarded straight from there. buf_addr has
> nothing to forward, so it must be read from the line, whose fill is
> still in flight, and the read stalls. The ethertype read that follows,
> on the cold payload line, stalls again. Read later by the application,
> when the fill has completed, the same read hits. The offload just
> performs it at the worst possible moment.
> 
> Measured on a single-core port-to-port forwarding test over two 10G
> ports (one core at 2 GHz, 64-byte untagged frames):
> 
>   - throughput 4.22 -> 5.00 Mpps (+18 percent)
>   - IPC 0.93 -> 1.25: the cost was memory stall, not compute
>   - L3/DRAM-bound L2 refills 319M -> 200M over 10s (-37 percent)
> 
> perf confirms it: with the offload, the buf_addr load (the cold mbuf
> field) and the payload load account for about 84 percent of the Rx
> burst's L2 refills; removing it, those vanish and only the inherent DQRR
> dequeue misses remain.
> 
> Stop advertising VLAN_STRIP and remove the rte_vlan_strip() calls from
> every Rx path. This is a behavioural change: the tag is left in the
> frame, so an application must strip it itself, on the L2 header it
> already reads.
> 
> Signed-off-by: Maxime Leroy <maxime@leroys.fr>
> ---

In general I agree, but you overstate the impact. Any real application
is going to look at the mbuf anyway. Relying on testpmd numbers is BS.

The NBL driver does the same thing.
So does PCAP but it has no choice, and is slow anyway.
Virtio/vhost does as well.






* Re: [PATCH 01/17] net/cnxk: update mbuf next field for multi segment
From: Stephen Hemminger @ 2026-06-11 17:23 UTC (permalink / raw)
  To: Rahul Bhansali
  Cc: dev, Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori,
	Satha Rao, Harman Kalra, jerinj
In-Reply-To: <20260611073311.3129711-1-rbhansali@marvell.com>

On Thu, 11 Jun 2026 13:02:55 +0530
Rahul Bhansali <rbhansali@marvell.com> wrote:

> As per the requirement of rte_mbuf_raw_reset_bulk(), the mbuf's
> 'next' and 'nb_segs' fields are required to be reset.
> This resets these fields for multi-segment mbufs on the cn9k platform.
> 
> Signed-off-by: Rahul Bhansali <rbhansali@marvell.com>
> ---

Please follow the DPDK code submission guidelines, and use a cover
letter and threading of replies.
https://doc.dpdk.org/guides/contributing/patches.html#sending-patches

What needs fixing:
  - Allow at least 24 hours to pass between posting patch revisions.
  - Add a cover letter to explain the patchset.
  - Use versions and in-reply-to. This keeps mail threads organized and
    helps maintainers track in patchwork as well.



* Re: [PATCH] net/crc: add 4x folding loop for x86 SSE implementation
From: Stephen Hemminger @ 2026-06-11 17:06 UTC (permalink / raw)
  To: Shreesh Adiga; +Cc: Jasvinder Singh, Bruce Richardson, Konstantin Ananyev, dev
In-Reply-To: <20260609075712.247286-1-16567adigashreesh@gmail.com>

On Tue,  9 Jun 2026 13:27:12 +0530
Shreesh Adiga <16567adigashreesh@gmail.com> wrote:

> Add a 64-byte loop that maintains 4 fold registers and processes
> 64 bytes at a time. The 4 fold registers are then reduced to a single
> 16-byte fold, similar to the AVX512 implementation. This technique is
> described in the paper by Intel:
> "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
> 
> This results in roughly 50% performance improvement due to better ILP
> for large input sizes like 1024.
> 
> Signed-off-by: Shreesh Adiga <16567adigashreesh@gmail.com>
> ---

Looks good, applied to next-net.

A couple of nits from a more detailed AI review that you still might want to look at:

The current crc_autotest does not exercise the new 64-byte CRC16 path.
Its CRC32 vectors are 1512 and 348 bytes, so the CRC32 4x loop is
covered, but the largest CRC16 vector is 32 bytes (all three CRC16
tests are ≤32 bytes). So the new CRC16 rk1_rk2 (64-byte fold) constants ship
untested in CI. My exhaustive test confirms they're correct, but a
future regression there wouldn't be caught. Suggest adding a CRC16
vector ≥64 bytes, ideally a non-multiple of 64 (e.g. 80 or 100) so it
hits the 4x loop, the single-fold tail, and the partial-bytes path
together.

In partial_bytes the comment /* k = rk1 & rk2 */ is now stale:
after the patch k holds rk3_rk4 on every path reaching it.
Not introduced by this patch, but the patch is what made it wrong;
worth fixing in passing.
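
One way to produce an expected value for a longer CRC16 vector is a bit-at-a-time reference that shares no code with the folding path. The sketch below is not taken from DPDK. It assumes the CRC16 under test uses the reflected CCITT polynomial with a 0xFFFF initial value and a final inversion (the parameter set commonly called CRC-16/X-25); that assumption should be checked against the library before relying on the values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time CRC-16: reflected CCITT polynomial (0x8408), initial value
 * 0xFFFF, final inversion. Assumed parameter set, see the note above.
 */
static uint16_t
crc16_ref(const uint8_t *data, size_t len)
{
	uint16_t crc = 0xFFFF;
	size_t i;
	int bit;

	assert(data != NULL || len == 0);

	for (i = 0; i < len; i++) {
		crc ^= data[i];
		for (bit = 0; bit < 8; bit++) {
			if (crc & 1)
				crc = (uint16_t)((crc >> 1) ^ 0x8408);
			else
				crc = (uint16_t)(crc >> 1);
		}
	}
	return (uint16_t)~crc;
}
```

Feeding a 100-byte buffer through `crc16_ref()` and through the vectorized implementation, then comparing the two results, would cover the 64-byte loop with an independent oracle.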


* Re: [PATCH 9/9] net/dpaa2: drop the fake software VLAN strip offload
From: Maxime Leroy @ 2026-06-11 16:58 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Hemant Agrawal, Sachin Saxena, dev
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F65908@smartserver.smartshare.dk>


On Thu, 11 Jun 2026 at 17:56, Morten Brørup <mb@smartsharesystems.com> wrote:

> This patch is unrelated to the series.

Splitting this would create an ordering problem. If the NAPI series is
merged with a software VLAN strip implementation and the cleanup removing
the fake VLAN_STRIP offload is merged separately, the two can land in
either order and leave the PMD with inconsistent Rx paths.

The new NAPI/DQRR path must match the offloads reported by the PMD at the
end of the series. Since VLAN_STRIP is not a real dpaa2 hardware offload,
this series removes the advertised offload and the software
rte_vlan_strip() calls together, so all Rx paths remain consistent at each
merge point.



* Re: [PATCH v1 0/6] net/r8169: hardware updates, optimizations, and a bug fix
From: Stephen Hemminger @ 2026-06-11 16:46 UTC (permalink / raw)
  To: Howard Wang; +Cc: dev, pro_nic_dpdk
In-Reply-To: <20260611083521.20669-1-howard_wang@realsil.com.cn>

On Thu, 11 Jun 2026 16:28:27 +0800
Howard Wang <howard_wang@realsil.com.cn> wrote:

> This patch series primarily focuses on updating hardware configurations, 
> optimizing the datapath, and refining device behaviors for the net/r8169 PMD. 
> Additionally, it includes one bug fix for a segmentation fault encountered 
> during initialization.
> 
> Summary of the series:
> 
>   - Patch 1: Updates RX CRC drop behavior for RTL8125BP and later MAC versions
>     to align with device shutdown sequences and prevent cross-driver states.
>   - Patch 2: Optimizes the Tx datapath performance by removing redundant branch
>     checks for malformed packets, replacing them with RTE_ASSERT.
>   - Patch 3: Enhances RTL8125+ flow control by utilizing a new formula for 
>     nearfull and nearempty thresholds.
>   - Patch 4: Removes RTL9151 CSI (DBI) channel support, as firmware handling 
>     latency makes it no longer suitable for the driver.
>   - Patch 5: Updates PHY and MAC MCU configurations for RTL9151A and RTL8125BP.
>   - Patch 6: Fixes a segmentation fault during RTL8168 initialization by 
>     restricting RTL8125-specific RSS/VMQ configurations to the correct hardware.
> 
> Howard Wang (6):
>   net/r8169: disable RX CRC drop for RTL8125BP and later
>   net/r8169: optimize Tx datapath by removing redundant packet checks
>   net/r8169: improve RTL8125+ flow control
>   net/r8169: remove RTL9151 CSI (DBI) channel support
>   net/r8169: update hardware configurations for 8125
>   net/r8169: fix segmentation fault during RTL8168 initialization
> 
>  drivers/net/r8169/base/rtl8125bp_mcu.c | 15 ++--
>  drivers/net/r8169/base/rtl9151a.c      |  8 +++
>  drivers/net/r8169/base/rtl9151a_mcu.c  | 14 +++-
>  drivers/net/r8169/r8169_compat.h       |  1 +
>  drivers/net/r8169/r8169_hw.c           | 98 ++++++++++++++++++++++++--
>  drivers/net/r8169/r8169_hw.h           |  2 +-
>  drivers/net/r8169/r8169_rxtx.c         | 32 ++++-----
>  7 files changed, 137 insertions(+), 33 deletions(-)
> 

Looks good; the CI AI review complaints are noise and I will ignore those.
Applied to next-net



* RE: [PATCH 9/9] net/dpaa2: drop the fake software VLAN strip offload
From: Morten Brørup @ 2026-06-11 16:13 UTC (permalink / raw)
  To: Maxime Leroy, hemant.agrawal, sachin.saxena; +Cc: dev
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F65908@smartserver.smartshare.dk>

> This patch is unrelated to the series.
And also,
Acked-by: Morten Brørup <mb@smartsharesystems.com>

We should take note of this for other drivers!



* RE: [PATCH 8/9] ethdev: keep fast-path ops valid after port stop
From: Morten Brørup @ 2026-06-11 16:01 UTC (permalink / raw)
  To: Maxime Leroy, hemant.agrawal, sachin.saxena
  Cc: dev, stable, Thomas Monjalon, Andrew Rybchenko, Sunil Kumar Kori
In-Reply-To: <20260611154926.392670-9-maxime@leroys.fr>

> From: Maxime Leroy [mailto:maxime.leroys@gmail.com] On Behalf Of Maxime Leroy
> Sent: Thursday, 11 June 2026 17.49
> 
> eth_dev_fp_ops_reset() restores a port's fast-path ops on stop/release
> via a compound literal, so every field it omits is zeroed to NULL. It
> sets only rx_pkt_burst/tx_pkt_burst (and the rxq/txq data), leaving
> rx_queue_count, tx_queue_count, rx/tx_descriptor_status, tx_pkt_prepare
> and the recycle callbacks NULL.
> 
> In non-debug builds these ops are reached through an unguarded indirect
> call (the NULL check exists only under RTE_ETHDEV_DEBUG_RX/TX). So a
> thread calling e.g. rte_eth_rx_queue_count() on a port being stopped
> dereferences NULL and crashes, while the same race on rte_eth_rx_burst()
> is harmless because the burst ops are reset to dummies. A poll-mode
> worker re-checking rx_queue_count before arming the Rx interrupt and
> sleeping hits exactly this.
> 
> Reset these ops to the same dummies eth_dev_set_dummy_fops() installs,
> so a stopped port behaves like a freshly allocated one: every fast-path
> op is a safe no-op, none is NULL.
> 
> Fixes: 066f3d9cc21c ("ethdev: remove callback checks from fast path")
> Cc: stable@dpdk.org
> Signed-off-by: Maxime Leroy <maxime@leroys.fr>
> ---

Good catch.
Acked-by: Morten Brørup <mb@smartsharesystems.com>

Not related to the series; consider sending it as a separate patch.



* RE: [PATCH 9/9] net/dpaa2: drop the fake software VLAN strip offload
From: Morten Brørup @ 2026-06-11 15:56 UTC (permalink / raw)
  To: Maxime Leroy, hemant.agrawal, sachin.saxena; +Cc: dev
In-Reply-To: <20260611154926.392670-10-maxime@leroys.fr>

This patch is unrelated to the series.



* [PATCH 9/9] net/dpaa2: drop the fake software VLAN strip offload
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

RTE_ETH_RX_OFFLOAD_VLAN_STRIP is advertised, but no hardware VLAN strip
backs it: when enabled, the Rx burst calls rte_vlan_strip() on every
frame, a software op masquerading as a hardware offload.

It saves a forwarding application nothing: the datapath reads the L2
header anyway to classify or strip. The offload does not remove that
read, it relocates it into the driver Rx burst, where it is far more
expensive.

The cost is a matter of timing. rte_vlan_strip() reaches the L2 header
through rte_pktmbuf_mtod(), which dereferences mbuf->buf_addr. On a
freshly recycled buffer that mbuf cacheline is cold. eth_fd_to_mbuf()
has just written other fields of it (data_off, ol_flags), but buf_addr
is a persistent field it does not rewrite. A write does not stall: it
posts to the store buffer while the line fills in the background, and
the rewritten fields are forwarded straight from there. buf_addr has
nothing to forward, so it must be read from the line, whose fill is
still in flight, and the read stalls. The ethertype read that follows,
on the cold payload line, stalls again. Read later by the application,
when the fill has completed, the same read hits. The offload just
performs it at the worst possible moment.

Measured on a single-core port-to-port forwarding test over two 10G
ports (one core at 2 GHz, 64-byte untagged frames):

  - throughput 4.22 -> 5.00 Mpps (+18 percent)
  - IPC 0.93 -> 1.25: the cost was memory stall, not compute
  - L3/DRAM-bound L2 refills 319M -> 200M over 10s (-37 percent)

perf confirms it: with the offload, the buf_addr load (the cold mbuf
field) and the payload load account for about 84 percent of the Rx
burst's L2 refills; removing it, those vanish and only the inherent DQRR
dequeue misses remain.

Stop advertising VLAN_STRIP and remove the rte_vlan_strip() calls from
every Rx path. This is a behavioural change: the tag is left in the
frame, so an application must strip it itself, on the L2 header it
already reads.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 doc/guides/rel_notes/release_26_07.rst |  3 +++
 drivers/net/dpaa2/dpaa2_ethdev.c       |  1 -
 drivers/net/dpaa2/dpaa2_rxtx.c         | 23 +++--------------------
 3 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 87c7c57bcc..9d01099dad 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -130,6 +130,9 @@ New Features

   * Added RSS RETA query and update support.
   * Added Rx queue interrupt support.
+  * Removed the software VLAN strip offload: ``RTE_ETH_RX_OFFLOAD_VLAN_STRIP``
+    is no longer advertised, as no hardware strip backs it. An application
+    that needs the tag removed must now strip it itself.

 * **Updated PCAP ethernet driver.**

diff --git a/drivers/net/dpaa2/dpaa2_ethdev.c b/drivers/net/dpaa2/dpaa2_ethdev.c
index fb117e761f..b3ea826db9 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.c
+++ b/drivers/net/dpaa2/dpaa2_ethdev.c
@@ -48,7 +48,6 @@ static uint64_t dev_rx_offloads_sup =
 		RTE_ETH_RX_OFFLOAD_SCTP_CKSUM |
 		RTE_ETH_RX_OFFLOAD_OUTER_IPV4_CKSUM |
 		RTE_ETH_RX_OFFLOAD_OUTER_UDP_CKSUM |
-		RTE_ETH_RX_OFFLOAD_VLAN_STRIP |
 		RTE_ETH_RX_OFFLOAD_VLAN_FILTER |
 		RTE_ETH_RX_OFFLOAD_TIMESTAMP;

diff --git a/drivers/net/dpaa2/dpaa2_rxtx.c b/drivers/net/dpaa2/dpaa2_rxtx.c
index 189accc1de..d16e4f8f35 100644
--- a/drivers/net/dpaa2/dpaa2_rxtx.c
+++ b/drivers/net/dpaa2/dpaa2_rxtx.c
@@ -890,10 +890,6 @@ dpaa2_dev_prefetch_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		}
 #endif

-		if (eth_data->dev_conf.rxmode.offloads &
-				RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
-			rte_vlan_strip(bufs[num_rx]);
-
 		dq_storage++;
 		num_rx++;
 	} while (pending);
@@ -922,22 +918,14 @@ dpaa2_dev_prefetch_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return num_rx;
 }

-/* Convert a DQRR'd FD (single or scatter-gather) to an mbuf and apply software
- * VLAN strip, like the poll path.
- */
+/* Convert a DQRR'd FD (single or scatter-gather) to an mbuf. */
 static inline struct rte_mbuf *
 dpaa2_dqrr_fd_to_mbuf(const struct qbman_fd *fd,
 		      struct rte_eth_dev_data *eth_data)
 {
-	struct rte_mbuf *m;
-
 	if (unlikely(DPAA2_FD_GET_FORMAT(fd) == qbman_fd_sg))
-		m = eth_sg_fd_to_mbuf(fd, eth_data->port_id);
-	else
-		m = eth_fd_to_mbuf(fd, eth_data->port_id);
-	if (eth_data->dev_conf.rxmode.offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
-		rte_vlan_strip(m);
-	return m;
+		return eth_sg_fd_to_mbuf(fd, eth_data->port_id);
+	return eth_fd_to_mbuf(fd, eth_data->port_id);
 }

 /* prefetch a DQRR'd FD's HW annotation (parse area) ahead of conversion */
@@ -1222,11 +1210,6 @@ dpaa2_dev_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		}
 #endif

-		if (eth_data->dev_conf.rxmode.offloads &
-				RTE_ETH_RX_OFFLOAD_VLAN_STRIP) {
-			rte_vlan_strip(bufs[num_rx]);
-		}
-
 			dq_storage++;
 			num_rx++;
 			num_pulled++;
-- 
2.43.0
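
Since the commit message moves tag removal into the application, a rough picture of what that involves may help. The sketch below works on a plain byte buffer rather than an rte_mbuf and is not the rte_vlan_strip() implementation; `app_vlan_strip()` and the macro names are hypothetical. It checks the EtherType the application already reads and, if an 802.1Q tag is present, slides the two MAC addresses forward over it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDRS_LEN  12u     /* destination + source MAC */
#define VLAN_TAG_LEN   4u      /* TPID + TCI */
#define ETHERTYPE_VLAN 0x8100u /* IEEE 802.1Q TPID */

/*
 * If the frame carries an 802.1Q tag, store its TCI in *tci, remove the tag
 * by moving the MAC addresses 4 bytes forward, shrink *len and return the
 * new start of the frame. Otherwise return frame unchanged.
 */
static uint8_t *
app_vlan_strip(uint8_t *frame, size_t *len, uint16_t *tci)
{
	assert(frame != NULL && len != NULL && tci != NULL);

	/* A tagged header needs both MACs, the tag and the inner EtherType. */
	if (*len < ETH_ADDRS_LEN + VLAN_TAG_LEN + 2)
		return frame;
	if ((((unsigned int)frame[12] << 8) | frame[13]) != ETHERTYPE_VLAN)
		return frame;

	*tci = (uint16_t)(((unsigned int)frame[14] << 8) | frame[15]);
	memmove(frame + VLAN_TAG_LEN, frame, ETH_ADDRS_LEN);
	*len -= VLAN_TAG_LEN;
	return frame + VLAN_TAG_LEN;
}
```

On an mbuf the same effect would come from advancing the data offset instead of returning a new pointer. The point here is only that the check rides on a header read the forwarding path performs anyway.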


* [PATCH 8/9] ethdev: keep fast-path ops valid after port stop
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena
  Cc: dev, Maxime Leroy, stable, Thomas Monjalon, Andrew Rybchenko,
	Morten Brørup, Sunil Kumar Kori
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

eth_dev_fp_ops_reset() restores a port's fast-path ops on stop/release
via a compound literal, so every field it omits is zeroed to NULL. It
sets only rx_pkt_burst/tx_pkt_burst (and the rxq/txq data), leaving
rx_queue_count, tx_queue_count, rx/tx_descriptor_status, tx_pkt_prepare
and the recycle callbacks NULL.

In non-debug builds these ops are reached through an unguarded indirect
call (the NULL check exists only under RTE_ETHDEV_DEBUG_RX/TX). So a
thread calling e.g. rte_eth_rx_queue_count() on a port being stopped
dereferences NULL and crashes, while the same race on rte_eth_rx_burst()
is harmless because the burst ops are reset to dummies. A poll-mode
worker re-checking rx_queue_count before arming the Rx interrupt and
sleeping hits exactly this.

Reset these ops to the same dummies eth_dev_set_dummy_fops() installs,
so a stopped port behaves like a freshly allocated one: every fast-path
op is a safe no-op, none is NULL.

Fixes: 066f3d9cc21c ("ethdev: remove callback checks from fast path")
Cc: stable@dpdk.org
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 lib/ethdev/ethdev_private.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 72a0723846..75ea3eedff 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -263,6 +263,13 @@ eth_dev_fp_ops_reset(struct rte_eth_fp_ops *fpo)
 	*fpo = (struct rte_eth_fp_ops) {
 		.rx_pkt_burst = dummy_eth_rx_burst,
 		.tx_pkt_burst = dummy_eth_tx_burst,
+		.tx_pkt_prepare = rte_eth_tx_pkt_prepare_dummy,
+		.rx_queue_count = rte_eth_queue_count_dummy,
+		.tx_queue_count = rte_eth_queue_count_dummy,
+		.rx_descriptor_status = rte_eth_descriptor_status_dummy,
+		.tx_descriptor_status = rte_eth_descriptor_status_dummy,
+		.recycle_tx_mbufs_reuse = rte_eth_recycle_tx_mbufs_reuse_dummy,
+		.recycle_rx_descriptors_refill = rte_eth_recycle_rx_descriptors_refill_dummy,
 		.rxq = {
 			.data = (void **)&dummy_queues_array[port_id],
 			.clbk = dummy_data,
-- 
2.43.0

^ permalink raw reply related

* [PATCH 7/9] net/dpaa2: fix Rx queue count for primary process
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena
  Cc: dev, Maxime Leroy, stable, Ferruh Yigit, Andrew Rybchenko,
	David Marchand
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

The rx_queue_count callback was only assigned on the secondary process
path of dpaa2_dev_init(), leaving eth_dev->rx_queue_count NULL for the
primary process. The fast-path rte_eth_rx_queue_count() performs an
unguarded indirect call in non-debug builds, so invoking it on a
primary-process dpaa2 port dereferences a NULL function pointer and
crashes.

Assign the callback once before the process-type split so both the
primary and secondary paths set it.

Fixes: cbfc6111b557 ("ethdev: move inline device operations")
Cc: stable@dpdk.org
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 drivers/net/dpaa2/dpaa2_ethdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dpaa2/dpaa2_ethdev.c b/drivers/net/dpaa2/dpaa2_ethdev.c
index 7ca454eaae..fb117e761f 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.c
+++ b/drivers/net/dpaa2/dpaa2_ethdev.c
@@ -3617,6 +3617,7 @@ dpaa2_dev_init(struct rte_eth_dev *eth_dev)
 	}

 	eth_dev->dev_ops = &dpaa2_ethdev_ops;
+	eth_dev->rx_queue_count = dpaa2_dev_rx_queue_count;

 	if (dpaa2_get_devargs(dev->devargs, DRIVER_LOOPBACK_MODE)) {
 		eth_dev->rx_pkt_burst = dpaa2_dev_loopback_rx;
-- 
2.43.0

^ permalink raw reply related

* [PATCH 6/9] bus/fslmc/dpio: tune DQRI interrupt coalescing holdoff
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

The portal DQRI interrupt used a fixed threshold of 3 and a raw 0xFF
timeout. Parameterize dpaa2_dpio_intr_init() with (threshold, timeout) so
each mode supplies its own: the event driver keeps the legacy 3 / 0xFF
and its DPAA2_PORTAL_INTR_THRESHOLD / DPAA2_PORTAL_INTR_TIMEOUT env-var
overrides, while rx-queue interrupts default the threshold to the HW DQRR
ring depth minus one (7 on QBMan >= 4.1) and use a coalescing holdoff in
microseconds, converted to ITP units from the MC-reported QBMan clock
(itp = holdoff_us * clk_MHz / 256, capped at the 12-bit field). The setup
is portal-wide and idempotent, so the first mode to arm a given portal
wins; a portal is normally driven by a single mode.
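
As a worked example of the holdoff conversion, the formula can be
sketched as a standalone function. The function name and the 500 MHz
clock used in the note below are assumptions for illustration; the real
clock is whatever the MC reports, and this sketch simply requires a
nonzero clock.

```c
#include <assert.h>
#include <stdint.h>

#define ITP_MAX 0xfffu	/* 12-bit ITP field */

/* holdoff_us * clk_mhz is the holdoff in QBMan cycles; one ITP unit is
 * 256 cycles. The result is capped at the 12-bit field maximum.
 */
static uint32_t holdoff_us_to_itp(uint32_t holdoff_us, uint32_t clk_mhz)
{
	uint64_t itp;

	assert(clk_mhz != 0);	/* sketch only: caller supplies a valid clock */
	itp = ((uint64_t)holdoff_us * clk_mhz) / 256;

	return itp > ITP_MAX ? ITP_MAX : (uint32_t)itp;
}
```

With an assumed 500 MHz clock, the default 100 us holdoff gives
100 * 500 / 256 = 195 ITP units, and any holdoff above roughly 2 ms
saturates at 0xfff.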

The net/dpaa2 PMD exposes both rx-queue-interrupt knobs as per-port
devargs: drv_rx_intr_holdoff_us (default 100us) and drv_rx_intr_threshold
(default 0 = ring-1, clamped to [1, ring-1]). Also expose
dpaa2_dpio_intr_deinit() (no longer event-only), and on the intr_init
error paths close the epoll fd and disable the interrupt.

Add qbman_swp_dqrr_size() to expose the ring depth.
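
The threshold selection described above (0 means ring depth minus one,
otherwise clamp to the ring) can be sketched as follows. This is a
simplified illustration with a hypothetical function name, not the
driver code: it works on unsigned values, so a nonzero devarg can only
ever be clamped from above.

```c
#include <assert.h>
#include <stdint.h>

/* devarg: value of drv_rx_intr_threshold (0 = auto).
 * dqrr_size: HW DQRR ring depth, e.g. 4 or 8.
 * Returns a threshold in [1, dqrr_size - 1].
 */
static int rx_intr_threshold(uint32_t devarg, uint8_t dqrr_size)
{
	uint32_t dqrr_max;

	assert(dqrr_size >= 2);	/* need at least one valid threshold */
	dqrr_max = (uint32_t)dqrr_size - 1;

	if (devarg == 0 || devarg > dqrr_max)
		return (int)dqrr_max;	/* auto, or clamped to ring - 1 */

	return (int)devarg;
}
```

On an 8-entry ring this yields 7 for the default, 4 for
drv_rx_intr_threshold=4, and 7 again for any larger request.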

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 doc/guides/nics/dpaa2.rst                     | 10 +++
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.c      | 72 +++++++++++++------
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.h      | 12 +++-
 .../fslmc/qbman/include/fsl_qbman_portal.h    |  9 +++
 drivers/bus/fslmc/qbman/qbman_portal.c        |  6 ++
 drivers/net/dpaa2/dpaa2_ethdev.c              | 60 +++++++++++++++-
 drivers/net/dpaa2/dpaa2_ethdev.h              |  7 ++
 7 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/doc/guides/nics/dpaa2.rst b/doc/guides/nics/dpaa2.rst
index 2d70bd0ab9..47a52c9287 100644
--- a/doc/guides/nics/dpaa2.rst
+++ b/doc/guides/nics/dpaa2.rst
@@ -492,6 +492,16 @@ for details.
   packets, so that user can check what is wrong with those packets.
   e.g. ``fslmc:dpni.1,drv_error_queue=1``
 
+* Use dev arg option ``drv_rx_intr_holdoff_us=<uint32>`` to set the Rx queue
+  interrupt coalescing holdoff in microseconds (default 100). Only applies in
+  Rx queue interrupt mode.
+  e.g. ``fslmc:dpni.1,drv_rx_intr_holdoff_us=50``
+
+* Use dev arg option ``drv_rx_intr_threshold=<uint32>`` to set the Rx queue
+  interrupt coalescing frame threshold; 0 (default) means the DQRR ring depth
+  minus one.
+  e.g. ``fslmc:dpni.1,drv_rx_intr_threshold=4``
+
 Enabling logs
 -------------
 
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
index e6b4e74b3b..c5525a94fa 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
@@ -206,12 +206,35 @@ dpaa2_affine_dpio_intr_to_respective_core(int32_t dpio_id, int cpu_id)
 }
 #endif /* RTE_EVENT_DPAA2 */
 
+/* holdoff (us) -> QBMan ITP units (256 cycles each), capped at the 12-bit field */
+RTE_EXPORT_INTERNAL_SYMBOL(dpaa2_dpio_holdoff_to_itp)
+int dpaa2_dpio_holdoff_to_itp(struct dpaa2_dpio_dev *dpio_dev, uint32_t holdoff_us)
+{
+	uint32_t qman_mhz = 0;
+	struct dpio_attr attr;
+	uint64_t itp;
+
+	if (dpio_get_attributes(dpio_dev->dpio, CMD_PRI_LOW, dpio_dev->token, &attr) == 0)
+		qman_mhz = attr.clk / 1000000;
+	itp = qman_mhz ? ((uint64_t)holdoff_us * qman_mhz) / 256 : 0xFF;
+	if (itp > 0xfff)	/* 12-bit ITP field */
+		itp = 0xfff;
+
+	return (int)itp;
+}
+
+/* threshold: DQRR fill raising DQRI (< ring depth); timeout: holdoff in ITP units.
+ * Per-mode values from the caller (eventdev vs rx-queue intr); no env override.
+ * The DQRI config is portal-wide and this is idempotent: the first caller to
+ * arm a portal wins, a later caller's values are ignored (a portal normally
+ * serves a single mode).
+ */
 RTE_EXPORT_INTERNAL_SYMBOL(dpaa2_dpio_intr_init)
-int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
+int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, int threshold,
+			 int timeout, bool build_epoll)
 {
-	struct epoll_event epoll_ev;
 	int eventfd, dpio_epoll_fd, ret;
-	int threshold = 0x3, timeout = 0xFF;
+	struct epoll_event epoll_ev;
 
 	if (dpio_dev->intr_enabled)
 		return 0;
@@ -222,12 +245,6 @@ int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
 		return -1;
 	}
 
-	if (getenv("DPAA2_PORTAL_INTR_THRESHOLD"))
-		threshold = atoi(getenv("DPAA2_PORTAL_INTR_THRESHOLD"));
-
-	if (getenv("DPAA2_PORTAL_INTR_TIMEOUT"))
-		sscanf(getenv("DPAA2_PORTAL_INTR_TIMEOUT"), "%x", &timeout);
-
 	qbman_swp_interrupt_set_trigger(dpio_dev->sw_portal,
 					QBMAN_SWP_INTERRUPT_DQRI);
 	qbman_swp_interrupt_clear_status(dpio_dev->sw_portal, 0xffffffff);
@@ -238,9 +255,9 @@ int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
 	dpio_dev->epoll_fd = -1;
 
 	/* The event PMD dequeues by sleeping on a private epoll instance owned
-	 * by the portal, so build it here. A caller that waits on another
-	 * epoll (the net rx-queue-interrupt path uses the application's) skips
-	 * this.
+	 * by the portal, so build it here. The net rx-queue-interrupt path
+	 * exposes the raw eventfd through the generic ethdev API and waits on
+	 * the application's own epoll instead, so it skips this.
 	 */
 	if (build_epoll) {
 		dpio_epoll_fd = epoll_create(1);
@@ -269,11 +286,14 @@ int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
 	return 0;
 }
 
-#ifdef RTE_EVENT_DPAA2
-static void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
+RTE_EXPORT_INTERNAL_SYMBOL(dpaa2_dpio_intr_deinit)
+void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
 {
 	int ret;
 
+	if (!dpio_dev->intr_enabled)
+		return;
+
 	ret = rte_dpaa2_intr_disable(dpio_dev->intr_handle, 0);
 	if (ret)
 		DPAA2_BUS_ERR("DPIO interrupt disable failed");
@@ -284,7 +304,6 @@ static void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
 	}
 	dpio_dev->intr_enabled = 0;
 }
-#endif
 
 static int
 dpaa2_configure_stashing(struct dpaa2_dpio_dev *dpio_dev, int cpu_id)
@@ -306,9 +325,18 @@ dpaa2_configure_stashing(struct dpaa2_dpio_dev *dpio_dev, int cpu_id)
 	}
 
 #ifdef RTE_EVENT_DPAA2
-	if (dpaa2_dpio_intr_init(dpio_dev, true)) {
-		DPAA2_BUS_ERR("Interrupt registration failed for dpio");
-		return -1;
+	{
+		int threshold = 3, timeout = 0xFF;
+
+		if (getenv("DPAA2_PORTAL_INTR_THRESHOLD"))
+			threshold = atoi(getenv("DPAA2_PORTAL_INTR_THRESHOLD"));
+		if (getenv("DPAA2_PORTAL_INTR_TIMEOUT"))
+			sscanf(getenv("DPAA2_PORTAL_INTR_TIMEOUT"), "%x", &timeout);
+
+		if (dpaa2_dpio_intr_init(dpio_dev, threshold, timeout, true)) {
+			DPAA2_BUS_ERR("Interrupt registration failed for dpio");
+			return -1;
+		}
 	}
 	dpaa2_affine_dpio_intr_to_respective_core(dpio_dev->hw_id, cpu_id);
 #endif
@@ -319,9 +347,11 @@ dpaa2_configure_stashing(struct dpaa2_dpio_dev *dpio_dev, int cpu_id)
 static void dpaa2_put_qbman_swp(struct dpaa2_dpio_dev *dpio_dev)
 {
 	if (dpio_dev) {
-#ifdef RTE_EVENT_DPAA2
+		/* rx-queue interrupts (net PMD) can arm a portal without the
+		 * event driver; tear it down unconditionally. Safe when never
+		 * armed: intr_deinit returns early if intr is not enabled.
+		 */
 		dpaa2_dpio_intr_deinit(dpio_dev);
-#endif
 		rte_atomic16_clear(&dpio_dev->ref_count);
 	}
 }
@@ -512,6 +542,8 @@ dpaa2_create_dpio_device(int vdev_fd,
 		goto err;
 	}
 
+	DPAA2_BUS_DEBUG("QBMAN clk = %u Hz (%u MHz)", attr.clk, attr.clk / 1000000);
+
 	/* find the SoC type for the first time */
 	if (!dpaa2_svr_family) {
 		struct mc_soc_version mc_plat_info = {0};
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
index 10dd968e5f..090fa14410 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
@@ -50,9 +50,17 @@ int dpaa2_affine_qbman_swp(void);
 __rte_internal
 int dpaa2_affine_qbman_ethrx_swp(void);
 
-/* set up a DPIO portal's DQRI interrupt (rx-queue interrupt mode) */
+/* set up / tear down a DPIO portal's DQRI interrupt (rx-queue interrupt mode) */
 __rte_internal
-int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll);
+int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, int threshold,
+			 int timeout, bool build_epoll);
+
+__rte_internal
+void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev);
+
+/* convert a coalescing holdoff (microseconds) to QBMan ITP units */
+__rte_internal
+int dpaa2_dpio_holdoff_to_itp(struct dpaa2_dpio_dev *dpio_dev, uint32_t holdoff_us);
 
 /* allocate memory for FQ - dq storage */
 __rte_internal
diff --git a/drivers/bus/fslmc/qbman/include/fsl_qbman_portal.h b/drivers/bus/fslmc/qbman/include/fsl_qbman_portal.h
index 5375ea386d..842ef6f067 100644
--- a/drivers/bus/fslmc/qbman/include/fsl_qbman_portal.h
+++ b/drivers/bus/fslmc/qbman/include/fsl_qbman_portal.h
@@ -157,6 +157,15 @@ uint32_t qbman_swp_intr_timeout_read_status(struct qbman_swp *p);
  */
 void qbman_swp_intr_timeout_write(struct qbman_swp *p, uint32_t mask);
 
+/**
+ * qbman_swp_dqrr_size() - Get the HW DQRR ring depth of a software portal.
+ * @p: the given software portal object.
+ *
+ * Returns the number of DQRR entries (4 on QBMan < 4.1, 8 on >= 4.1). Useful
+ * as the upper bound for the DQRR interrupt coalescing threshold.
+ */
+uint8_t qbman_swp_dqrr_size(struct qbman_swp *p);
+
 /**
  * qbman_swp_interrupt_get_trigger() - Get the data in software portal
  * interrupt enable register.
diff --git a/drivers/bus/fslmc/qbman/qbman_portal.c b/drivers/bus/fslmc/qbman/qbman_portal.c
index 947415363a..81c2d87e0a 100644
--- a/drivers/bus/fslmc/qbman/qbman_portal.c
+++ b/drivers/bus/fslmc/qbman/qbman_portal.c
@@ -433,6 +433,12 @@ void qbman_swp_intr_timeout_write(struct qbman_swp *p, uint32_t mask)
 	qbman_cinh_write(&p->sys, QBMAN_CINH_SWP_ITPR, mask);
 }
 
+RTE_EXPORT_INTERNAL_SYMBOL(qbman_swp_dqrr_size)
+uint8_t qbman_swp_dqrr_size(struct qbman_swp *p)
+{
+	return p->dqrr.dqrr_size;
+}
+
 uint32_t qbman_swp_interrupt_get_trigger(struct qbman_swp *p)
 {
 	return qbman_cinh_read(&p->sys, QBMAN_CINH_SWP_IER);
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.c b/drivers/net/dpaa2/dpaa2_ethdev.c
index 6407c24755..7ca454eaae 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.c
+++ b/drivers/net/dpaa2/dpaa2_ethdev.c
@@ -36,6 +36,9 @@
 #define DRIVER_ERROR_QUEUE  "drv_err_queue"
 #define DRIVER_NO_TAILDROP  "drv_no_taildrop"
 #define DRIVER_NO_DATA_STASHING "drv_no_data_stashing"
+#define DRIVER_RX_INTR_HOLDOFF_US "drv_rx_intr_holdoff_us"
+#define DPAA2_RX_INTR_HOLDOFF_US_DEF 100
+#define DRIVER_RX_INTR_THRESHOLD "drv_rx_intr_threshold"
 #define CHECK_INTERVAL         100  /* 100ms */
 #define MAX_REPEAT_TIME        90   /* 9s (90 * 100ms) in total */
 
@@ -3078,7 +3081,7 @@ dpaa2_dev_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
 	struct dpaa2_dev_priv *priv = dev->data->dev_private;
 	struct dpaa2_queue *dpaa2_q = priv->rx_vq[queue_id];
 	struct dpaa2_dpio_dev *dpio, *old;
-	int ret;
+	int ret, threshold, timeout, dqrr_max;
 
 	if (!dpaa2_q->napi_dpcon)
 		return -ENOTSUP;	/* no channel -> caller keeps polling */
@@ -3087,10 +3090,22 @@ dpaa2_dev_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
 		return -EIO;
 	dpio = DPAA2_PER_LCORE_ETHRX_DPIO;
 
+	/* threshold from drv_rx_intr_threshold (0 = ring-1), holdoff from
+	 * drv_rx_intr_holdoff_us. idempotent: no-op if the dpio is already
+	 * armed (e.g. event driver)
+	 */
+	dqrr_max = qbman_swp_dqrr_size(dpio->sw_portal) - 1;
+	threshold = priv->rx_intr_threshold ? (int)priv->rx_intr_threshold : dqrr_max;
+	if (threshold < 1 || threshold > dqrr_max) {
+		DPAA2_PMD_WARN("drv_rx_intr_threshold %d out of [1, %d], clamping",
+			       threshold, dqrr_max);
+		threshold = threshold < 1 ? 1 : dqrr_max;
+	}
+	timeout = dpaa2_dpio_holdoff_to_itp(dpio, priv->rx_intr_holdoff_us);
 	/* build_epoll=false: the generic ethdev rx-intr API waits on the
 	 * application epoll, not the portal's private one (event PMD only).
 	 */
-	ret = dpaa2_dpio_intr_init(dpio, false);	/* VFIO eventfd, no MC */
+	ret = dpaa2_dpio_intr_init(dpio, threshold, timeout, false);
 	if (ret)
 		return ret;
 
@@ -3346,6 +3361,35 @@ dpaa2_get_devargs(struct rte_devargs *devargs, const char *key)
 	return 1;
 }
 
+static int
+u32_devarg_handler(__rte_unused const char *key, const char *value, void *opaque)
+{
+	char *end;
+	unsigned long v = strtoul(value, &end, 0);
+
+	if (*value == '\0' || *end != '\0' || v > UINT32_MAX)
+		return -1;
+	*(uint32_t *)opaque = (uint32_t)v;
+
+	return 0;
+}
+
+/* Read a u32-valued devarg into *out, leaving *out untouched if absent. */
+static void
+dpaa2_get_devargs_u32(struct rte_devargs *devargs, const char *key, uint32_t *out)
+{
+	struct rte_kvargs *kvlist;
+
+	if (!devargs)
+		return;
+	kvlist = rte_kvargs_parse(devargs->args, NULL);
+	if (!kvlist)
+		return;
+	if (rte_kvargs_count(kvlist, key))
+		rte_kvargs_process(kvlist, key, u32_devarg_handler, out);
+	rte_kvargs_free(kvlist);
+}
+
 static int
 dpaa2_dev_init(struct rte_eth_dev *eth_dev)
 {
@@ -3373,6 +3417,14 @@ dpaa2_dev_init(struct rte_eth_dev *eth_dev)
 		DPAA2_PMD_INFO("No RX prefetch mode");
 	}
 
+	priv->rx_intr_holdoff_us = DPAA2_RX_INTR_HOLDOFF_US_DEF;
+	dpaa2_get_devargs_u32(dev->devargs, DRIVER_RX_INTR_HOLDOFF_US,
+			      &priv->rx_intr_holdoff_us);
+
+	priv->rx_intr_threshold = 0;
+	dpaa2_get_devargs_u32(dev->devargs, DRIVER_RX_INTR_THRESHOLD,
+			      &priv->rx_intr_threshold);
+
 	if (dpaa2_get_devargs(dev->devargs, DRIVER_LOOPBACK_MODE)) {
 		priv->flags |= DPAA2_RX_LOOPBACK_MODE;
 		DPAA2_PMD_INFO("Rx loopback mode");
@@ -3888,5 +3940,7 @@ RTE_PMD_REGISTER_PARAM_STRING(NET_DPAA2_PMD_DRIVER_NAME,
 		DRIVER_RX_PARSE_ERR_DROP "=<int>"
 		DRIVER_ERROR_QUEUE "=<int>"
 		DRIVER_NO_TAILDROP "=<int>"
-		DRIVER_NO_DATA_STASHING "=<int>");
+		DRIVER_NO_DATA_STASHING "=<int> "
+		DRIVER_RX_INTR_HOLDOFF_US "=<uint32> "
+		DRIVER_RX_INTR_THRESHOLD "=<uint32>");
 RTE_LOG_REGISTER_DEFAULT(dpaa2_logtype_pmd, NOTICE);
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.h b/drivers/net/dpaa2/dpaa2_ethdev.h
index 65fb48bd27..d8be1f8bce 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.h
+++ b/drivers/net/dpaa2/dpaa2_ethdev.h
@@ -412,6 +412,13 @@ struct dpaa2_dev_priv {
 	uint8_t max_cgs;
 	uint8_t cgid_in_use[MAX_RX_QUEUES];
 
+	/* DQRI holdoff (us) for rx-queue interrupts (drv_rx_intr_holdoff_us) */
+	uint32_t rx_intr_holdoff_us;
+	/* DQRI threshold for rx-queue interrupts (drv_rx_intr_threshold);
+	 * 0 = auto (DQRR ring depth - 1)
+	 */
+	uint32_t rx_intr_threshold;
+
 	/* Current hash distribution size per RX TC, written by
 	 * dpaa2_setup_flow_dist_size() and read by reta_query / reta_update.
 	 * Zero means "use default" (= nb_rx_queues clamped to dist_queues).
-- 
2.43.0


^ permalink raw reply related

* [PATCH 5/9] net/dpaa2: support Rx queue interrupts
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

Implement .rx_queue_intr_enable / .rx_queue_intr_disable so a worker
can sleep on a queue's data-availability notification instead of
busy-polling, through the generic rte_eth_dev_rx_intr_* API.

A worker wakes on its software portal's DQRI, which fires when the
portal's DQRR holds frames, so the Rx FQ must be scheduled to a channel
that the portal dequeues. The natural approach, dpni_set_queue with a
notification destination, holds the global MC lock long enough to wedge
the firmware, so it must target a disabled dpni. But the polling portal
is only known once a worker affines, after dev_start, so the
destination cannot be the worker's portal.

Bind each Rx FQ to its own DPCON channel instead. The default Rx burst
pulls frames from the FQ with a volatile dequeue and cannot be
interrupt-driven; to wake on the DQRI the FQ must be pushed to the
portal's DQRR. dev_start issues the DEST_DPCON set_queue statically on
the still-disabled dpni with no knowledge of the polling lcore; a worker
later subscribes its own ethrx portal to the channel and arms the DQRI
in rx_queue_intr_enable (a one-shot per-portal MC op plus QBMan, never
the wedging set_queue).

This pushed/DQRR consumption is how the event PMD works, but the DPCON
use differs. The event PMD uses one DPCON per worker, concentrates N
FQs onto it, and lets the QBMan scheduler load-balance events across
cores. Here affinity is static and there is no scheduling, so each FQ
gets its own DPCON (one per FQ, more channels, drawn from the shared
pool that the DPCON move to the fslmc bus now feeds), bound once at
dev_start before the lcore is known. Frames are delivered by
rte_eth_rx_burst (dpaa2_dev_rx_dqrr), not as events via
rte_event_dequeue.

rte_eth_dev_rx_intr_enable(q) subscribes the lcore portal to q's DPCON
and arms the DQRI. rte_eth_dev_rx_intr_ctl_q(q) adds q's eventfd (the
portal DQRI fd) to the thread epoll.

      wire
       |
    [ DPMAC ]
       |
    [ DPNI ]                                     (1)
       |
    TC0:  FQ0   FQ1   FQ2   FQ3                  (2)
           |     |     |     |                   (3)
        [DPCON][DPCON][DPCON][DPCON]
            \     |     |     /                  (4)
          [ DPIO A ]      [ DPIO B ]             (5)
             |               |
            DQRR            DQRR                 (6)
             |               |
            DQRI            DQRI                 (7)
             |               |
          eventfd         eventfd                (8)
             |               |
        rte_epoll_wait  rte_epoll_wait           (9)
             |               |
        dpaa2_dev_rx_dqrr                        (10)

  (1)  WRIOP picks a TC (QoS), then RSS-hashes within the TC to an FQ
  (2)  FQ0..FQ3 are the rte_eth Rx queues
  (3)  dpni_set_queue(DEST_DPCON): one DPCON per FQ
  (4)  the lcore portal subscribes to its DPCONs (push_set)
  (5)  one QBMan software portal per lcore
  (6)  QMan pushes the FDs into the portal DQRR
  (7)  DQRI is raised when the DQRR is non-empty
  (8)  a portal's queues share one fd (its DQRI eventfd)
  (9)  worker sleeps here when all its queues are idle
  (10) dpaa2_dev_rx_dqrr drains the DQRR, demuxes FDs to FQs by fqd_ctx
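
Steps (8) and (9) can be sketched in plain Linux terms. This is not the
DPDK API: it uses eventfd(2) and epoll(7) directly as a stand-in for the
portal's DQRI eventfd and rte_epoll_wait, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Register fd for readability on epfd; returns 0 on success. */
static int watch_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

	assert(epfd >= 0 && fd >= 0);
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Sleep up to timeout_ms. On wakeup, read the eventfd counter, which
 * resets it so the next wait blocks again. Returns the fd that fired,
 * or -1 on timeout or error.
 */
static int wait_and_ack(int epfd, int timeout_ms)
{
	struct epoll_event ev;
	uint64_t count;

	if (epoll_wait(epfd, &ev, 1, timeout_ms) != 1)
		return -1;
	if (read(ev.data.fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;

	return ev.data.fd;
}
```

A single fd per portal means one wakeup can stand for frames on any of
the portal's queues, which is why the burst in step (10) has to demux
by queue after waking.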

The DQRI and eventfd are portal-wide: a queue's eventfd is its portal's
DQRI fd, and the inhibit bit is refcounted by armed queues so disabling
one queue never masks a sibling. The static per-queue bind also lets a
queue be re-homed to another lcore at runtime: the new worker reclaims
the channel, with no set_queue and no port stop.
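
The refcounted inhibit bit can be sketched as below. The struct and
function names are hypothetical; only the idea (a per-portal count of
armed queues, masking the interrupt when it drops to zero) comes from
the description above, and the arm side is an assumption mirroring the
disarm side.

```c
#include <assert.h>
#include <stdint.h>

struct portal {
	uint16_t armed_queues;	/* queues currently requesting wakeups */
	int inhibited;		/* 1 = portal interrupt masked */
};

struct queue {
	int armed;		/* this queue requests wakeups */
};

/* First armed queue on the portal unmasks the interrupt. */
static void queue_arm(struct portal *p, struct queue *q)
{
	if (q->armed)
		return;
	q->armed = 1;
	if (p->armed_queues++ == 0)
		p->inhibited = 0;
}

/* Last armed queue to leave masks the interrupt; siblings keep it live. */
static void queue_disarm(struct portal *p, struct queue *q)
{
	if (!q->armed)
		return;
	q->armed = 0;
	assert(p->armed_queues > 0);
	if (--p->armed_queues == 0)
		p->inhibited = 1;
}
```

With two queues armed on one portal, disarming the first leaves the
interrupt unmasked; only disarming the second masks it.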

On single-core 64-byte forwarding this interrupt path runs at ~5.0 Mpps
versus ~5.86 Mpps polling, i.e. about 15 percent lower throughput
((5.86 - 5.0) / 5.86): the per-frame DQRR demux and consume costs more
than the polling batch dequeue.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 doc/guides/nics/features/dpaa2.ini       |   1 +
 doc/guides/rel_notes/release_26_07.rst   |   1 +
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.c |  11 +-
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.h |   4 +
 drivers/bus/fslmc/portal/dpaa2_hw_pvt.h  |  27 ++-
 drivers/bus/fslmc/qbman/qbman_portal.c   |   1 +
 drivers/net/dpaa2/dpaa2_ethdev.c         | 293 ++++++++++++++++++++++-
 drivers/net/dpaa2/dpaa2_ethdev.h         |   3 +
 drivers/net/dpaa2/dpaa2_rxtx.c           | 122 ++++++++++
 9 files changed, 457 insertions(+), 6 deletions(-)

diff --git a/doc/guides/nics/features/dpaa2.ini b/doc/guides/nics/features/dpaa2.ini
index 5def653d1d..b53353eb77 100644
--- a/doc/guides/nics/features/dpaa2.ini
+++ b/doc/guides/nics/features/dpaa2.ini
@@ -7,6 +7,7 @@
 Speed capabilities   = Y
 Link status          = Y
 Link status event    = Y
+Rx interrupt         = Y
 Burst mode info      = Y
 Queue start/stop     = Y
 Scattered Rx         = Y
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 103c4034ca..87c7c57bcc 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -129,6 +129,7 @@ New Features
 * **Updated NXP dpaa2 driver.**
 
   * Added RSS RETA query and update support.
+  * Added Rx queue interrupt support.
 
 * **Updated PCAP ethernet driver.**
 
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
index 3a5abb2e6d..e6b4e74b3b 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
@@ -204,13 +204,18 @@ dpaa2_affine_dpio_intr_to_respective_core(int32_t dpio_id, int cpu_id)
 
 	fclose(file);
 }
+#endif /* RTE_EVENT_DPAA2 */
 
-static int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
+RTE_EXPORT_INTERNAL_SYMBOL(dpaa2_dpio_intr_init)
+int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
 {
 	struct epoll_event epoll_ev;
 	int eventfd, dpio_epoll_fd, ret;
 	int threshold = 0x3, timeout = 0xFF;
 
+	if (dpio_dev->intr_enabled)
+		return 0;
+
 	ret = rte_dpaa2_intr_enable(dpio_dev->intr_handle, 0);
 	if (ret) {
 		DPAA2_BUS_ERR("Interrupt registration failed");
@@ -259,9 +264,12 @@ static int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epol
 		dpio_dev->epoll_fd = dpio_epoll_fd;
 	}
 
+	dpio_dev->intr_enabled = 1;
+
 	return 0;
 }
 
+#ifdef RTE_EVENT_DPAA2
 static void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
 {
 	int ret;
@@ -274,6 +282,7 @@ static void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
 		close(dpio_dev->epoll_fd);
 		dpio_dev->epoll_fd = -1;
 	}
+	dpio_dev->intr_enabled = 0;
 }
 #endif
 
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
index 328e1e788a..10dd968e5f 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.h
@@ -50,6 +50,10 @@ int dpaa2_affine_qbman_swp(void);
 __rte_internal
 int dpaa2_affine_qbman_ethrx_swp(void);
 
+/* set up a DPIO portal's DQRI interrupt (rx-queue interrupt mode) */
+__rte_internal
+int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll);
+
 /* allocate memory for FQ - dq storage */
 __rte_internal
 int
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h b/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
index 79a2ec41e3..af75e96b27 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
@@ -133,6 +133,8 @@ struct dpaa2_dpio_dev {
 	struct rte_intr_handle *intr_handle; /* Interrupt related info */
 	int32_t	epoll_fd; /**< File descriptor created for interrupt polling */
 	int32_t hw_id; /**< An unique ID of this DPIO device instance */
+	uint8_t intr_enabled; /**< DQRI portal interrupt already set up */
+	uint16_t ethrx_intr_refcnt; /**< rx queues currently armed on this portal */
 	struct dpaa2_portal_dqrr dpaa2_held_bufs;
 };
 
@@ -164,6 +166,20 @@ typedef void (dpaa2_queue_cb_dqrr_t)(struct qbman_swp *swp,
 typedef void (dpaa2_queue_cb_eqresp_free_t)(uint16_t eqresp_ci,
 					struct dpaa2_queue *dpaa2_q);
 
+#define DPAA2_NAPI_FD_STASH_SIZE 64	/*!< power of 2; >= 2x rx burst so the
+					 * peer port's frames fit before HW
+					 * backpressure (2 ports/worker)
+					 */
+
+/* Lcore-local FIFO of raw FDs demuxed to this queue by another queue's burst
+ * on the same portal (see dpaa2_queue::napi_stash).
+ */
+struct dpaa2_napi_stash {
+	uint16_t head;	/*!< pop index (drain) */
+	uint16_t tail;	/*!< push index (park) */
+	struct qbman_fd fd[DPAA2_NAPI_FD_STASH_SIZE];
+};
+
 struct __rte_cache_aligned dpaa2_queue {
 	struct rte_mempool *mb_pool; /**< mbuf pool to populate RX ring. */
 	union {
@@ -176,7 +192,7 @@ struct __rte_cache_aligned dpaa2_queue {
 	uint8_t cgid;		/*! < Congestion Group id for this queue */
 	uint64_t rx_pkts;
 	uint64_t tx_pkts;
-	uint64_t err_pkts;
+	uint64_t err_pkts;	/*!< also counts NAPI stash-full drops (imissed) */
 	union {
 		/**Ingress*/
 		struct queue_storage_info_t *q_storage[RTE_MAX_LCORE];
@@ -195,6 +211,15 @@ struct __rte_cache_aligned dpaa2_queue {
 	uint64_t offloads;
 	uint64_t lpbk_cntx;
 	uint8_t data_stashing_off;
+	/* NAPI rx-interrupt: per-queue DPCON bound to this FQ at dev_start
+	 * (DEST_DPCON, static); the polling worker subscribes its ethrx portal
+	 * to the channel and arms the DQRI, rx_dqrr drains+demuxes by fqd_ctx.
+	 */
+	struct dpaa2_dpcon_dev *napi_dpcon;	/*!< notif channel, NULL = napi off */
+	RTE_ATOMIC(struct dpaa2_dpio_dev *) napi_sub_dpio;	/*!< subscribed portal or NULL */
+	uint8_t napi_channel_index;		/*!< portal-local static-dequeue idx */
+	uint8_t napi_armed;			/*!< this queue requests DQRI wakeups */
+	struct dpaa2_napi_stash napi_stash;	/*!< NAPI/DQRR demux FDs (~2 KB) */
 };
 
 struct swp_active_dqs {
diff --git a/drivers/bus/fslmc/qbman/qbman_portal.c b/drivers/bus/fslmc/qbman/qbman_portal.c
index 84853924e7..947415363a 100644
--- a/drivers/bus/fslmc/qbman/qbman_portal.c
+++ b/drivers/bus/fslmc/qbman/qbman_portal.c
@@ -448,6 +448,7 @@ int qbman_swp_interrupt_get_inhibit(struct qbman_swp *p)
 	return qbman_cinh_read(&p->sys, QBMAN_CINH_SWP_IIR);
 }
 
+RTE_EXPORT_INTERNAL_SYMBOL(qbman_swp_interrupt_set_inhibit)
 void qbman_swp_interrupt_set_inhibit(struct qbman_swp *p, int inhibit)
 {
 	qbman_cinh_write(&p->sys, QBMAN_CINH_SWP_IIR,
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.c b/drivers/net/dpaa2/dpaa2_ethdev.c
index 8589398324..6407c24755 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.c
+++ b/drivers/net/dpaa2/dpaa2_ethdev.c
@@ -658,6 +658,8 @@ dpaa2_clear_queue_active_dps(struct dpaa2_queue *q, int num_lcores)
 	}
 }
 
+static void dpaa2_dev_rx_queue_intr_unbind(struct dpaa2_queue *dpaa2_q);
+
 static void
 dpaa2_free_rx_tx_queues(struct rte_eth_dev *dev)
 {
@@ -675,6 +677,12 @@ dpaa2_free_rx_tx_queues(struct rte_eth_dev *dev)
 		/* cleaning up queue storage */
 		for (i = 0; i < priv->nb_rx_queues; i++) {
 			dpaa2_q = priv->rx_vq[i];
+			if (dpaa2_q->napi_dpcon) {	/* release the rx-intr channel */
+				dpaa2_dev_rx_queue_intr_unbind(dpaa2_q);
+				rte_dpaa2_free_dpcon_dev(dpaa2_q->napi_dpcon);
+				dpaa2_q->napi_dpcon = NULL;
+				dpaa2_q->napi_sub_dpio = NULL;
+			}
 			dpaa2_clear_queue_active_dps(dpaa2_q,
 						RTE_MAX_LCORE);
 			dpaa2_queue_storage_free(dpaa2_q,
@@ -880,6 +888,21 @@ dpaa2_eth_dev_configure(struct rte_eth_dev *dev)
 		}
 	}
 
+	if (dev->data->dev_conf.intr_conf.rxq) {
+		if (!dev->intr_handle)
+			dev->intr_handle = rte_intr_instance_alloc(
+					RTE_INTR_INSTANCE_F_PRIVATE);
+		if (!dev->intr_handle ||
+		    rte_intr_vec_list_alloc(dev->intr_handle, "rxq_intr",
+				dev->data->nb_rx_queues) ||
+		    rte_intr_nb_efd_set(dev->intr_handle,
+				dev->data->nb_rx_queues) ||
+		    rte_intr_type_set(dev->intr_handle, RTE_INTR_HANDLE_EXT)) {
+			DPAA2_PMD_ERR("Failed to set up rx-queue interrupts");
+			return -rte_errno;
+		}
+	}
+
 	dpaa2_tm_init(dev);
 
 	return 0;
@@ -898,6 +921,7 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 {
 	struct dpaa2_dev_priv *priv = dev->data->dev_private;
 	struct fsl_mc_io *dpni = dev->process_private;
+	bool dpcon_allocated = false;
 	struct dpaa2_queue *dpaa2_q;
 	struct dpni_queue cfg;
 	uint8_t options = 0;
@@ -938,6 +962,21 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	dpaa2_q->bp_array = rte_dpaa2_bpid_info;
 	dpaa2_q->offloads = rx_conf->offloads;
 
+	/* NAPI: grab a DPCON channel so dev_start can bind this FQ statically.
+	 * The DQRR burst replaces the poll path for every queue at once, so a
+	 * missing channel is fatal rather than a silent per-queue fallback.
+	 */
+	dpaa2_q->napi_sub_dpio = NULL;
+	if (dev->data->dev_conf.intr_conf.rxq && !dpaa2_q->napi_dpcon) {
+		dpaa2_q->napi_dpcon = rte_dpaa2_alloc_dpcon_dev();
+		if (!dpaa2_q->napi_dpcon) {
+			DPAA2_PMD_ERR("rxq %d: no DPCON for rx-queue interrupts",
+				      rx_queue_id);
+			return -ENODEV;
+		}
+		dpcon_allocated = true;
+	}
+
 	/*Get the flow id from given VQ id*/
 	flow_id = dpaa2_q->flow_id;
 	memset(&cfg, 0, sizeof(struct dpni_queue));
@@ -945,6 +984,10 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	options = options | DPNI_QUEUE_OPT_USER_CTX;
 	cfg.user_context = (size_t)(dpaa2_q);
 
+	/* clear any stale DPIO dest left scheduled by a prior rx-intr run */
+	options |= DPNI_QUEUE_OPT_DEST;
+	cfg.destination.type = DPNI_DEST_NONE;
+
 	/* check if a private cgr available. */
 	for (i = 0; i < priv->max_cgs; i++) {
 		if (!priv->cgid_in_use[i]) {
@@ -985,7 +1028,7 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 			dpaa2_q->tc_index, flow_id, options, &cfg);
 	if (ret) {
 		DPAA2_PMD_ERR("Error in setting the rx flow: = %d", ret);
-		return ret;
+		goto err_free_dpcon;
 	}
 
 	dpaa2_q->nb_desc = nb_rx_desc;
@@ -1026,7 +1069,7 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		if (ret) {
 			DPAA2_PMD_ERR("Error in setting taildrop. err=(%d)",
 				ret);
-			return ret;
+			goto err_free_dpcon;
 		}
 	} else { /* Disable tail Drop */
 		struct dpni_taildrop taildrop = {0};
@@ -1046,12 +1089,22 @@ dpaa2_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		if (ret) {
 			DPAA2_PMD_ERR("Error in setting taildrop. err=(%d)",
 				ret);
-			return ret;
+			goto err_free_dpcon;
 		}
 	}
 
 	dev->data->rx_queues[rx_queue_id] = dpaa2_q;
 	return 0;
+
+err_free_dpcon:
+	/* free only the DPCON this call allocated; a pre-existing one belongs to
+	 * an earlier setup and is released at dev_close
+	 */
+	if (dpcon_allocated) {
+		rte_dpaa2_free_dpcon_dev(dpaa2_q->napi_dpcon);
+		dpaa2_q->napi_dpcon = NULL;
+	}
+	return ret;
 }
 
 static int
@@ -1210,6 +1263,62 @@ dpaa2_dev_tx_queue_setup(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/* Fully release a queue's rx-interrupt state: detach the FQ from its DPCON,
+ * unbind the static dequeue channel from the portal and free any stashed FDs.
+ * Teardown only: the port is stopped and the portal quiesced; not a runtime
+ * rx_queue_intr_disable() replacement. Call before freeing the DPCON.
+ */
+static void
+dpaa2_dev_rx_queue_intr_unbind(struct dpaa2_queue *dpaa2_q)
+{
+	struct dpaa2_dev_priv *priv;
+	struct dpaa2_dpio_dev *dpio;
+	struct fsl_mc_io *dpni;
+	struct dpni_queue cfg;
+	int ret;
+
+	if (!dpaa2_q || !dpaa2_q->napi_dpcon)
+		return;
+
+	/* detach the FQ from its DPCON so it no longer points at a channel
+	 * about to be returned to the pool (dpni is disabled at teardown)
+	 */
+	priv = dpaa2_q->eth_data->dev_private;
+	dpni = priv->eth_dev->process_private;
+	memset(&cfg, 0, sizeof(cfg));
+	cfg.destination.type = DPNI_DEST_NONE;
+	ret = dpni_set_queue(dpni, CMD_PRI_LOW, priv->token, DPNI_QUEUE_RX,
+			     dpaa2_q->tc_index, dpaa2_q->flow_id,
+			     DPNI_QUEUE_OPT_DEST, &cfg);
+	if (ret)
+		DPAA2_PMD_ERR("napi: DEST_NONE rxq flow %u: %d",
+			      dpaa2_q->flow_id, ret);
+
+	/* unbind the static dequeue channel from the portal it was armed on */
+	dpio = rte_atomic_load_explicit(&dpaa2_q->napi_sub_dpio,
+			rte_memory_order_acquire);
+	if (dpio) {
+		qbman_swp_push_set(dpio->sw_portal,
+				dpaa2_q->napi_channel_index, 0);
+		if (dpaa2_q->napi_armed) {
+			dpaa2_q->napi_armed = 0;
+			if (dpio->ethrx_intr_refcnt > 0 &&
+			    --dpio->ethrx_intr_refcnt == 0)
+				qbman_swp_interrupt_set_inhibit(dpio->sw_portal, 1);
+		}
+		ret = dpio_remove_static_dequeue_channel(dpio->dpio, CMD_PRI_LOW,
+				dpio->token, dpaa2_q->napi_dpcon->dpcon_id);
+		if (ret)
+			DPAA2_PMD_ERR("napi: remove DPCON %d static dequeue channel: %d",
+				      dpaa2_q->napi_dpcon->dpcon_id, ret);
+		rte_atomic_store_explicit(&dpaa2_q->napi_sub_dpio, NULL,
+				rte_memory_order_release);
+	}
+
+	/* free FDs parked for this queue but never drained by a burst */
+	dpaa2_dev_rx_queue_napi_stash_drain(dpaa2_q);
+}
+
 static void
 dpaa2_dev_rx_queue_release(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 {
@@ -1239,6 +1348,12 @@ dpaa2_dev_rx_queue_release(struct rte_eth_dev *dev, uint16_t rx_queue_id)
 		priv->cgid_in_use[dpaa2_q->cgid] = 0;
 		dpaa2_q->cgid = DPAA2_INVALID_CGID;
 	}
+
+	if (dpaa2_q->napi_dpcon) {
+		dpaa2_dev_rx_queue_intr_unbind(dpaa2_q);
+		rte_dpaa2_free_dpcon_dev(dpaa2_q->napi_dpcon);
+		dpaa2_q->napi_dpcon = NULL;
+	}
 }
 
 static int
@@ -1389,6 +1504,36 @@ dpaa2_dev_start(struct rte_eth_dev *dev)
 	intr_handle = dpaa2_dev->intr_handle;
 
 	PMD_INIT_FUNC_TRACE();
+
+	/* NAPI: bind each rx FQ to its own DPCON channel while the dpni is still
+	 * disabled (a DEST set_queue on an enabled dpni wedges the shared MC).
+	 * Static, affinity-free; the polling worker subscribes its portal later.
+	 */
+	if (dev->data->dev_conf.intr_conf.rxq) {
+		for (i = 0; i < data->nb_rx_queues; i++) {
+			dpaa2_q = data->rx_queues[i];
+			if (!dpaa2_q->napi_dpcon)
+				continue;
+			memset(&cfg, 0, sizeof(cfg));
+			cfg.destination.type = DPNI_DEST_DPCON;
+			cfg.destination.id = dpaa2_q->napi_dpcon->dpcon_id;
+			cfg.user_context = (size_t)dpaa2_q;
+			ret = dpni_set_queue(dpni, CMD_PRI_LOW, priv->token,
+					DPNI_QUEUE_RX, dpaa2_q->tc_index,
+					dpaa2_q->flow_id,
+					DPNI_QUEUE_OPT_DEST | DPNI_QUEUE_OPT_USER_CTX,
+					&cfg);
+			if (ret) {
+				DPAA2_PMD_ERR("napi: DPCON bind rxq %d: %d", i, ret);
+				return ret;
+			}
+		}
+		/* DQRR burst for all queues; a queue only yields frames once
+		 * rx_queue_intr_enable() has subscribed its portal
+		 */
+		dev->rx_pkt_burst = dpaa2_dev_rx_dqrr;
+	}
+
 	ret = dpni_enable(dpni, CMD_PRI_LOW, priv->token);
 	if (ret) {
 		DPAA2_PMD_ERR("Failure in enabling dpni %d device: err=%d",
@@ -1859,6 +2004,13 @@ dpaa2_dev_stats_get(struct rte_eth_dev *dev,
 	stats->oerrors = value.page_2.egress_discarded_frames;
 	stats->imissed = value.page_2.ingress_nobuffer_discards;
 
+	/* software Rx drops (full napi stash) are not in the HW counters */
+	for (i = 0; i < priv->nb_rx_queues; i++) {
+		dpaa2_rxq = priv->rx_vq[i];
+		if (dpaa2_rxq != NULL)
+			stats->imissed += dpaa2_rxq->err_pkts;
+	}
+
 	/* Fill in per queue stats */
 	if (qstats != NULL) {
 		for (i = 0; (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) &&
@@ -2172,8 +2324,10 @@ dpaa2_dev_stats_reset(struct rte_eth_dev *dev)
 	/* Reset the per queue stats in dpaa2_queue structure */
 	for (i = 0; i < priv->nb_rx_queues; i++) {
 		dpaa2_q = priv->rx_vq[i];
-		if (dpaa2_q)
+		if (dpaa2_q) {
 			dpaa2_q->rx_pkts = 0;
+			dpaa2_q->err_pkts = 0;
+		}
 	}
 
 	for (i = 0; i < priv->nb_tx_queues; i++) {
@@ -2901,6 +3055,135 @@ rte_pmd_dpaa2_thread_init(void)
 	}
 }
 
+/* Arm rx-queue interrupts on the worker lcore: subscribe its ethrx portal to
+ * the queue's DPCON channel (one-shot per-portal MC) and unmask the portal DQRI
+ * (pure QBMan).
+ *
+ * Affinity is static queue-to-lcore; an lcore may own several rx queues. The
+ * DQRI and the eventfd are portal-wide, so frames are demuxed by fqd_ctx in the
+ * burst and the portal's inhibit bit is reference-counted by the number of its
+ * queues currently armed (ethrx_intr_refcnt) -- disabling one queue must not
+ * mask wakeups still wanted by its siblings. napi_armed and ethrx_intr_refcnt
+ * are plain (not atomic): these ops run on the queue's owner lcore against its
+ * own portal (one portal per lcore), so per-portal isolation keeps them from
+ * racing, not control-plane serialization.
+ *
+ * A re-home reclaims the channel by poking the old portal, so the caller must
+ * have quiesced the previous owner and disabled the queue there; napi_armed is
+ * then 0 and only the new portal is counted.
+ */
+static int
+dpaa2_dev_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+	struct dpaa2_dev_priv *priv = dev->data->dev_private;
+	struct dpaa2_queue *dpaa2_q = priv->rx_vq[queue_id];
+	struct dpaa2_dpio_dev *dpio, *old;
+	int ret;
+
+	if (!dpaa2_q->napi_dpcon)
+		return -ENOTSUP;	/* no channel -> caller keeps polling */
+
+	if (dpaa2_affine_qbman_ethrx_swp())
+		return -EIO;
+	dpio = DPAA2_PER_LCORE_ETHRX_DPIO;
+
+	/* build_epoll=false: the generic ethdev rx-intr API waits on the
+	 * application epoll, not the portal's private one (event PMD only).
+	 */
+	ret = dpaa2_dpio_intr_init(dpio, false);	/* VFIO eventfd, no MC */
+	if (ret)
+		return ret;
+
+	old = rte_atomic_load_explicit(&dpaa2_q->napi_sub_dpio, rte_memory_order_acquire);
+	if (old && old != dpio && dpaa2_q->napi_armed) {
+		DPAA2_PMD_ERR("rxq %d still armed on another portal; disable it first",
+			      queue_id);
+		return -EBUSY;
+	}
+	if (old != dpio) {
+		if (old) {	/* reclaim from old portal (quiesced; QBMan MMIO unsynced) */
+			qbman_swp_push_set(old->sw_portal,
+					dpaa2_q->napi_channel_index, 0);
+			ret = dpio_remove_static_dequeue_channel(old->dpio,
+					CMD_PRI_LOW, old->token,
+					dpaa2_q->napi_dpcon->dpcon_id);
+			/* push_set(0) above already stops the old portal from
+			 * dequeuing; a failed unbind only leaks a static-channel
+			 * slot on the old DPIO, so warn and proceed
+			 */
+			if (ret)
+				DPAA2_PMD_WARN("napi: reclaim rxq %d: %d",
+					       queue_id, ret);
+			/* on no portal until the add below succeeds */
+			rte_atomic_store_explicit(&dpaa2_q->napi_sub_dpio, NULL,
+					rte_memory_order_release);
+		}
+		ret = dpio_add_static_dequeue_channel(dpio->dpio, CMD_PRI_LOW,
+				dpio->token, dpaa2_q->napi_dpcon->dpcon_id,
+				&dpaa2_q->napi_channel_index);
+		if (ret) {
+			DPAA2_PMD_ERR("napi: subscribe rxq %d: %d", queue_id, ret);
+			return ret;
+		}
+		qbman_swp_push_set(dpio->sw_portal,
+				dpaa2_q->napi_channel_index, 1);
+		/* point this queue's eventfd at the portal's DQRI fd so the
+		 * generic rte_eth_dev_rx_intr_ctl_q epoll wakes on it
+		 */
+		if (rte_intr_vec_list_index_set(dev->intr_handle, queue_id, queue_id) ||
+		    rte_intr_efds_index_set(dev->intr_handle, queue_id,
+				rte_intr_fd_get(dpio->intr_handle))) {
+			DPAA2_PMD_ERR("napi: efd wiring rxq %d", queue_id);
+			/* unwind the half-done subscription so HW and driver
+			 * state stay consistent
+			 */
+			qbman_swp_push_set(dpio->sw_portal,
+					dpaa2_q->napi_channel_index, 0);
+			dpio_remove_static_dequeue_channel(dpio->dpio,
+					CMD_PRI_LOW, dpio->token,
+					dpaa2_q->napi_dpcon->dpcon_id);
+			return -EIO;
+		}
+		rte_atomic_store_explicit(&dpaa2_q->napi_sub_dpio, dpio, rte_memory_order_release);
+	}
+
+	/* arm this queue; the portal DQRI is unmasked only on the 0 -> 1 edge
+	 * of its armed-queue count
+	 */
+	if (!dpaa2_q->napi_armed) {
+		dpaa2_q->napi_armed = 1;
+		if (dpio->ethrx_intr_refcnt++ == 0) {
+			qbman_swp_interrupt_clear_status(dpio->sw_portal,
+					0xffffffff);
+			qbman_swp_interrupt_set_inhibit(dpio->sw_portal, 0);
+		}
+	}
+
+	return 0;
+}
+
+/* Disarm rx-queue interrupts for this queue. The portal DQRI is masked only
+ * once the last of its queues disarms; act on the portal the queue is actually
+ * subscribed to, not the caller's current portal.
+ */
+static int
+dpaa2_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+	struct dpaa2_dev_priv *priv = dev->data->dev_private;
+	struct dpaa2_queue *dpaa2_q = priv->rx_vq[queue_id];
+	struct dpaa2_dpio_dev *dpio;
+
+	dpio = rte_atomic_load_explicit(&dpaa2_q->napi_sub_dpio, rte_memory_order_acquire);
+	if (dpio && dpaa2_q->napi_armed) {
+		dpaa2_q->napi_armed = 0;
+		if (dpio->ethrx_intr_refcnt > 0 &&
+		    --dpio->ethrx_intr_refcnt == 0)
+			qbman_swp_interrupt_set_inhibit(dpio->sw_portal, 1);
+	}
+
+	return 0;
+}
+
 static struct eth_dev_ops dpaa2_ethdev_ops = {
 	.dev_configure	  = dpaa2_eth_dev_configure,
 	.dev_start	      = dpaa2_dev_start,
@@ -2929,6 +3212,8 @@ static struct eth_dev_ops dpaa2_ethdev_ops = {
 	.vlan_tpid_set	      = dpaa2_vlan_tpid_set,
 	.rx_queue_setup    = dpaa2_dev_rx_queue_setup,
 	.rx_queue_release  = dpaa2_dev_rx_queue_release,
+	.rx_queue_intr_enable = dpaa2_dev_rx_queue_intr_enable,
+	.rx_queue_intr_disable = dpaa2_dev_rx_queue_intr_disable,
 	.tx_queue_setup    = dpaa2_dev_tx_queue_setup,
 	.rx_burst_mode_get = dpaa2_dev_rx_burst_mode_get,
 	.tx_burst_mode_get = dpaa2_dev_tx_burst_mode_get,
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.h b/drivers/net/dpaa2/dpaa2_ethdev.h
index 3f224c654e..65fb48bd27 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.h
+++ b/drivers/net/dpaa2/dpaa2_ethdev.h
@@ -500,6 +500,9 @@ uint16_t dpaa2_dev_loopback_rx(void *queue, struct rte_mbuf **bufs,
 
 uint16_t dpaa2_dev_prefetch_rx(void *queue, struct rte_mbuf **bufs,
 			       uint16_t nb_pkts);
+uint16_t dpaa2_dev_rx_dqrr(void *queue, struct rte_mbuf **bufs,
+			   uint16_t nb_pkts);
+void dpaa2_dev_rx_queue_napi_stash_drain(struct dpaa2_queue *dpaa2_q);
 void dpaa2_dev_process_parallel_event(struct qbman_swp *swp,
 				      const struct qbman_fd *fd,
 				      const struct qbman_result *dq,
diff --git a/drivers/net/dpaa2/dpaa2_rxtx.c b/drivers/net/dpaa2/dpaa2_rxtx.c
index b316e23e87..189accc1de 100644
--- a/drivers/net/dpaa2/dpaa2_rxtx.c
+++ b/drivers/net/dpaa2/dpaa2_rxtx.c
@@ -922,6 +922,128 @@ dpaa2_dev_prefetch_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return num_rx;
 }
 
+/* Convert a DQRR'd FD (single or scatter-gather) to an mbuf and apply software
+ * VLAN strip, like the poll path.
+ */
+static inline struct rte_mbuf *
+dpaa2_dqrr_fd_to_mbuf(const struct qbman_fd *fd,
+		      struct rte_eth_dev_data *eth_data)
+{
+	struct rte_mbuf *m;
+
+	if (unlikely(DPAA2_FD_GET_FORMAT(fd) == qbman_fd_sg))
+		m = eth_sg_fd_to_mbuf(fd, eth_data->port_id);
+	else
+		m = eth_fd_to_mbuf(fd, eth_data->port_id);
+	if (eth_data->dev_conf.rxmode.offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
+		rte_vlan_strip(m);
+	return m;
+}
+
+/* prefetch a DQRR'd FD's HW annotation (parse area) ahead of conversion */
+static inline void
+dpaa2_dqrr_prefetch_annot(const struct qbman_fd *fd)
+{
+	rte_prefetch0((void *)((size_t)DPAA2_IOVA_TO_VADDR(DPAA2_GET_FD_ADDR(fd))
+			       + DPAA2_FD_PTA_SIZE));
+}
+
+/* Free FDs a sibling burst parked in this queue's stash but that were never
+ * drained (queue released/freed while the lcore still held its frames).
+ */
+void
+dpaa2_dev_rx_queue_napi_stash_drain(struct dpaa2_queue *dpaa2_q)
+{
+	struct dpaa2_napi_stash *stash = &dpaa2_q->napi_stash;
+	const struct qbman_fd *fd;
+
+	while (stash->head != stash->tail) {
+		fd = &stash->fd[stash->head & (DPAA2_NAPI_FD_STASH_SIZE - 1)];
+		rte_pktmbuf_free(dpaa2_dqrr_fd_to_mbuf(fd, dpaa2_q->eth_data));
+		stash->head++;
+	}
+	stash->head = 0;
+	stash->tail = 0;
+}
+
+/* rx interrupt/DQRR path: the FQ is scheduled to a channel the lcore's ethrx
+ * portal statically dequeues -- a VDQ on a scheduled FQ never completes, so DQRR
+ * is the only model compatible with interrupt sleep. One portal serves every
+ * queue the lcore owns, so the burst demuxes by fqd_ctx: own frames are
+ * returned, foreign ones have their raw FD parked in the target queue's stash.
+ *
+ * The application must therefore poll all queues assigned to the lcore after a
+ * wakeup -- the same scheduling contract as plain DPDK polling. When a foreign
+ * queue's stash is full the FD is dropped (freed) rather than left on the shared
+ * DQRR ring, which would head-of-line block every other queue on the portal.
+ */
+uint16_t __rte_hot
+dpaa2_dev_rx_dqrr(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct dpaa2_queue *dpaa2_q = queue;
+	struct rte_eth_dev_data *eth_data = dpaa2_q->eth_data;
+	struct dpaa2_napi_stash *stash = &dpaa2_q->napi_stash;
+	const struct qbman_result *dq;
+	const struct qbman_fd *fd;
+	struct dpaa2_queue *rxq;
+	struct qbman_swp *swp;
+	uint16_t num_rx = 0;
+
+	if (unlikely(!DPAA2_PER_LCORE_ETHRX_DPIO)) {
+		if (dpaa2_affine_qbman_ethrx_swp()) {
+			DPAA2_PMD_ERR("Failure in affining portal");
+			return 0;
+		}
+	}
+	swp = DPAA2_PER_LCORE_ETHRX_PORTAL;
+
+	/* our frames parked by another queue's burst -- convert now (hot) */
+	while (num_rx < nb_pkts && stash->head != stash->tail) {
+		fd = &stash->fd[stash->head & (DPAA2_NAPI_FD_STASH_SIZE - 1)];
+		if (dpaa2_svr_family != SVR_LX2160A &&
+		    (uint16_t)(stash->head + 1) != stash->tail)
+			dpaa2_dqrr_prefetch_annot(&stash->fd[(stash->head + 1) &
+					(DPAA2_NAPI_FD_STASH_SIZE - 1)]);
+		bufs[num_rx++] = dpaa2_dqrr_fd_to_mbuf(fd, eth_data);
+		stash->head++;
+	}
+
+	while (num_rx < nb_pkts) {
+		dq = qbman_swp_dqrr_next(swp);
+		if (!dq)
+			break;			/* ring momentarily empty */
+		qbman_swp_prefetch_dqrr_next(swp);
+		fd = qbman_result_DQ_fd(dq);
+		/* parse summary is in the FRC on LX2160A; annotation is HW-stashed */
+		if (dpaa2_svr_family != SVR_LX2160A)
+			dpaa2_dqrr_prefetch_annot(fd);
+		rxq = (struct dpaa2_queue *)(size_t)qbman_result_DQ_fqd_ctx(dq);
+		if (unlikely(!rxq))
+			rxq = dpaa2_q;
+		if (rxq == dpaa2_q) {
+			bufs[num_rx++] = dpaa2_dqrr_fd_to_mbuf(fd, eth_data);
+		} else {
+			struct dpaa2_napi_stash *fs = &rxq->napi_stash;
+
+			if (unlikely((uint16_t)(fs->tail - fs->head) >=
+						DPAA2_NAPI_FD_STASH_SIZE)) {
+				/* stash full: drop rather than leave it on the ring
+				 * and head-of-line block the shared portal
+				 */
+				rte_pktmbuf_free(dpaa2_dqrr_fd_to_mbuf(fd, rxq->eth_data));
+				rxq->err_pkts++;
+			} else {
+				fs->fd[fs->tail & (DPAA2_NAPI_FD_STASH_SIZE - 1)] = *fd;
+				fs->tail++;
+			}
+		}
+		qbman_swp_dqrr_consume(swp, dq);
+	}
+
+	dpaa2_q->rx_pkts += num_rx;
+	return num_rx;
+}
+
 void __rte_hot
 dpaa2_dev_process_parallel_event(struct qbman_swp *swp,
 				 const struct qbman_fd *fd,
-- 
2.43.0


* [PATCH 4/9] bus/fslmc/dpio: make the portal DQRI epoll optional
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

dpaa2_dpio_intr_init() builds a private epoll instance the event PMD
sleeps on. The upcoming net rx-queue-interrupt path waits on the
application's own epoll instead, so that instance would be built but
never used.

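Add a build_epoll parameter: pass true to build the instance (event PMD),
false to skip the epoll_create()/epoll_ctl() calls. epoll_fd is set to -1
when no instance is built, and intr_deinit closes it only when it is
valid. The sole caller passes true, so the success path is unchanged;
the failure paths now check epoll_create() and disable the interrupt
before returning.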
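
For readers unfamiliar with the pattern, here is a minimal standalone
sketch of an "optional private epoll" init/deinit pair. It is not the
driver code: struct portal_intr, portal_intr_init() and
portal_intr_deinit() are hypothetical names, and a plain eventfd(2)
stands in for the VFIO interrupt fd the real portal uses.

```c
#include <stdbool.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for the portal state; not the driver's struct. */
struct portal_intr {
	int event_fd;	/* interrupt eventfd (VFIO-provided in the driver) */
	int epoll_fd;	/* private epoll instance, or -1 when none was built */
};

/* Build the private epoll instance only when the caller asks for it. */
static int
portal_intr_init(struct portal_intr *p, bool build_epoll)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLPRI | EPOLLET };
	int epfd;

	p->epoll_fd = -1;
	p->event_fd = eventfd(0, EFD_NONBLOCK);
	if (p->event_fd < 0)
		return -1;

	if (!build_epoll)
		return 0;	/* caller waits on its own epoll instance */

	epfd = epoll_create1(0);
	if (epfd < 0)
		goto err_close_event;

	ev.data.fd = p->event_fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, p->event_fd, &ev) < 0) {
		close(epfd);
		goto err_close_event;
	}
	p->epoll_fd = epfd;
	return 0;

err_close_event:
	close(p->event_fd);
	p->event_fd = -1;
	return -1;
}

/* Close only what was actually created; -1 marks "nothing to close". */
static void
portal_intr_deinit(struct portal_intr *p)
{
	if (p->epoll_fd >= 0) {
		close(p->epoll_fd);
		p->epoll_fd = -1;
	}
	if (p->event_fd >= 0) {
		close(p->event_fd);
		p->event_fd = -1;
	}
}
```

Using -1 as the "not built" sentinel keeps deinit safe to call in
either mode, which is the same reason the patch guards its close().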

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.c | 44 +++++++++++++++++-------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
index 2a9e519668..3a5abb2e6d 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpio.c
@@ -205,13 +205,12 @@ dpaa2_affine_dpio_intr_to_respective_core(int32_t dpio_id, int cpu_id)
 	fclose(file);
 }
 
-static int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev)
+static int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev, bool build_epoll)
 {
 	struct epoll_event epoll_ev;
 	int eventfd, dpio_epoll_fd, ret;
 	int threshold = 0x3, timeout = 0xFF;
 
-	dpio_epoll_fd = epoll_create(1);
 	ret = rte_dpaa2_intr_enable(dpio_dev->intr_handle, 0);
 	if (ret) {
 		DPAA2_BUS_ERR("Interrupt registration failed");
@@ -231,16 +230,34 @@ static int dpaa2_dpio_intr_init(struct dpaa2_dpio_dev *dpio_dev)
 	qbman_swp_dqrr_thrshld_write(dpio_dev->sw_portal, threshold);
 	qbman_swp_intr_timeout_write(dpio_dev->sw_portal, timeout);
 
-	eventfd = rte_intr_fd_get(dpio_dev->intr_handle);
-	epoll_ev.events = EPOLLIN | EPOLLPRI | EPOLLET;
-	epoll_ev.data.fd = eventfd;
+	dpio_dev->epoll_fd = -1;
 
-	ret = epoll_ctl(dpio_epoll_fd, EPOLL_CTL_ADD, eventfd, &epoll_ev);
-	if (ret < 0) {
-		DPAA2_BUS_ERR("epoll_ctl failed");
-		return -1;
+	/* The event PMD dequeues by sleeping on a private epoll instance owned
+	 * by the portal, so build it here. A caller that waits on another
+	 * epoll (the net rx-queue-interrupt path uses the application's) skips
+	 * this.
+	 */
+	if (build_epoll) {
+		dpio_epoll_fd = epoll_create(1);
+		if (dpio_epoll_fd < 0) {
+			DPAA2_BUS_ERR("epoll_create failed");
+			rte_dpaa2_intr_disable(dpio_dev->intr_handle, 0);
+			return -1;
+		}
+
+		eventfd = rte_intr_fd_get(dpio_dev->intr_handle);
+		epoll_ev.events = EPOLLIN | EPOLLPRI | EPOLLET;
+		epoll_ev.data.fd = eventfd;
+
+		ret = epoll_ctl(dpio_epoll_fd, EPOLL_CTL_ADD, eventfd, &epoll_ev);
+		if (ret < 0) {
+			DPAA2_BUS_ERR("epoll_ctl failed");
+			rte_dpaa2_intr_disable(dpio_dev->intr_handle, 0);
+			close(dpio_epoll_fd);
+			return -1;
+		}
+		dpio_dev->epoll_fd = dpio_epoll_fd;
 	}
-	dpio_dev->epoll_fd = dpio_epoll_fd;
 
 	return 0;
 }
@@ -253,7 +270,10 @@ static void dpaa2_dpio_intr_deinit(struct dpaa2_dpio_dev *dpio_dev)
 	if (ret)
 		DPAA2_BUS_ERR("DPIO interrupt disable failed");
 
-	close(dpio_dev->epoll_fd);
+	if (dpio_dev->epoll_fd >= 0) {
+		close(dpio_dev->epoll_fd);
+		dpio_dev->epoll_fd = -1;
+	}
 }
 #endif
 
@@ -277,7 +297,7 @@ dpaa2_configure_stashing(struct dpaa2_dpio_dev *dpio_dev, int cpu_id)
 	}
 
 #ifdef RTE_EVENT_DPAA2
-	if (dpaa2_dpio_intr_init(dpio_dev)) {
+	if (dpaa2_dpio_intr_init(dpio_dev, true)) {
 		DPAA2_BUS_ERR("Interrupt registration failed for dpio");
 		return -1;
 	}
-- 
2.43.0


* [PATCH 3/9] bus/fslmc: move DPCON management from event driver to bus
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

The DPCON allocation helpers (rte_dpaa2_alloc_dpcon_dev /
rte_dpaa2_free_dpcon_dev) lived in the event driver, but a notification
channel is a generic QBMan resource. Move dpaa2_hw_dpcon.c to the fslmc
bus and export the helpers as internal symbols so both the event PMD and
the net driver's rx-queue interrupt path can draw channels from the same
pool. No functional change.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 drivers/bus/fslmc/meson.build                    |  1 +
 .../dpaa2 => bus/fslmc/portal}/dpaa2_hw_dpcon.c  | 16 +++++++---------
 drivers/bus/fslmc/portal/dpaa2_hw_pvt.h          |  8 ++++++++
 drivers/event/dpaa2/dpaa2_eventdev.h             |  5 +++--
 drivers/event/dpaa2/meson.build                  |  1 -
 5 files changed, 19 insertions(+), 12 deletions(-)
 rename drivers/{event/dpaa2 => bus/fslmc/portal}/dpaa2_hw_dpcon.c (90%)

diff --git a/drivers/bus/fslmc/meson.build b/drivers/bus/fslmc/meson.build
index ceae1c6c11..50d9e91a37 100644
--- a/drivers/bus/fslmc/meson.build
+++ b/drivers/bus/fslmc/meson.build
@@ -22,6 +22,7 @@ sources = files(
         'mc/mc_sys.c',
         'portal/dpaa2_hw_dpbp.c',
         'portal/dpaa2_hw_dpci.c',
+        'portal/dpaa2_hw_dpcon.c',
         'portal/dpaa2_hw_dpio.c',
         'portal/dpaa2_hw_dprc.c',
         'qbman/qbman_portal.c',
diff --git a/drivers/event/dpaa2/dpaa2_hw_dpcon.c b/drivers/bus/fslmc/portal/dpaa2_hw_dpcon.c
similarity index 90%
rename from drivers/event/dpaa2/dpaa2_hw_dpcon.c
rename to drivers/bus/fslmc/portal/dpaa2_hw_dpcon.c
index ea5b0d4b85..6fd96ec0b9 100644
--- a/drivers/event/dpaa2/dpaa2_hw_dpcon.c
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_dpcon.c
@@ -18,13 +18,12 @@
 #include <rte_cycles.h>
 #include <rte_kvargs.h>
 #include <dev_driver.h>
-#include <ethdev_driver.h>
+#include <eal_export.h>
 
 #include <bus_fslmc_driver.h>
 #include <mc/fsl_dpcon.h>
 #include <portal/dpaa2_hw_pvt.h>
-#include "dpaa2_eventdev.h"
-#include "dpaa2_eventdev_logs.h"
+#include <fslmc_logs.h>
 
 TAILQ_HEAD(dpcon_dev_list, dpaa2_dpcon_dev);
 static struct dpcon_dev_list dpcon_dev_list
@@ -55,8 +54,7 @@ rte_dpaa2_create_dpcon_device(int dev_fd __rte_unused,
 	/* Allocate DPAA2 dpcon handle */
 	dpcon_node = rte_malloc(NULL, sizeof(struct dpaa2_dpcon_dev), 0);
 	if (!dpcon_node) {
-		DPAA2_EVENTDEV_ERR(
-				"Memory allocation failed for dpcon device");
+		DPAA2_BUS_ERR("Memory allocation failed for dpcon device");
 		return -1;
 	}
 
@@ -65,8 +63,7 @@ rte_dpaa2_create_dpcon_device(int dev_fd __rte_unused,
 	ret = dpcon_open(&dpcon_node->dpcon,
 			 CMD_PRI_LOW, dpcon_id, &dpcon_node->token);
 	if (ret) {
-		DPAA2_EVENTDEV_ERR("Unable to open dpcon device: err(%d)",
-				   ret);
+		DPAA2_BUS_ERR("Unable to open dpcon device: err(%d)", ret);
 		rte_free(dpcon_node);
 		return -1;
 	}
@@ -75,8 +72,7 @@ rte_dpaa2_create_dpcon_device(int dev_fd __rte_unused,
 	ret = dpcon_get_attributes(&dpcon_node->dpcon,
 				   CMD_PRI_LOW, dpcon_node->token, &attr);
 	if (ret != 0) {
-		DPAA2_EVENTDEV_ERR("dpcon attribute fetch failed: err(%d)",
-				   ret);
+		DPAA2_BUS_ERR("dpcon attribute fetch failed: err(%d)", ret);
 		rte_free(dpcon_node);
 		return -1;
 	}
@@ -92,6 +88,7 @@ rte_dpaa2_create_dpcon_device(int dev_fd __rte_unused,
 	return 0;
 }
 
+RTE_EXPORT_INTERNAL_SYMBOL(rte_dpaa2_alloc_dpcon_dev)
 struct dpaa2_dpcon_dev *rte_dpaa2_alloc_dpcon_dev(void)
 {
 	struct dpaa2_dpcon_dev *dpcon_dev = NULL;
@@ -105,6 +102,7 @@ struct dpaa2_dpcon_dev *rte_dpaa2_alloc_dpcon_dev(void)
 	return dpcon_dev;
 }
 
+RTE_EXPORT_INTERNAL_SYMBOL(rte_dpaa2_free_dpcon_dev)
 void rte_dpaa2_free_dpcon_dev(struct dpaa2_dpcon_dev *dpcon)
 {
 	struct dpaa2_dpcon_dev *dpcon_dev = NULL;
diff --git a/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h b/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
index e625a5c035..79a2ec41e3 100644
--- a/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
+++ b/drivers/bus/fslmc/portal/dpaa2_hw_pvt.h
@@ -274,6 +274,14 @@ struct dpaa2_dpcon_dev {
 	uint8_t channel_index;
 };
 
+/* DPCON channel allocation -- managed by the fslmc bus so both the net
+ * NAPI/DQRR rx path and the event PMD can grab channels.
+ */
+__rte_internal
+struct dpaa2_dpcon_dev *rte_dpaa2_alloc_dpcon_dev(void);
+__rte_internal
+void rte_dpaa2_free_dpcon_dev(struct dpaa2_dpcon_dev *dpcon);
+
 /* Refer to Table 7-3 in SEC BG */
 #define QBMAN_FLE_WORD4_FMT_SBF 0x0    /* Single buffer frame */
 #define QBMAN_FLE_WORD4_FMT_SGE 0x2 /* Scatter gather frame */
diff --git a/drivers/event/dpaa2/dpaa2_eventdev.h b/drivers/event/dpaa2/dpaa2_eventdev.h
index bb87bdbab2..f53efce61c 100644
--- a/drivers/event/dpaa2/dpaa2_eventdev.h
+++ b/drivers/event/dpaa2/dpaa2_eventdev.h
@@ -85,8 +85,9 @@ struct dpaa2_eventdev {
 	uint32_t event_dev_cfg;
 };
 
-struct dpaa2_dpcon_dev *rte_dpaa2_alloc_dpcon_dev(void);
-void rte_dpaa2_free_dpcon_dev(struct dpaa2_dpcon_dev *dpcon);
+/* rte_dpaa2_alloc_dpcon_dev()/rte_dpaa2_free_dpcon_dev() now live in the fslmc
+ * bus (portal/dpaa2_hw_pvt.h), which this header's includers already pull in.
+ */
 
 int test_eventdev_dpaa2(void);
 
diff --git a/drivers/event/dpaa2/meson.build b/drivers/event/dpaa2/meson.build
index dd5063af43..62b8507652 100644
--- a/drivers/event/dpaa2/meson.build
+++ b/drivers/event/dpaa2/meson.build
@@ -7,7 +7,6 @@ if not is_linux
 endif
 deps += ['bus_vdev', 'net_dpaa2', 'crypto_dpaa2_sec']
 sources = files(
-        'dpaa2_hw_dpcon.c',
         'dpaa2_eventdev.c',
         'dpaa2_eventdev_selftest.c',
 )
-- 
2.43.0


* [PATCH 2/9] eal/interrupts: keep real errno on epoll error
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena
  Cc: dev, Maxime Leroy, stable, Harman Kalra, Cunming Liang
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

Some interrupt users have several vectors backed by the same eventfd
(e.g. several Rx queues behind one DPAA2 portal eventfd). Adding the
second vector to the same epoll instance then fails with EEXIST.

Upper layers such as ethdev and bbdev already treat -EEXIST as a
non-fatal duplicate registration (if (ret && ret != -EEXIST)), but
rte_intr_rx_ctl() lost that information: rte_epoll_ctl() returned -1 and
rte_intr_rx_ctl() flattened every failure to -EPERM.

Return the negative errno from rte_epoll_ctl() (its documented contract
is already "a negative value") and stop rte_intr_rx_ctl() from
flattening errors to -EPERM, so EEXIST reaches the upper layers that
already handle it; other failures carry their real errno.
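
The behaviour described above can be reproduced outside DPDK with
plain epoll(7). The sketch below is only an illustration under that
assumption: epoll_add_fd() and register_vector() are hypothetical
names, not EAL functions. The first returns a negative errno instead
of -1; the second mirrors the "ret && ret != -EEXIST" test that the
upper layers already apply.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical wrapper, not the EAL function: return 0 or a negative errno. */
static int
epoll_add_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };

	ev.data.fd = fd;
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
		return -errno;	/* e.g. -EEXIST when fd is already registered */
	return 0;
}

/* Caller side: a duplicate registration of a shared fd is not fatal. */
static int
register_vector(int epfd, int fd)
{
	int ret = epoll_add_fd(epfd, fd);

	if (ret && ret != -EEXIST)
		return ret;
	return 0;
}
```

Had epoll_add_fd() returned a bare -1, register_vector() could not
tell a harmless duplicate from a real failure, which is the
information this patch stops discarding.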

Fixes: 9efe9c6cdcac ("eal/linux: add epoll wrappers")
Fixes: c9f3ec1a0f3f ("eal/linux: add Rx interrupt control function")
Cc: stable@dpdk.org
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 lib/eal/include/rte_epoll.h    |  3 ++-
 lib/eal/linux/eal_interrupts.c | 18 +++++++++++-------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/lib/eal/include/rte_epoll.h b/lib/eal/include/rte_epoll.h
index ae0cf20853..0c7b510563 100644
--- a/lib/eal/include/rte_epoll.h
+++ b/lib/eal/include/rte_epoll.h
@@ -104,7 +104,8 @@ rte_epoll_wait_interruptible(int epfd, struct rte_epoll_event *events,
  *   Note: The caller must take care the object deletion after CTL_DEL.
  * @return
  *   - On success, zero.
- *   - On failure, a negative value.
+ *   - On failure, a negative errno value, e.g. -EEXIST if the fd is already
+ *     registered on the epoll instance (an fd shared between vectors).
  */
 int
 rte_epoll_ctl(int epfd, int op, int fd,
diff --git a/lib/eal/linux/eal_interrupts.c b/lib/eal/linux/eal_interrupts.c
index 5d0607effe..4cfaeba7fe 100644
--- a/lib/eal/linux/eal_interrupts.c
+++ b/lib/eal/linux/eal_interrupts.c
@@ -1443,7 +1443,7 @@ rte_epoll_ctl(int epfd, int op, int fd,
 
 	if (!event) {
 		EAL_LOG(ERR, "rte_epoll_event can't be NULL");
-		return -1;
+		return -EINVAL;
 	}
 
 	/* using per thread epoll fd */
@@ -1460,13 +1460,21 @@ rte_epoll_ctl(int epfd, int op, int fd,
 
 	ev.events = event->epdata.event;
 	if (epoll_ctl(epfd, op, fd, &ev) < 0) {
+		int err = errno;
+
+		/* the fd is already in the set (e.g. shared across vectors):
+		 * keep the event valid and report -EEXIST, not a hard error.
+		 */
+		if (op == EPOLL_CTL_ADD && err == EEXIST)
+			return -EEXIST;
+
 		EAL_LOG(ERR, "Error op %d fd %d epoll_ctl, %s",
-			op, fd, strerror(errno));
+			op, fd, strerror(err));
 		if (op == EPOLL_CTL_ADD)
 			/* rollback status when CTL_ADD fail */
 			rte_atomic_store_explicit(&event->status, RTE_EPOLL_INVALID,
 					rte_memory_order_relaxed);
-		return -1;
+		return -err;
 	}
 
 	if (op == EPOLL_CTL_DEL && rte_atomic_load_explicit(&event->status,
@@ -1518,8 +1526,6 @@ rte_intr_rx_ctl(struct rte_intr_handle *intr_handle, int epfd,
 			EAL_LOG(DEBUG,
 				"efd %d associated with vec %d added on epfd %d",
 				rev->fd, vec, epfd);
-		else
-			rc = -EPERM;
 		break;
 	case RTE_INTR_EVENT_DEL:
 		epfd_op = EPOLL_CTL_DEL;
@@ -1531,8 +1537,6 @@ rte_intr_rx_ctl(struct rte_intr_handle *intr_handle, int epfd,
 		}
 
 		rc = rte_epoll_ctl(rev->epfd, epfd_op, rev->fd, rev);
-		if (rc)
-			rc = -EPERM;
 		break;
 	default:
 		EAL_LOG(ERR, "event op type mismatch");
-- 
2.43.0


* [PATCH 1/9] net/dpaa2: implement RSS RETA query and update
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy
In-Reply-To: <20260611154926.392670-1-maxime@leroys.fr>

DPAA2 dispatches RX frames to FQs using 'queue_id = hash % dist_size',
where dist_size is set per-TC via the dpni_set_rx_hash_dist MC command.
There is no software-visible indirection table, so the standard DPDK
RETA API has never been exposed by this PMD.

Implement reta_update / reta_query as an emulation on top of
dpni_set_rx_hash_dist. The emulation accepts only the uniform pattern
'reta[i] = i % N' for some N in the HW-allowed set (1, 2, 3, 4, 6, 7,
8, 12, 14, 16, 24, ...). Non-uniform or weighted patterns are rejected
with -ENOTSUP, as the HW has no arbitrary indirection table.
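
As an illustration of the acceptance rule above, here is a minimal
sketch of a check for the uniform pattern. It is not the code in this
patch: reta_uniform_size() is a hypothetical helper operating on a
flat uint16_t array rather than on rte_eth_rss_reta_entry64 groups,
and it deliberately does not test whether N belongs to the HW-allowed
set, which stays a separate step.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the PMD's code: return N if the table is exactly
 * reta[i] = i % N, or -1 otherwise. Whether N is a distribution size the
 * hardware accepts is a separate check left to the caller.
 */
static int
reta_uniform_size(const uint16_t *reta, size_t reta_size)
{
	uint32_t n = 0;
	size_t i;

	if (reta == NULL || reta_size == 0)
		return -1;

	/* In a uniform table the largest entry is N - 1. */
	for (i = 0; i < reta_size; i++)
		if ((uint32_t)reta[i] + 1 > n)
			n = (uint32_t)reta[i] + 1;

	/* Every slot must then follow the modulo pattern for that N. */
	for (i = 0; i < reta_size; i++)
		if (reta[i] != i % n)
			return -1;	/* weighted or permuted pattern */

	return (int)n;
}
```

A caller in the spirit of this patch would map the -1 result to
-ENOTSUP and hand an accepted N on as the distribution size.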

Changing N sets the size of the contiguous queue subset that RSS
spreads traffic over; queues N and above are left out of the hash
distribution. This covers the patterns that matter here, e.g. growing
or shrinking the active subset to scale CPU cores with load, or
reserving the upper queues for specific traffic that rte_flow steers
there for dedicated polling or QoS handling on its own core.

Refactor the existing dpaa2_setup_flow_dist() to delegate to a new
helper dpaa2_setup_flow_dist_size() that takes the dist_size explicitly
and caches it in priv->dist_size_cur[tc] so reta_query() can report it.

reta_query() returns reta[i] = i % N: this is representative, not
bit-exact, as the HW maps the hash to a queue through its distribution
size encoding rather than a plain modulo. reta_update() takes the RSS
hash set from dev_conf (rx_adv_conf.rss_conf.rss_hf); a prior
rss_hash_update() with a different hf is not re-read.

The advertised reta_size is 64 (one rte_eth_rss_reta_entry64 group), the
smallest legal value and enough for all HW-permitted N values up to 64.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
---
 doc/guides/nics/features/dpaa2.ini     |   1 +
 doc/guides/rel_notes/release_26_07.rst |   4 +
 drivers/net/dpaa2/base/dpaa2_hw_dpni.c |  34 ++--
 drivers/net/dpaa2/dpaa2_ethdev.c       | 205 +++++++++++++++++++++++++
 drivers/net/dpaa2/dpaa2_ethdev.h       |   9 ++
 5 files changed, 244 insertions(+), 9 deletions(-)

diff --git a/doc/guides/nics/features/dpaa2.ini b/doc/guides/nics/features/dpaa2.ini
index 5f9c587847..5def653d1d 100644
--- a/doc/guides/nics/features/dpaa2.ini
+++ b/doc/guides/nics/features/dpaa2.ini
@@ -15,6 +15,7 @@ Promiscuous mode     = Y
 Allmulticast mode    = Y
 Unicast MAC filter   = Y
 RSS hash             = Y
+RSS reta update      = Y
 VLAN filter          = Y
 Flow control         = Y
 Traffic manager      = Y
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b5285af5fe..103c4034ca 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -126,6 +126,10 @@ New Features
 
   * Added support for selective Rx in scalar SPRQ Rx path.
 
+* **Updated NXP dpaa2 driver.**
+
+  * Added RSS RETA query and update support.
+
 * **Updated PCAP ethernet driver.**
 
   * Added support for VLAN insertion and stripping.
diff --git a/drivers/net/dpaa2/base/dpaa2_hw_dpni.c b/drivers/net/dpaa2/base/dpaa2_hw_dpni.c
index 13825046d8..4cbc890cee 100644
--- a/drivers/net/dpaa2/base/dpaa2_hw_dpni.c
+++ b/drivers/net/dpaa2/base/dpaa2_hw_dpni.c
@@ -103,15 +103,10 @@ dpaa2_setup_flow_dist(struct rte_eth_dev *eth_dev,
 	uint64_t req_dist_set, int tc_index)
 {
 	struct dpaa2_dev_priv *priv = eth_dev->data->dev_private;
-	struct fsl_mc_io *dpni = eth_dev->process_private;
-	struct dpni_rx_dist_cfg tc_cfg;
-	struct dpkg_profile_cfg kg_cfg;
-	void *p_params;
-	int ret, tc_dist_queues;
+	int tc_dist_queues;
 
-	/*TC distribution size is set with dist_queues or
-	 * nb_rx_queues % dist_queues in order of TC priority index.
-	 * Calculating dist size for this tc_index:-
+	/* TC distribution size is set with dist_queues or
+	 * (nb_rx_queues - tc_index*dist_queues) in order of TC priority index.
 	 */
 	tc_dist_queues = eth_dev->data->nb_rx_queues -
 		tc_index * priv->dist_queues;
@@ -123,6 +118,24 @@ dpaa2_setup_flow_dist(struct rte_eth_dev *eth_dev,
 	if (tc_dist_queues > priv->dist_queues)
 		tc_dist_queues = priv->dist_queues;
 
+	return dpaa2_setup_flow_dist_size(eth_dev, req_dist_set,
+					   tc_index, tc_dist_queues);
+}
+
+int
+dpaa2_setup_flow_dist_size(struct rte_eth_dev *eth_dev,
+	uint64_t req_dist_set, int tc_index, uint16_t dist_size)
+{
+	struct dpaa2_dev_priv *priv = eth_dev->data->dev_private;
+	struct fsl_mc_io *dpni = eth_dev->process_private;
+	struct dpni_rx_dist_cfg tc_cfg;
+	struct dpkg_profile_cfg kg_cfg;
+	void *p_params;
+	int ret;
+
+	if (dist_size == 0)
+		return 0;
+
 	p_params = rte_malloc(NULL,
 		DIST_PARAM_IOVA_SIZE, RTE_CACHE_LINE_SIZE);
 	if (!p_params) {
@@ -150,7 +163,7 @@ dpaa2_setup_flow_dist(struct rte_eth_dev *eth_dev,
 		return -ENOBUFS;
 	}
 
-	tc_cfg.dist_size = tc_dist_queues;
+	tc_cfg.dist_size = dist_size;
 	tc_cfg.enable = true;
 	tc_cfg.tc = tc_index;
 
@@ -168,6 +181,9 @@ dpaa2_setup_flow_dist(struct rte_eth_dev *eth_dev,
 		return ret;
 	}
 
+	if (tc_index < MAX_TCS)
+		priv->dist_size_cur[tc_index] = dist_size;
+
 	return 0;
 }
 
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.c b/drivers/net/dpaa2/dpaa2_ethdev.c
index 803a8321e0..8589398324 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.c
+++ b/drivers/net/dpaa2/dpaa2_ethdev.c
@@ -80,6 +80,33 @@ bool dpaa2_print_parser_result;
 #define MAX_NB_RX_DESC_IN_PEB	11264
 static int total_nb_rx_desc;
 
+/* Size of the RETA (Redirection Table) we expose to the standard DPDK API.
+ * Must be a multiple of RTE_ETH_RETA_GROUP_SIZE (64). DPAA2 has no actual
+ * indirection table in HW; this is the granularity at which uniform RSS
+ * patterns are inspected by dpaa2_dev_rss_reta_update().
+ */
+#define DPAA2_RETA_SIZE		64
+
+/* Values of dist_size accepted by the DPNI 'dpni_set_rx_hash_dist' MC command.
+ * Source: fsl_dpni.h, "struct dpni_rx_dist_cfg::dist_size" documentation.
+ * Used by dpaa2_dev_rss_reta_update() to validate user-requested patterns.
+ */
+static const uint16_t dpaa2_dist_size_allowed[] = {
+	1, 2, 3, 4, 6, 7, 8, 12, 14, 16, 24, 28, 32, 48, 56, 64,
+	96, 112, 128, 192, 224, 256, 384, 448, 512, 768, 896, 1024,
+};
+
+static bool
+dpaa2_dist_size_is_supported(uint16_t n)
+{
+	size_t i;
+	for (i = 0; i < RTE_DIM(dpaa2_dist_size_allowed); i++) {
+		if (dpaa2_dist_size_allowed[i] == n)
+			return true;
+	}
+	return false;
+}
+
 int dpaa2_valid_dev;
 struct rte_mempool *dpaa2_tx_sg_pool;
 
@@ -425,6 +452,14 @@ dpaa2_dev_info_get(struct rte_eth_dev *dev,
 	dev_info->max_vfs = 0;
 	dev_info->max_vmdq_pools = RTE_ETH_16_POOLS;
 	dev_info->flow_type_rss_offloads = DPAA2_RSS_OFFLOAD_ALL;
+	/* DPAA2 has no software-visible indirection table: incoming packets are
+	 * dispatched to FQs via 'queue_id = hash % dist_size'. We expose the
+	 * standard RETA API as an emulation that only accepts uniform patterns
+	 * 'reta[i] = i % N' and translates them into a dpni_set_rx_hash_dist
+	 * command with dist_size=N. See dpaa2_dev_rss_reta_update().
+	 */
+	dev_info->reta_size = DPAA2_RETA_SIZE;
+	dev_info->hash_key_size = 0;
 
 	dev_info->default_rxportconf.burst_size = dpaa2_dqrr_size;
 	/* same is rx size for best perf */
@@ -2508,6 +2543,174 @@ dpaa2_dev_rss_hash_conf_get(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/* Emulation of the standard DPDK RETA API on top of DPAA2's
+ * dpni_set_rx_hash_dist MC command.
+ *
+ * DPAA2 hardware dispatches incoming frames using 'queue_id = hash % dist_size'
+ * (no software-visible indirection table). To expose the standard
+ * rte_eth_dev_rss_reta_update() interface, we accept ONLY uniform patterns of
+ * the form 'reta[i] = i % N' where N is in the HW-allowed dist_size list. Any
+ * other pattern (weighted RSS, non-contiguous queue IDs, gaps) is rejected
+ * with -ENOTSUP. This is enough to support dynamic RSS scale-up/down across
+ * a contiguous queue subset, which is the main use case for adaptive
+ * dataplane CPU usage.
+ *
+ * Applies the new dist_size on every configured RX TC, mirroring the
+ * behavior of dpaa2_dev_rss_hash_update().
+ */
+static int
+dpaa2_dev_rss_reta_update(struct rte_eth_dev *dev,
+			  struct rte_eth_rss_reta_entry64 *reta_conf,
+			  uint16_t reta_size)
+{
+	struct dpaa2_dev_priv *priv = dev->data->dev_private;
+	struct rte_eth_conf *eth_conf = &dev->data->dev_conf;
+	uint16_t i, max_q = 0, n;
+	int tc_index, ret;
+	bool any_set = false;
+
+	PMD_INIT_FUNC_TRACE();
+
+	if (reta_size != DPAA2_RETA_SIZE) {
+		DPAA2_PMD_ERR("Invalid reta_size %u (expected %u)",
+			      reta_size, DPAA2_RETA_SIZE);
+		return -EINVAL;
+	}
+
+	/* dpaa2 cannot merge a partial RETA into the live table, so only a
+	 * full update (every entry of every group) is accepted.
+	 */
+	for (i = 0; i < reta_size / RTE_ETH_RETA_GROUP_SIZE; i++) {
+		if (reta_conf[i].mask != UINT64_MAX) {
+			DPAA2_PMD_ERR("partial RETA update not supported; set all %u entries",
+				      DPAA2_RETA_SIZE);
+			return -ENOTSUP;
+		}
+	}
+
+	/* First pass: validate queue IDs, find max, and require at least
+	 * one slot to be selected via the per-group mask.
+	 */
+	for (i = 0; i < reta_size; i++) {
+		uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
+		uint16_t pos = i % RTE_ETH_RETA_GROUP_SIZE;
+		uint16_t q;
+
+		if (!(reta_conf[grp].mask & (1ULL << pos)))
+			continue;
+		any_set = true;
+
+		q = reta_conf[grp].reta[pos];
+		if (q >= dev->data->nb_rx_queues) {
+			DPAA2_PMD_ERR(
+				"reta[%u] = %u out of range (max %u)",
+				i, q, dev->data->nb_rx_queues - 1);
+			return -EINVAL;
+		}
+		if (q > max_q)
+			max_q = q;
+	}
+
+	if (!any_set) {
+		DPAA2_PMD_WARN("reta_update called with empty mask, no-op");
+		return 0;
+	}
+
+	n = max_q + 1;
+
+	/* Second pass: enforce the uniform pattern reta[i] = i % n on every
+	 * slot the user has selected. dpaa2 HW cannot honor any other layout.
+	 */
+	for (i = 0; i < reta_size; i++) {
+		uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
+		uint16_t pos = i % RTE_ETH_RETA_GROUP_SIZE;
+		uint16_t expected = i % n;
+		uint16_t q;
+
+		if (!(reta_conf[grp].mask & (1ULL << pos)))
+			continue;
+
+		q = reta_conf[grp].reta[pos];
+		if (q != expected) {
+			DPAA2_PMD_ERR(
+				"Non-uniform RETA pattern at slot %u "
+				"(got queue %u, expected %u). dpaa2 HW "
+				"only supports queue_id = hash mod N with "
+				"contiguous queues 0..N-1.",
+				i, q, expected);
+			return -ENOTSUP;
+		}
+	}
+
+	if (!dpaa2_dist_size_is_supported(n)) {
+		DPAA2_PMD_ERR(
+			"dist_size %u not supported by HW. Allowed: "
+			"1,2,3,4,6,7,8,12,14,16,24,28,32,48,56,64,...",
+			n);
+		return -ENOTSUP;
+	}
+
+	/* Apply on every configured RX TC, matching rss_hash_update behavior. */
+	for (tc_index = 0; tc_index < priv->num_rx_tc; tc_index++) {
+		ret = dpaa2_setup_flow_dist_size(dev,
+				eth_conf->rx_adv_conf.rss_conf.rss_hf,
+				tc_index, n);
+		if (ret) {
+			DPAA2_PMD_ERR(
+				"Failed to apply dist_size=%u on tc%d (err=%d)",
+				n, tc_index, ret);
+			return ret;
+		}
+	}
+
+	DPAA2_PMD_DEBUG("RETA updated: dist_size now %u on %u TC(s)",
+			n, priv->num_rx_tc);
+	return 0;
+}
+
+/* Synthesizes a RETA snapshot from the currently-active dist_size on TC 0.
+ * Since DPAA2 always uses uniform 'hash mod N' distribution, the returned
+ * RETA is reta[i] = i % dist_size_cur[0].
+ */
+static int
+dpaa2_dev_rss_reta_query(struct rte_eth_dev *dev,
+			 struct rte_eth_rss_reta_entry64 *reta_conf,
+			 uint16_t reta_size)
+{
+	struct dpaa2_dev_priv *priv = dev->data->dev_private;
+	uint16_t i, n;
+
+	PMD_INIT_FUNC_TRACE();
+
+	if (reta_size != DPAA2_RETA_SIZE) {
+		DPAA2_PMD_ERR("Invalid reta_size %u (expected %u)",
+			      reta_size, DPAA2_RETA_SIZE);
+		return -EINVAL;
+	}
+
+	/* Use the cached dist_size on TC 0 (representative). Fall back to the
+	 * default (nb_rx_queues clamped to dist_queues) when never programmed.
+	 */
+	n = priv->dist_size_cur[0];
+	if (n == 0) {
+		n = priv->dist_queues;
+		if (n > dev->data->nb_rx_queues)
+			n = dev->data->nb_rx_queues;
+	}
+	if (n == 0)
+		return -EINVAL;
+
+	for (i = 0; i < reta_size; i++) {
+		uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
+		uint16_t pos = i % RTE_ETH_RETA_GROUP_SIZE;
+
+		if (reta_conf[grp].mask & (1ULL << pos))
+			reta_conf[grp].reta[pos] = i % n;
+	}
+
+	return 0;
+}
+
 RTE_EXPORT_INTERNAL_SYMBOL(dpaa2_eth_eventq_attach)
 int dpaa2_eth_eventq_attach(const struct rte_eth_dev *dev,
 		int eth_rx_queue_id,
@@ -2736,6 +2939,8 @@ static struct eth_dev_ops dpaa2_ethdev_ops = {
 	.mac_addr_set         = dpaa2_dev_set_mac_addr,
 	.rss_hash_update      = dpaa2_dev_rss_hash_update,
 	.rss_hash_conf_get    = dpaa2_dev_rss_hash_conf_get,
+	.reta_update          = dpaa2_dev_rss_reta_update,
+	.reta_query           = dpaa2_dev_rss_reta_query,
 	.flow_ops_get         = dpaa2_dev_flow_ops_get,
 	.rxq_info_get	      = dpaa2_rxq_info_get,
 	.txq_info_get	      = dpaa2_txq_info_get,
diff --git a/drivers/net/dpaa2/dpaa2_ethdev.h b/drivers/net/dpaa2/dpaa2_ethdev.h
index 4da47a543a..3f224c654e 100644
--- a/drivers/net/dpaa2/dpaa2_ethdev.h
+++ b/drivers/net/dpaa2/dpaa2_ethdev.h
@@ -412,6 +412,12 @@ struct dpaa2_dev_priv {
 	uint8_t max_cgs;
 	uint8_t cgid_in_use[MAX_RX_QUEUES];
 
+	/* Current hash distribution size per RX TC, written by
+	 * dpaa2_setup_flow_dist_size() and read by reta_query / reta_update.
+	 * Zero means "use default" (= nb_rx_queues clamped to dist_queues).
+	 */
+	uint16_t dist_size_cur[MAX_TCS];
+
 	uint16_t dpni_ver_major;
 	uint16_t dpni_ver_minor;
 	uint32_t speed_capa;
@@ -468,6 +474,9 @@ int dpaa2_distset_to_dpkg_profile_cfg(uint64_t req_dist_set,
 int dpaa2_setup_flow_dist(struct rte_eth_dev *eth_dev,
 		uint64_t req_dist_set, int tc_index);
 
+int dpaa2_setup_flow_dist_size(struct rte_eth_dev *eth_dev,
+		uint64_t req_dist_set, int tc_index, uint16_t dist_size);
+
 int dpaa2_remove_flow_dist(struct rte_eth_dev *eth_dev,
 			   uint8_t tc_index);
 
-- 
2.43.0



* [PATCH 0/9] net/dpaa2: NAPI-style Rx queue interrupts
From: Maxime Leroy @ 2026-06-11 15:49 UTC (permalink / raw)
  To: hemant.agrawal, sachin.saxena; +Cc: dev, Maxime Leroy

This series lets a dpaa2 worker sleep on a queue's data-availability
notification instead of busy-polling, exposed through the generic
rte_eth_dev_rx_intr_* API (NAPI-style: poll while frames keep coming,
arm the interrupt and sleep when the queue runs dry).

Why it is not a trivial .rx_queue_intr_enable
----------------------------------------------
A worker wakes on its software portal's DQRI, which fires when the
portal's DQRR holds frames. The default dpaa2 Rx burst pulls frames
from the FQ with a volatile dequeue and cannot be interrupt-driven; to
wake on the DQRI the FQ must instead be pushed to the portal's DQRR.

The natural approach, dpni_set_queue with a notification destination,
would have to target the worker's portal. That portal is only known once
a worker affines, which happens after dev_start, and at that point the MC
command holds the global MC lock long enough to wedge the firmware while
traffic runs. So the bind cannot be done late, against the polling lcore.

Design
------
Each Rx FQ is bound to its own DPCON channel, statically, at dev_start
while the dpni is still disabled (no knowledge of the polling lcore). A
worker later subscribes its own ethrx portal to the channel and arms the
DQRI in rx_queue_intr_enable, a one-shot per-portal op, never the wedging
set_queue. One portal serves every queue a worker owns, so the DQRR
burst demuxes frames to their FQ by fqd_ctx; foreign frames are parked in
the target queue's stash, so the application polls all its queues after a
wakeup, the same scheduling contract as plain DPDK polling. A queue can
be re-homed to another lcore at runtime with no set_queue and no port
stop.
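
The demux-and-stash step can be pictured with the standalone model below.
It is a sketch under assumptions: the structures, the fixed stash depth
and demux_burst() are hypothetical and only mirror the description above,
not the driver's data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_QUEUES 4
#define STASH_MAX 16

/* Hypothetical model: a frame carries the context of the FQ it came from. */
struct model_frame {
	uint32_t fqd_ctx; /* index of the owning queue in this model */
	uint32_t data;
};

/* Hypothetical per-queue stash for frames dequeued for another queue. */
struct model_queue {
	struct model_frame stash[STASH_MAX];
	size_t stash_len;
};

/*
 * Demux one portal burst on behalf of queue `self`: frames that belong to
 * `self` go to `out`; foreign frames are parked in their owner's stash.
 * Returns the number of frames written to `out`.
 */
static size_t
demux_burst(struct model_queue *queues, uint32_t self,
	    const struct model_frame *burst, size_t n,
	    struct model_frame *out)
{
	size_t i, nb_out = 0;

	for (i = 0; i < n; i++) {
		uint32_t owner = burst[i].fqd_ctx;

		assert(owner < NB_QUEUES);
		if (owner == self) {
			out[nb_out++] = burst[i];
		} else if (queues[owner].stash_len < STASH_MAX) {
			queues[owner].stash[queues[owner].stash_len++] = burst[i];
		}
		/* A full stash drops the frame in this model; the driver's
		 * policy is not described in the cover letter. */
	}
	return nb_out;
}
```

Because a burst on one queue can park frames for another, the owner of
several queues has to poll each of them after a wakeup, which is the
contract stated above.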

This reuses the event PMD's pushed/DQRR model but with one DPCON per FQ
and static affinity (no QBMan scheduling), so the DPCON allocator is
moved from the event driver to the fslmc bus and shared.

Patches 3 to 6 build the interrupt support proper, on top of three bug
fixes the path depends on and which it uncovered: patch 2 (eal, the
shared portal eventfd must not fail with -EEXIST), patch 7 (rx_queue_count
NULL on the primary process) and patch 8 (fast-path ops NULL after port
stop). They are real fixes, tagged for stable and backportable on their
own. Patches 1 (RSS RETA) and 9 (drop the software VLAN strip) are
independent net/dpaa2 changes the interrupt path does not require.

Tested on LX2160A (lx2160acex7).

Maxime Leroy (9):
  net/dpaa2: implement RSS RETA query and update
  eal/interrupts: keep real errno on epoll error
  bus/fslmc: move DPCON management from event driver to bus
  bus/fslmc/dpio: make the portal DQRI epoll optional
  net/dpaa2: support Rx queue interrupts
  bus/fslmc/dpio: tune DQRI interrupt coalescing holdoff
  net/dpaa2: fix Rx queue count for primary process
  ethdev: keep fast-path ops valid after port stop
  net/dpaa2: drop the fake software VLAN strip offload

 doc/guides/nics/dpaa2.rst                     |  10 +
 doc/guides/nics/features/dpaa2.ini            |   2 +
 doc/guides/rel_notes/release_26_07.rst        |   8 +
 drivers/bus/fslmc/meson.build                 |   1 +
 .../fslmc/portal}/dpaa2_hw_dpcon.c            |  16 +-
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.c      | 113 +++-
 drivers/bus/fslmc/portal/dpaa2_hw_dpio.h      |  12 +
 drivers/bus/fslmc/portal/dpaa2_hw_pvt.h       |  35 +-
 .../fslmc/qbman/include/fsl_qbman_portal.h    |   9 +
 drivers/bus/fslmc/qbman/qbman_portal.c        |   7 +
 drivers/event/dpaa2/dpaa2_eventdev.h          |   5 +-
 drivers/event/dpaa2/meson.build               |   1 -
 drivers/net/dpaa2/base/dpaa2_hw_dpni.c        |  34 +-
 drivers/net/dpaa2/dpaa2_ethdev.c              | 556 +++++++++++++++++-
 drivers/net/dpaa2/dpaa2_ethdev.h              |  19 +
 drivers/net/dpaa2/dpaa2_rxtx.c                | 123 +++-
 lib/eal/include/rte_epoll.h                   |   3 +-
 lib/eal/linux/eal_interrupts.c                |  18 +-
 lib/ethdev/ethdev_private.c                   |   7 +
 19 files changed, 908 insertions(+), 71 deletions(-)
 rename drivers/{event/dpaa2 => bus/fslmc/portal}/dpaa2_hw_dpcon.c (90%)

-- 
2.43.0



* Re: [PATCH v2 01/22] net/cnxk: update mbuf next field for multi segment
From: Stephen Hemminger @ 2026-06-11 15:26 UTC (permalink / raw)
  To: Rahul Bhansali
  Cc: dev, Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori,
	Satha Rao, Harman Kalra, jerinj
In-Reply-To: <20260611142029.3351415-1-rbhansali@marvell.com>

On Thu, 11 Jun 2026 19:50:08 +0530
Rahul Bhansali <rbhansali@marvell.com> wrote:

> As per the requirement of rte_mbuf_raw_reset_bulk(), the mbuf's
> 'next' and 'nb_segs' fields are required to be reset.
> This resets these fields for multi-segment mbufs on the cn9k platform.
> 
> Signed-off-by: Rahul Bhansali <rbhansali@marvell.com>

Please put a cover letter on large multi-patch series in future.

The CI AI review doesn't look at the original source and uses a
cost-optimized model (i.e., it is not that smart). I did a UI-based review
and it saw:

Reviewed the v2 series. Three issues, rest look good.

[06/22] net/cnxk: reserve memory for lookup mem at probe

Error: error path returns success. At the new check rc is still 0 from
the prior successful roc_nix_dev_init(). When
cnxk_nix_fastpath_lookup_mem_get() returns NULL, "goto dev_fini" falls
through to "return rc" with rc == 0, so dev init reports success and the
rc=%d message prints 0. Set rc before the jump:

	if (!cnxk_nix_fastpath_lookup_mem_get()) {
		plt_err("Failed to reserve lookup memory");
		rc = -ENOMEM;
		goto dev_fini;
	}

[07/22] drivers: add support for devargs skip size

Warning: shared memzone freed while other ports still use it.
SKIP_SIZE_PKIND_MEMZONE is a single global memzone created once (lookup-
guarded) in roc_npc_init(), but roc_npc_fini() frees it unconditionally.
On a multi-port device the first port closed tears down the table the
other ports still read in roc_npc_skip_size_pkind_get() during inbound SA
creation; the lookup then returns NULL and skip-size pkind selection
silently stops working for the surviving ports. Refcount the memzone or
tie its lifetime to the common/inline layer instead of per-NPC fini.
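
A minimal sketch of the refcount option, not taken from the cnxk code:
shared_tbl, shared_tbl_get() and shared_tbl_put() are hypothetical names,
and a heap allocation stands in for the memzone.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical shared table standing in for the global memzone. */
static void *shared_tbl;
static unsigned int shared_refcnt;

/* Called from each port's init: create on first use, then count users. */
static void *
shared_tbl_get(size_t size)
{
	if (shared_tbl == NULL) {
		shared_tbl = calloc(1, size);
		if (shared_tbl == NULL)
			return NULL;
	}
	shared_refcnt++;
	return shared_tbl;
}

/* Called from each port's fini: free only when the last user leaves. */
static void
shared_tbl_put(void)
{
	assert(shared_refcnt > 0);
	if (--shared_refcnt == 0) {
		free(shared_tbl);
		shared_tbl = NULL;
	}
}
```

The sketch is single-threaded and single-process. A real memzone can be
visible to several processes, so the counter would presumably have to
live somewhere all of them can reach and be updated under a lock.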

[19/22] net/cnxk: add FEC get set and capability ops

Warning: feature not reflected in the features matrix. features.rst maps
the FEC feature to fec_get_capability/fec_get/fec_set, which this patch
implements, but doc/guides/nics/features/cnxk.ini is not updated with
"FEC = Y". The ops return NOTSUP on VF/SDP, so cnxk_vf.ini is correct as-
is. Add the matrix entry.

Note on [16/22]: the changes are good. Moving cpt_cq_ena inside the
"if (idev && idev->nix_inl_dev)" block fixes a NULL deref of inl_dev, and
cpt_cq_ena is initialized to 0 so the fall-through default is correct.
The roc_dev.c / roc_ree.c error-path rework fixes real leaks and wrong-
success returns.

Other patches reviewed with no issues.


* [PATCH v8 18/18] vfio: introduce cdev mode
From: Anatoly Burakov @ 2026-06-11 15:09 UTC (permalink / raw)
  To: dev, Bruce Richardson
In-Reply-To: <cover.1781190151.git.anatoly.burakov@intel.com>

Add support for the VFIO cdev (also known as IOMMUFD) API. The group API
is now considered legacy in the kernel, and all further development is
expected to happen in the IOMMUFD infrastructure.

To assist any future use of VFIO cdev mode for custom behavior, also
introduce a "get device number" API, which is loosely similar to the
concept of an IOMMU group number without being a direct equivalent.
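
To make "device number" concrete: it is the X in an entry named vfioX
(as in /dev/vfio/devices/vfioX). The standalone sketch below parses such
a name. parse_vfio_dev_name() is a hypothetical helper; how the patch
itself locates the entry is not shown in this excerpt.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract X from a directory entry named "vfioX".
 * Returns 1 and stores X on success, 0 if the name has another shape.
 */
static int
parse_vfio_dev_name(const char *name, int *dev_num)
{
	static const char prefix[] = "vfio";
	const size_t plen = sizeof(prefix) - 1;
	char *end;
	long val;

	assert(name != NULL && dev_num != NULL);
	if (strncmp(name, prefix, plen) != 0)
		return 0;
	/* Require a digit right after the prefix: strtol would otherwise
	 * accept leading spaces or a sign. */
	if (name[plen] < '0' || name[plen] > '9')
		return 0;
	errno = 0;
	val = strtol(name + plen, &end, 10);
	if (errno != 0 || *end != '\0' || val > INT_MAX)
		return 0;
	*dev_num = (int)val;
	return 1;
}
```

Names such as "vfio" (no number) or "noiommu-3" (group-mode naming) are
rejected rather than guessed at.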

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/freebsd/eal.c            |  10 +
 lib/eal/include/rte_vfio.h       |  31 +++
 lib/eal/linux/eal_vfio.c         | 210 +++++++++++++++++
 lib/eal/linux/eal_vfio.h         |  29 ++-
 lib/eal/linux/eal_vfio_cdev.c    | 390 +++++++++++++++++++++++++++++++
 lib/eal/linux/eal_vfio_mp_sync.c |  42 ++++
 lib/eal/linux/meson.build        |   1 +
 7 files changed, 711 insertions(+), 2 deletions(-)
 create mode 100644 lib/eal/linux/eal_vfio_cdev.c

diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index bb05a969a9..a84280a66c 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -931,3 +931,13 @@ rte_vfio_get_mode(void)
 {
 	return RTE_VFIO_MODE_NONE;
 }
+
+RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_get_device_num)
+int
+rte_vfio_get_device_num(__rte_unused const char *sysfs_base,
+		__rte_unused const char *dev_addr,
+		__rte_unused int *vfio_device_num)
+{
+	rte_errno = ENOTSUP;
+	return -1;
+}
diff --git a/lib/eal/include/rte_vfio.h b/lib/eal/include/rte_vfio.h
index c4ba0e5cda..502a68c948 100644
--- a/lib/eal/include/rte_vfio.h
+++ b/lib/eal/include/rte_vfio.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #define RTE_VFIO_DIR "/dev/vfio"
 #define RTE_VFIO_CONTAINER_PATH "/dev/vfio/vfio"
+#define RTE_VFIO_IOMMUFD_PATH "/dev/iommu"
+#define RTE_VFIO_CDEV_DEVICES_PATH "/dev/vfio/devices"
 #define RTE_VFIO_GROUP_FMT "/dev/vfio/%u"
 #define RTE_VFIO_NOIOMMU_GROUP_FMT "/dev/vfio/noiommu-%u"
 #define RTE_VFIO_NOIOMMU_MODE "/sys/module/vfio/parameters/enable_unsafe_noiommu_mode"
@@ -48,11 +50,13 @@ struct vfio_device_info;
  * - RTE_VFIO_MODE_NONE: VFIO is not enabled.
  * - RTE_VFIO_MODE_GROUP: Legacy group mode.
  * - RTE_VFIO_MODE_NOIOMMU: Unsafe no-IOMMU mode.
+ * - RTE_VFIO_MODE_CDEV: Character device mode.
  */
 enum rte_vfio_mode {
 	RTE_VFIO_MODE_NONE = 0, /**< VFIO not enabled */
 	RTE_VFIO_MODE_GROUP,    /**< Group mode */
 	RTE_VFIO_MODE_NOIOMMU,  /**< Group mode with no IOMMU protection */
+	RTE_VFIO_MODE_CDEV,     /**< Device mode */
 };
 
 /**
@@ -197,6 +201,33 @@ __rte_internal
 int
 rte_vfio_get_group_num(const char *sysfs_base, const char *dev_addr, int *iommu_group_num);
 
+/**
+ * @internal
+ * Parse VFIO cdev device number for a device.
+ *
+ * This function is only relevant on Linux in cdev mode.
+ *
+ * @param sysfs_base
+ *   Sysfs path prefix.
+ * @param dev_addr
+ *   Device identifier.
+ * @param vfio_device_num
+ *   Pointer to where VFIO cdev device number will be stored.
+ *
+ * @return
+ *   0 on success.
+ *   <0 on failure, rte_errno is set.
+ *
+ * Possible rte_errno values include:
+ * - ENODEV  - Device not managed by VFIO.
+ * - EINVAL  - Invalid parameters.
+ * - ENXIO   - VFIO support not initialized.
+ * - ENOTSUP - Unsupported VFIO mode.
+ */
+__rte_internal
+int
+rte_vfio_get_device_num(const char *sysfs_base, const char *dev_addr, int *vfio_device_num);
+
 /**
  * @internal
  * Get device information.
diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
index c104008a43..004ee48cf5 100644
--- a/lib/eal/linux/eal_vfio.c
+++ b/lib/eal/linux/eal_vfio.c
@@ -348,6 +348,20 @@ vfio_container_get_by_group_num(int group_num)
 	return NULL;
 }
 
+static struct container *
+vfio_container_get_by_dev_num(int dev_num)
+{
+	struct container *cfg;
+	struct vfio_device *dev;
+
+	CONTAINER_FOREACH_ACTIVE(cfg) {
+		DEVICE_FOREACH_ACTIVE(cfg, dev)
+			if (dev->dev_num == dev_num)
+				return cfg;
+	}
+	return NULL;
+}
+
 static struct container *
 vfio_container_create(void)
 {
@@ -517,6 +531,55 @@ vfio_setup_dma_mem(struct container *cfg)
 	return 0;
 }
 
+static enum vfio_result
+vfio_cdev_assign_device(struct container *cfg, const char *sysfs_base,
+		const char *dev_addr, struct vfio_device **out_dev)
+{
+	struct vfio_device *dev, *found_dev;
+	enum vfio_result res;
+	int dev_num, ret;
+
+	/* get the cdev device number from sysfs */
+	ret = vfio_cdev_get_device_num(sysfs_base, dev_addr, &dev_num);
+	if (ret < 0) {
+		EAL_LOG(ERR, "Failed to get cdev device number for %s", dev_addr);
+		return VFIO_ERROR;
+	} else if (ret == 0) {
+		EAL_LOG(ERR, "Device %s not bound to vfio-pci cdev", dev_addr);
+		return VFIO_NOT_MANAGED;
+	}
+
+	/* do we already have this device? */
+	found_dev = vfio_cdev_get_dev_by_num(cfg, dev_num);
+	if (found_dev != NULL) {
+		EAL_LOG(ERR, "Device %s already assigned to this container", dev_addr);
+		*out_dev = found_dev;
+		return VFIO_EXISTS;
+	}
+	/* create new device structure */
+	dev = vfio_device_create(cfg);
+	if (dev == NULL) {
+		EAL_LOG(ERR, "No space to track new VFIO cdev device");
+		return VFIO_NO_SPACE;
+	}
+	/* store device number */
+	dev->dev_num = dev_num;
+
+	/* set up our device now and store it in config */
+	ret = vfio_cdev_setup_device(cfg, dev);
+	if (ret < 0) {
+		EAL_LOG(ERR, "Cannot setup cdev device %s", dev_addr);
+		res = VFIO_ERROR;
+		goto err;
+	}
+	*out_dev = dev;
+	return VFIO_SUCCESS;
+
+err:
+	vfio_device_erase(cfg, dev);
+	return res;
+}
+
 static enum vfio_result
 vfio_group_assign_device(struct container *cfg, const char *sysfs_base,
 		const char *dev_addr, struct vfio_device **out_dev)
@@ -663,6 +726,49 @@ rte_vfio_container_assign_device(int container_fd, const char *sysfs_base, const
 		return -1;
 	}
 
+	/*
+	 * The device-to-container assignment is a complex problem to solve, for the following
+	 * reasons:
+	 *
+	 * 1. PCI infrastructure is decoupled from VFIO, so PCI does not know anything about VFIO
+	 *
+	 * This means that while 99% of VFIO usage is PCI-related, we cannot communicate to PCI that
+	 * we want to map a particular device using a particular container. Previously, this was
+	 * achieved using back-channel communication via IOMMU group binding, so that whenever PCI
+	 * map actually happens, VFIO knows which container to use, so this is roughly the model we
+	 * are going with.
+	 *
+	 * 2. VFIO cannot depend on PCI because VFIO is in EAL
+	 *
+	 * We cannot "assign" a PCI device to container using rte_pci_device pointer because VFIO
+	 * cannot depend on PCI definitions, nor can we even assume that our device is in fact a
+	 * PCI device, even though in practice this is true (at the time of this writing, FSLMC is
+	 * the only bus doing non-PCI VFIO mappings, but FSLMC manages all VFIO infrastructure by
+	 * itself, so in practice even counting FSLMC bus, we're always dealing with PCI devices).
+	 *
+	 * 3. The "assignment" means different things for group and cdev mode
+	 *
+	 * In group mode, to "bind" a device to a specific container, it is enough to bind its
+	 * IOMMU group, so that when rte_vfio_setup_device() is called, we simply retrieve already
+	 * existing group, and through that we figure out which container to use.
+	 *
+	 * For cdev mode, there are no "groups", so "assignment" either means we store some kind of
+	 * uniquely identifying token (such as device number, or an opaque pointer), or we simply
+	 * open the device straight away, and when rte_vfio_setup_device() comes we simply return
+	 * the fd that was already opened at assign.
+	 *
+	 * Doing it the latter way (opening the device at assign for both group and cdev modes)
+	 * actually solves all of these problems, so that's what we're going to do - the device
+	 * setup API call will actually just assign the device to default container, while release
+	 * will automatically cleanup and unassign anything that needs unassigned. There will be no
+	 * "unassign" call, as it is not necessary.
+	 *
+	 * There is one downside for group mode when adding duplicate devices: to get to device fd,
+	 * we need to go through the entire codepath before we arrive at fd only to realize it was
+	 * already opened earlier, but this is an acceptable compromise for unifying the API around
+	 * device assignment.
+	 */
+
 	if (vfio_cfg.mode == RTE_VFIO_MODE_NONE) {
 		EAL_LOG(ERR, "VFIO support not initialized");
 		rte_errno = ENXIO;
@@ -683,6 +789,9 @@ rte_vfio_container_assign_device(int container_fd, const char *sysfs_base, const
 	case RTE_VFIO_MODE_NOIOMMU:
 		res = vfio_group_assign_device(cfg, sysfs_base, dev_addr, &dev);
 		break;
+	case RTE_VFIO_MODE_CDEV:
+		res = vfio_cdev_assign_device(cfg, sysfs_base, dev_addr, &dev);
+		break;
 	default:
 		EAL_LOG(ERR, "Unsupported VFIO mode");
 		res = VFIO_NOT_SUPPORTED;
@@ -755,6 +864,24 @@ rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		res = vfio_group_assign_device(cfg, sysfs_base, dev_addr, &dev);
 		break;
 	}
+	case RTE_VFIO_MODE_CDEV:
+	{
+		int dev_num;
+
+		/* find device number */
+		ret = vfio_cdev_get_device_num(sysfs_base, dev_addr, &dev_num);
+		if (ret < 0)
+			goto assign_fail;
+		else if (ret == 0)
+			goto not_managed;
+
+		cfg = vfio_container_get_by_dev_num(dev_num);
+		if (cfg == NULL)
+			cfg = vfio_cfg.default_cfg;
+
+		res = vfio_cdev_assign_device(cfg, sysfs_base, dev_addr, &dev);
+		break;
+	}
 	default:
 		EAL_LOG(ERR, "Unsupported VFIO mode");
 		rte_errno = ENOTSUP;
@@ -874,6 +1001,12 @@ rte_vfio_release_device(const char *sysfs_base __rte_unused,
 		}
 		break;
 	}
+	case RTE_VFIO_MODE_CDEV:
+	{
+		/* for cdev, just erase the device and we're done */
+		vfio_device_erase(cfg, dev);
+		break;
+	}
 	default:
 		EAL_LOG(ERR, "Unsupported VFIO mode");
 		rte_errno = ENOTSUP;
@@ -939,6 +1072,9 @@ vfio_select_mode(void)
 
 		if (vfio_sync_mode(cfg, &mode) < 0)
 			goto err;
+		/* if primary is in cdev mode, we need to sync ioas as well */
+		if (mode == RTE_VFIO_MODE_CDEV && vfio_cdev_sync_ioas(cfg) < 0)
+			goto err;
 
 		/* primary handles DMA setup for default containers */
 		group_cfg->dma_setup_done = true;
@@ -958,6 +1094,19 @@ vfio_select_mode(void)
 			return RTE_VFIO_MODE_NOIOMMU;
 		return RTE_VFIO_MODE_GROUP;
 	}
+	EAL_LOG(DEBUG, "VFIO group mode not available, trying cdev mode...");
+	/* try cdev mode */
+	if (vfio_cdev_enable(cfg) == 0) {
+		if (vfio_cdev_setup_ioas(cfg) < 0)
+			goto err_mpsync;
+		if (vfio_setup_dma_mem(cfg) < 0)
+			goto err_mpsync;
+		if (vfio_register_mem_event_callback() < 0)
+			goto err_mpsync;
+
+		return RTE_VFIO_MODE_CDEV;
+	}
+	EAL_LOG(DEBUG, "VFIO cdev mode not available");
 err_mpsync:
 	vfio_mp_sync_cleanup();
 err:
@@ -972,6 +1121,7 @@ vfio_mode_to_str(enum rte_vfio_mode mode)
 	switch (mode) {
 	case RTE_VFIO_MODE_GROUP: return "group";
 	case RTE_VFIO_MODE_NOIOMMU: return "noiommu";
+	case RTE_VFIO_MODE_CDEV: return "cdev";
 	default: return "not initialized";
 	}
 }
@@ -1111,6 +1261,40 @@ rte_vfio_get_group_num(const char *sysfs_base, const char *dev_addr, int *iommu_
 	return 0;
 }
 
+RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_get_device_num)
+int
+rte_vfio_get_device_num(const char *sysfs_base, const char *dev_addr, int *device_num)
+{
+	int ret;
+
+	if (sysfs_base == NULL || dev_addr == NULL || device_num == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	if (vfio_cfg.mode == RTE_VFIO_MODE_NONE) {
+		EAL_LOG(ERR, "VFIO support not initialized");
+		rte_errno = ENXIO;
+		return -1;
+	}
+
+	if (vfio_cfg.mode != RTE_VFIO_MODE_CDEV) {
+		EAL_LOG(ERR, "VFIO not initialized in cdev mode");
+		rte_errno = ENOTSUP;
+		return -1;
+	}
+
+	ret = vfio_cdev_get_device_num(sysfs_base, dev_addr, device_num);
+	if (ret < 0) {
+		rte_errno = EINVAL;
+		return -1;
+	} else if (ret == 0) {
+		rte_errno = ENODEV;
+		return -1;
+	}
+	return 0;
+}
+
 static int
 vfio_dma_mem_map(struct container *cfg, uint64_t vaddr, uint64_t iova,
 		uint64_t len, int do_map)
@@ -1310,6 +1494,25 @@ rte_vfio_container_create(void)
 		cfg->container_fd = container_fd;
 		break;
 	}
+	case RTE_VFIO_MODE_CDEV:
+	{
+		/* Open new iommufd for custom container */
+		container_fd = vfio_cdev_get_iommufd();
+		if (container_fd < 0) {
+			EAL_LOG(ERR, "Cannot open iommufd for cdev container");
+			rte_errno = EIO;
+			goto err;
+		}
+		cfg->container_fd = container_fd;
+
+		/* Set up IOAS for this container */
+		if (vfio_cdev_setup_ioas(cfg) < 0) {
+			EAL_LOG(ERR, "Cannot setup IOAS for cdev container");
+			rte_errno = EIO;
+			goto err;
+		}
+		break;
+	}
 	default:
 		EAL_LOG(NOTICE, "Unsupported VFIO mode");
 		rte_errno = ENOTSUP;
@@ -1368,6 +1571,13 @@ rte_vfio_container_destroy(int container_fd)
 			vfio_group_erase(cfg, grp);
 		}
 		break;
+	case RTE_VFIO_MODE_CDEV:
+		/* erase all devices */
+		DEVICE_FOREACH_ACTIVE(cfg, dev) {
+			EAL_LOG(DEBUG, "Device vfio%d still open, closing", dev->dev_num);
+			vfio_device_erase(cfg, dev);
+		}
+		break;
 	default:
 		EAL_LOG(ERR, "Unsupported VFIO mode");
 		rte_errno = ENOTSUP;
diff --git a/lib/eal/linux/eal_vfio.h b/lib/eal/linux/eal_vfio.h
index 68d3a3ec6e..52cb3a0e08 100644
--- a/lib/eal/linux/eal_vfio.h
+++ b/lib/eal/linux/eal_vfio.h
@@ -48,7 +48,10 @@ struct vfio_group {
 /* device tracking (common for group and cdev modes) */
 struct vfio_device {
 	bool active;
-	int group; /**< back-reference to group list (group mode) */
+	union {
+		int group; /**< back-reference to group list (group mode) */
+		int dev_num;   /**< device number, e.g., X in /dev/vfio/devices/vfioX (cdev mode) */
+	};
 	int fd;
 };
 
@@ -61,12 +64,20 @@ struct vfio_group_config {
 	struct vfio_group groups[RTE_MAX_VFIO_GROUPS];
 };
 
+/* cdev mode specific configuration */
+struct vfio_cdev_config {
+	uint32_t ioas_id;
+};
+
 /* per-container configuration */
 struct container {
 	bool active;
 	int container_fd;
 	struct user_mem_maps mem_maps;
-	struct vfio_group_config group_cfg;
+	union {
+		struct vfio_group_config group_cfg;
+		struct vfio_cdev_config cdev_cfg;
+	};
 	int n_devices;
 	struct vfio_device devices[RTE_MAX_VFIO_DEVICES];
 };
@@ -160,12 +171,24 @@ int vfio_group_setup_iommu(struct container *cfg);
 int vfio_group_setup_device_fd(const char *dev_addr,
 		struct vfio_group *grp, struct vfio_device *dev);
 
+/* cdev mode functions */
+int vfio_cdev_enable(struct container *cfg);
+int vfio_cdev_setup_ioas(struct container *cfg);
+int vfio_cdev_sync_ioas(struct container *cfg);
+int vfio_cdev_get_iommufd(void);
+int vfio_cdev_get_device_num(const char *sysfs_base, const char *dev_addr,
+		int *cdev_dev_num);
+struct vfio_device *vfio_cdev_get_dev_by_num(struct container *cfg, int cdev_dev_num);
+int vfio_cdev_setup_device(struct container *cfg, struct vfio_device *dev);
+
 #define VFIO_MEM_EVENT_CLB_NAME "vfio_mem_event_clb"
 #define EAL_VFIO_MP "eal_vfio_mp_sync"
 
 #define SOCKET_REQ_CONTAINER 0x100
 #define SOCKET_REQ_GROUP 0x200
 #define SOCKET_REQ_IOMMU_TYPE 0x400
+#define SOCKET_REQ_CDEV 0x800
+#define SOCKET_REQ_IOAS_ID 0x1000
 #define SOCKET_OK 0x0
 #define SOCKET_NO_FD 0x1
 #define SOCKET_ERR 0xFF
@@ -176,6 +199,8 @@ struct vfio_mp_param {
 	union {
 		int group_num;
 		int iommu_type_id;
+		int cdev_dev_num;
+		int ioas_id;
 		enum rte_vfio_mode mode;
 	};
 };
diff --git a/lib/eal/linux/eal_vfio_cdev.c b/lib/eal/linux/eal_vfio_cdev.c
new file mode 100644
index 0000000000..ce61a97853
--- /dev/null
+++ b/lib/eal/linux/eal_vfio_cdev.c
@@ -0,0 +1,390 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2025 Intel Corporation
+ */
+
+#include <dirent.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+
+#include <uapi/linux/iommufd.h>
+#include <uapi/linux/vfio.h>
+
+#include <rte_log.h>
+#include <rte_errno.h>
+#include <rte_memory.h>
+#include <rte_string_fns.h>
+
+#include "eal_vfio.h"
+#include "eal_private.h"
+#include "eal_internal_cfg.h"
+
+static int vfio_cdev_dma_map(struct container *cfg);
+static int vfio_cdev_dma_mem_map(struct container *cfg, uint64_t vaddr,
+		uint64_t iova, uint64_t len, int do_map);
+
+/* IOMMUFD cdev mode IOMMU operations */
+static const struct vfio_iommu_ops iommufd_ops = {
+	.type_id = 0, /* cdev mode doesn't use type_id */
+	.name = "IOMMUFD",
+	.partial_unmap = false,
+	.dma_map_func = &vfio_cdev_dma_map,
+	.dma_user_map_func = &vfio_cdev_dma_mem_map
+};
+
+static int
+vfio_cdev_dma_mem_map(struct container *cfg, uint64_t vaddr, uint64_t iova,
+		uint64_t len, int do_map)
+{
+	struct iommu_ioas_map ioas_map;
+	struct iommu_ioas_unmap ioas_unmap;
+	int ret;
+
+	if (do_map != 0) {
+		memset(&ioas_map, 0, sizeof(ioas_map));
+		ioas_map.size = sizeof(struct iommu_ioas_map);
+		ioas_map.flags = IOMMU_IOAS_MAP_FIXED_IOVA |
+				IOMMU_IOAS_MAP_READABLE |
+				IOMMU_IOAS_MAP_WRITEABLE;
+		ioas_map.ioas_id = cfg->cdev_cfg.ioas_id;
+		ioas_map.user_va = vaddr;
+		ioas_map.length = len;
+		ioas_map.iova = iova;
+
+		ret = ioctl(cfg->container_fd, IOMMU_IOAS_MAP, &ioas_map);
+		if (ret) {
+			/**
+			 * In case the mapping was already done EEXIST will be
+			 * returned from kernel.
+			 */
+			if (errno == EEXIST) {
+				EAL_LOG(DEBUG,
+					"Memory segment is already mapped, skipping");
+			} else {
+				EAL_LOG(ERR,
+					"Cannot set up DMA remapping, error "
+					"%i (%s)", errno, strerror(errno));
+				return -1;
+			}
+		}
+	} else {
+		memset(&ioas_unmap, 0, sizeof(ioas_unmap));
+		ioas_unmap.size = sizeof(struct iommu_ioas_unmap);
+		ioas_unmap.ioas_id = cfg->cdev_cfg.ioas_id;
+		ioas_unmap.length = len;
+		ioas_unmap.iova = iova;
+
+		ret = ioctl(cfg->container_fd, IOMMU_IOAS_UNMAP, &ioas_unmap);
+		if (ret) {
+			EAL_LOG(ERR, "Cannot clear DMA remapping, error "
+					"%i (%s)", errno, strerror(errno));
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int
+cdev_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
+		void *arg)
+{
+	struct container *cfg = arg;
+
+	/* skip external memory that isn't a heap */
+	if (msl->external && !msl->heap)
+		return 0;
+
+	/* skip any segments with invalid IOVA addresses */
+	if (ms->iova == RTE_BAD_IOVA)
+		return 0;
+
+	return vfio_cdev_dma_mem_map(cfg, ms->addr_64, ms->iova, ms->len, 1);
+}
+
+static int
+vfio_cdev_dma_map(struct container *cfg)
+{
+	return rte_memseg_walk(cdev_map, cfg);
+}
+
+int
+vfio_cdev_sync_ioas(struct container *cfg)
+{
+	struct rte_mp_msg mp_req, *mp_rep;
+	struct rte_mp_reply mp_reply = {0};
+	struct timespec ts = {.tv_sec = 5, .tv_nsec = 0};
+	struct vfio_mp_param *p = (struct vfio_mp_param *)mp_req.param;
+
+	p->req = SOCKET_REQ_IOAS_ID;
+	rte_strscpy(mp_req.name, EAL_VFIO_MP, sizeof(mp_req.name));
+	mp_req.len_param = sizeof(*p);
+	mp_req.num_fds = 0;
+
+	if (rte_mp_request_sync(&mp_req, &mp_reply, &ts) == 0 && mp_reply.nb_received == 1) {
+		mp_rep = &mp_reply.msgs[0];
+		p = (struct vfio_mp_param *)mp_rep->param;
+		if (p->result == SOCKET_OK && mp_rep->num_fds == 0) {
+			cfg->cdev_cfg.ioas_id = p->ioas_id;
+			free(mp_reply.msgs);
+			return 0;
+		}
+	}
+
+	free(mp_reply.msgs);
+	EAL_LOG(ERR, "Cannot request ioas_id");
+	return -1;
+}
+
+int
+vfio_cdev_setup_ioas(struct container *cfg)
+{
+	struct iommu_ioas_alloc ioas_alloc;
+	int ret;
+
+	/* Allocate an IOAS */
+	memset(&ioas_alloc, 0, sizeof(ioas_alloc));
+	ioas_alloc.size = sizeof(struct iommu_ioas_alloc);
+	ioas_alloc.flags = 0;
+
+	ret = ioctl(cfg->container_fd, IOMMU_IOAS_ALLOC, &ioas_alloc);
+	if (ret) {
+		EAL_LOG(ERR, "Cannot allocate IOAS, error %i (%s)",
+				errno, strerror(errno));
+		return -1;
+	}
+	cfg->cdev_cfg.ioas_id = ioas_alloc.out_ioas_id;
+
+	EAL_LOG(DEBUG, "Allocated IOAS with ID %u", cfg->cdev_cfg.ioas_id);
+	return 0;
+}
+
+int
+vfio_cdev_get_iommufd(void)
+{
+	int iommufd;
+
+	/* if not requesting via mp, open iommufd locally */
+	iommufd = open(RTE_VFIO_IOMMUFD_PATH, O_RDWR);
+	if (iommufd < 0) {
+		EAL_LOG(ERR, "Cannot open %s: %s",
+				RTE_VFIO_IOMMUFD_PATH, strerror(errno));
+		return -1;
+	}
+
+	return iommufd;
+}
+
+int
+vfio_cdev_enable(struct container *cfg)
+{
+	int iommufd;
+
+	/* Check if iommufd device exists */
+	if (access(RTE_VFIO_IOMMUFD_PATH, F_OK) != 0) {
+		EAL_LOG(DEBUG,
+			"IOMMUFD device does not exist, skipping VFIO cdev support...");
+		return 1;
+	}
+
+	/* open iommufd */
+	iommufd = vfio_cdev_get_iommufd();
+	if (iommufd < 0)
+		return -1;
+
+	/* cdev mode has only one set of IOMMU ops */
+	vfio_cfg.ops = &iommufd_ops;
+
+	cfg->container_fd = iommufd;
+	return 0;
+}
+
+int
+vfio_cdev_get_device_num(const char *sysfs_base, const char *dev_addr, int *cdev_dev_num)
+{
+	char linkname[PATH_MAX];
+	char filename[PATH_MAX];
+	char *dev_tok, *end;
+	int dev_num;
+	DIR *dir;
+	struct dirent *entry;
+
+	memset(linkname, 0, sizeof(linkname));
+	memset(filename, 0, sizeof(filename));
+
+	/* check if vfio-dev directory exists for this device */
+	snprintf(linkname, sizeof(linkname),
+			 "%s/%s/vfio-dev", sysfs_base, dev_addr);
+
+	dir = opendir(linkname);
+	if (dir == NULL) {
+		/* device doesn't have vfio-dev, not bound to vfio-pci cdev */
+		return 0;
+	}
+
+	/* find vfioX entry in vfio-dev directory */
+	while ((entry = readdir(dir)) != NULL) {
+		if (strncmp(entry->d_name, "vfio", 4) == 0) {
+			/* parse device number from vfioX */
+			errno = 0;
+			dev_tok = entry->d_name + 4; /* skip "vfio" prefix */
+			end = dev_tok;
+			dev_num = strtol(dev_tok, &end, 10);
+			if (end == dev_tok || *end != '\0' || errno != 0) {
+				EAL_LOG(ERR, "%s error parsing VFIO cdev device number!",
+						dev_addr);
+				closedir(dir);
+				return -1;
+			}
+			*cdev_dev_num = dev_num;
+			closedir(dir);
+			return 1;
+		}
+	}
+
+	closedir(dir);
+	/* no vfio device found */
+	return 0;
+}
+
+struct vfio_device *
+vfio_cdev_get_dev_by_num(struct container *cfg, int cdev_dev_num)
+{
+	struct vfio_device *dev;
+	/* find device handle */
+	DEVICE_FOREACH_ACTIVE(cfg, dev) {
+		if (dev->dev_num != cdev_dev_num)
+			continue;
+		return dev;
+	}
+	return NULL;
+}
+
+static int
+cdev_open_device_fd(int cdev_dev_num)
+{
+	char devname[PATH_MAX] = {0};
+	int dev_fd;
+
+	snprintf(devname, sizeof(devname), "%s/vfio%d",
+			RTE_VFIO_CDEV_DEVICES_PATH, cdev_dev_num);
+
+	dev_fd = open(devname, O_RDWR);
+	if (dev_fd < 0) {
+		EAL_LOG(ERR, "Cannot open %s: %s", devname, strerror(errno));
+		return -1;
+	}
+
+	return dev_fd;
+}
+
+static int
+cdev_attach_device_to_iommufd(struct container *cfg, struct vfio_device *dev)
+{
+	struct vfio_device_bind_iommufd bind = {0};
+	struct vfio_device_attach_iommufd_pt attach = {0};
+	rte_uuid_t vf_token;
+
+	rte_eal_vfio_get_vf_token(vf_token);
+
+	/* try with token first */
+	if (!rte_uuid_is_null(vf_token)) {
+		bind.flags = VFIO_DEVICE_BIND_FLAG_TOKEN;
+		bind.token_uuid_ptr = (uintptr_t)&vf_token;
+		bind.argsz = sizeof(bind);
+		bind.iommufd = cfg->container_fd;
+
+		/* this may fail because the kernel is too old */
+		if (ioctl(dev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind) < 0) {
+			EAL_LOG(DEBUG, "Failed to bind device %d with VF token", dev->dev_num);
+			EAL_LOG(NOTICE, "Unable to use VF tokens with current kernel version.");
+			EAL_LOG(NOTICE, "Please use kernel >=6.17 or use group mode.");
+			/* erase the bind structure */
+			bind = (struct vfio_device_bind_iommufd){0};
+		} else {
+			goto attach;
+		}
+	}
+	bind.flags = 0;
+	bind.argsz = sizeof(bind);
+	bind.iommufd = cfg->container_fd;
+
+	if (ioctl(dev->fd, VFIO_DEVICE_BIND_IOMMUFD, &bind) < 0) {
+		EAL_LOG(ERR, "Cannot bind device to IOMMUFD, error %i (%s)",
+				errno, strerror(errno));
+		return -1;
+	}
+
+attach:
+	/* attach device to IOAS */
+	attach.argsz = sizeof(attach);
+	attach.flags = 0;
+	attach.pt_id = cfg->cdev_cfg.ioas_id;
+
+	if (ioctl(dev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach) < 0) {
+		EAL_LOG(ERR, "Cannot attach device to IOAS, error %i (%s)",
+				errno, strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vfio_cdev_request_dev_fd(struct vfio_device *dev)
+{
+	struct rte_mp_msg mp_req, *mp_rep;
+	struct rte_mp_reply mp_reply = {0};
+	struct timespec ts = {.tv_sec = 5, .tv_nsec = 0};
+	struct vfio_mp_param *p = (struct vfio_mp_param *)mp_req.param;
+	int device_fd = -1;
+
+	/* secondary process requests device fd from primary */
+	p->req = SOCKET_REQ_CDEV;
+	p->cdev_dev_num = dev->dev_num;
+	rte_strscpy(mp_req.name, EAL_VFIO_MP, sizeof(mp_req.name));
+	mp_req.len_param = sizeof(*p);
+	mp_req.num_fds = 0;
+
+	if (rte_mp_request_sync(&mp_req, &mp_reply, &ts) == 0 &&
+			mp_reply.nb_received == 1) {
+		mp_rep = &mp_reply.msgs[0];
+		p = (struct vfio_mp_param *)mp_rep->param;
+		if (p->result == SOCKET_OK && mp_rep->num_fds == 1)
+			device_fd = mp_rep->fds[0];
+	}
+
+	free(mp_reply.msgs);
+
+	if (device_fd < 0) {
+		EAL_LOG(ERR, "Cannot request device fd for vfio%d", dev->dev_num);
+		return -1;
+	}
+	dev->fd = device_fd;
+
+	return 0;
+}
+
+int
+vfio_cdev_setup_device(struct container *cfg, struct vfio_device *dev)
+{
+	int device_fd;
+
+	/* get device fd - primary or custom container opens it, secondary requests from primary */
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY || !vfio_container_is_default(cfg)) {
+		device_fd = cdev_open_device_fd(dev->dev_num);
+		if (device_fd < 0)
+			return -1;
+		dev->fd = device_fd;
+
+		/* attach device to iommufd - only when the fd was opened locally */
+		if (cdev_attach_device_to_iommufd(cfg, dev) < 0)
+			return -1;
+	} else if (vfio_cdev_request_dev_fd(dev) < 0) {
+		return -1;
+	}
+	return 0;
+}
diff --git a/lib/eal/linux/eal_vfio_mp_sync.c b/lib/eal/linux/eal_vfio_mp_sync.c
index 9a07d35023..6d94f44af8 100644
--- a/lib/eal/linux/eal_vfio_mp_sync.c
+++ b/lib/eal/linux/eal_vfio_mp_sync.c
@@ -93,6 +93,48 @@ vfio_mp_primary(const struct rte_mp_msg *msg, const void *peer)
 		}
 		break;
 	}
+	case SOCKET_REQ_CDEV:
+	{
+		struct container *cfg;
+		struct vfio_device *dev;
+
+		if (vfio_cfg.mode != RTE_VFIO_MODE_CDEV) {
+			EAL_LOG(ERR, "VFIO not initialized in cdev mode");
+			r->result = SOCKET_ERR;
+			break;
+		}
+
+		r->req = SOCKET_REQ_CDEV;
+		r->cdev_dev_num = m->cdev_dev_num;
+
+		cfg = vfio_cfg.default_cfg;
+		dev = vfio_cdev_get_dev_by_num(cfg, m->cdev_dev_num);
+		if (dev == NULL) {
+			r->result = SOCKET_NO_FD;
+		} else {
+			r->result = SOCKET_OK;
+			reply.num_fds = 1;
+			reply.fds[0] = dev->fd;
+		}
+		break;
+	}
+	case SOCKET_REQ_IOAS_ID:
+	{
+		struct container *cfg;
+
+		if (vfio_cfg.mode != RTE_VFIO_MODE_CDEV) {
+			EAL_LOG(ERR, "VFIO not initialized in cdev mode");
+			r->result = SOCKET_ERR;
+			break;
+		}
+
+		r->req = SOCKET_REQ_IOAS_ID;
+		cfg = vfio_cfg.default_cfg;
+		r->ioas_id = cfg->cdev_cfg.ioas_id;
+
+		r->result = SOCKET_OK;
+		break;
+	}
 	default:
 		EAL_LOG(ERR, "vfio received invalid message!");
 		return -1;
diff --git a/lib/eal/linux/meson.build b/lib/eal/linux/meson.build
index 5ec8eddaa2..c164a30b49 100644
--- a/lib/eal/linux/meson.build
+++ b/lib/eal/linux/meson.build
@@ -16,6 +16,7 @@ sources += files(
         'eal_thread.c',
         'eal_timer.c',
         'eal_vfio.c',
+        'eal_vfio_cdev.c',
         'eal_vfio_group.c',
         'eal_vfio_mp_sync.c',
 )
-- 
2.47.3
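
The "vfioX" parsing done in vfio_cdev_get_device_num() above can be
sketched in isolation as follows. This is only an illustration, not code
from the patch: parse_vfio_dev_name() is a hypothetical helper, and the
explicit digit and INT_MAX checks are assumptions added here so that the
long returned by strtol() is never narrowed silently into an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): extract X from "vfioX".
 * Returns 0 and stores X in *dev_num on success, -1 otherwise.
 */
int
parse_vfio_dev_name(const char *name, int *dev_num)
{
	const char *tok;
	char *end;
	long val;

	/* name must start with the "vfio" prefix */
	if (strncmp(name, "vfio", 4) != 0)
		return -1;

	/* require a digit right after the prefix, so that the sign and
	 * leading whitespace strtol() would otherwise accept are rejected */
	tok = name + 4;
	if (tok[0] < '0' || tok[0] > '9')
		return -1;

	errno = 0;
	val = strtol(tok, &end, 10);
	/* reject trailing characters and out-of-range values */
	if (*end != '\0' || errno != 0)
		return -1;
	if (val > INT_MAX)
		return -1;

	*dev_num = (int)val;
	return 0;
}
```

With this helper, "vfio12" yields 12, while "vfio", "vfio3x" and "vfio+3"
are all rejected.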


^ permalink raw reply related

* [PATCH v8 17/18] vfio: remove no-IOMMU check API
From: Anatoly Burakov @ 2026-06-11 15:09 UTC (permalink / raw)
  To: dev, Bruce Richardson
In-Reply-To: <cover.1781190151.git.anatoly.burakov@intel.com>

The `rte_vfio_noiommu_is_enabled()` check has now been replaced by the new
mode API, so remove it.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/freebsd/eal.c      |  6 ------
 lib/eal/include/rte_vfio.h | 14 --------------
 lib/eal/linux/eal_vfio.c   |  7 -------
 3 files changed, 27 deletions(-)
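
For any remaining caller of the removed function, the replacement is a
comparison against the value returned by the mode API, as the ntnic
patch in this series does. A minimal sketch, assuming stand-in
definitions: the enum below lists only the two enumerators that appear
in this series and gives them arbitrary values, rte_vfio_get_mode() is
stubbed, and demo_mode and drv_vfio_is_noiommu() are hypothetical names
used only for illustration.

```c
#include <stdbool.h>

/*
 * Stand-in for the real enum in rte_vfio.h: only the enumerators seen in
 * this series are listed, and the values here are arbitrary.
 */
enum rte_vfio_mode {
	RTE_VFIO_MODE_CDEV,
	RTE_VFIO_MODE_NOIOMMU,
};

/* Hypothetical hook: lets the sketch pretend EAL selected a mode. */
static enum rte_vfio_mode demo_mode = RTE_VFIO_MODE_CDEV;

/* Stub with the same signature as the internal API in rte_vfio.h. */
enum rte_vfio_mode
rte_vfio_get_mode(void)
{
	return demo_mode;
}

/*
 * Hypothetical driver-side helper: what a former caller of
 * rte_vfio_noiommu_is_enabled() would write instead.
 */
bool
drv_vfio_is_noiommu(void)
{
	return rte_vfio_get_mode() == RTE_VFIO_MODE_NOIOMMU;
}
```

The comparison evaluates to 1 or 0, so code that stored the old return
value in an int keeps working unchanged.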

diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 6c1d1e3751..bb05a969a9 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -850,12 +850,6 @@ int rte_vfio_is_enabled(__rte_unused const char *modname)
 	return 0;
 }
 
-RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_noiommu_is_enabled)
-int rte_vfio_noiommu_is_enabled(void)
-{
-	return 0;
-}
-
 RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_get_group_num)
 int
 rte_vfio_get_group_num(__rte_unused const char *sysfs_base,
diff --git a/lib/eal/include/rte_vfio.h b/lib/eal/include/rte_vfio.h
index 0af41c3610..c4ba0e5cda 100644
--- a/lib/eal/include/rte_vfio.h
+++ b/lib/eal/include/rte_vfio.h
@@ -170,20 +170,6 @@ __rte_internal
 enum rte_vfio_mode
 rte_vfio_get_mode(void);
 
-/**
- * @internal
- * Check if VFIO NOIOMMU mode is enabled.
- *
- * This function is only relevant on Linux in group mode.
- *
- * @return
- *   1 if enabled.
- *   0 if not enabled or not supported.
- */
-__rte_internal
-int
-rte_vfio_noiommu_is_enabled(void);
-
 /**
  * @internal
  * Parse IOMMU group number for a device.
diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
index 708d14ad51..c104008a43 100644
--- a/lib/eal/linux/eal_vfio.c
+++ b/lib/eal/linux/eal_vfio.c
@@ -1278,13 +1278,6 @@ container_dma_unmap(struct container *cfg, uint64_t vaddr, uint64_t iova,
 	return ret;
 }
 
-RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_noiommu_is_enabled)
-int
-rte_vfio_noiommu_is_enabled(void)
-{
-	return vfio_cfg.mode == RTE_VFIO_MODE_NOIOMMU;
-}
-
 RTE_EXPORT_INTERNAL_SYMBOL(rte_vfio_container_create)
 int
 rte_vfio_container_create(void)
-- 
2.47.3


^ permalink raw reply related

* [PATCH v8 16/18] net/ntnic: use the new VFIO mode API
From: Anatoly Burakov @ 2026-06-11 15:09 UTC (permalink / raw)
  To: dev, Christian Koue Muf, Serhii Iliushyk
In-Reply-To: <cover.1781190151.git.anatoly.burakov@intel.com>

Use the new VFIO mode API to query no-IOMMU status.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 drivers/net/ntnic/ntnic_ethdev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ntnic/ntnic_ethdev.c b/drivers/net/ntnic/ntnic_ethdev.c
index 7cc90a7a5b..8b6bca974c 100644
--- a/drivers/net/ntnic/ntnic_ethdev.c
+++ b/drivers/net/ntnic/ntnic_ethdev.c
@@ -2690,7 +2690,7 @@ nthw_pci_probe(struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev)
 			(pci_dev->device.devargs->data ? pci_dev->device.devargs->data : "NULL"));
 	}
 
-	const int n_rte_vfio_no_io_mmu_enabled = rte_vfio_noiommu_is_enabled();
+	const int n_rte_vfio_no_io_mmu_enabled = rte_vfio_get_mode() == RTE_VFIO_MODE_NOIOMMU;
 	NT_LOG(DBG, NTNIC, "vfio_no_iommu_enabled=%d", n_rte_vfio_no_io_mmu_enabled);
 
 	if (n_rte_vfio_no_io_mmu_enabled) {
-- 
2.47.3


^ permalink raw reply related
