* [RFC 0/8] ioring: network driver
@ 2024-12-10 21:23 Stephen Hemminger
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This is a first draft of a new simplified TAP device that uses
the Linux kernel io_uring API to provide read/write rings
shared with the kernel.
This is split from the existing tap device because that driver
carries many unnecessary things, and supporting io_uring is
cleaner without ifdefs etc. The default name of the tap
device is different from other uses in DPDK, but the driver
tries to keep the same relevant devargs as before.
This driver will only provide features that match what the kernel
does, so no flow support etc. The next version will add checksum
offload and multi-segment packets. Some of the doc files may need
updates as well.
Stephen Hemminger (8):
net/ioring: introduce new driver
net/ioring: implement link state
net/ioring: implement control functions
net/ioring: implement management functions
net/ioring: implement primary secondary fd passing
net/ioring: implement receive and transmit
net/ioring: add VLAN support
net/ioring: implement statistics
doc/guides/nics/features/ioring.ini | 14 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 ++
drivers/net/ioring/meson.build | 12 +
drivers/net/ioring/rte_eth_ioring.c | 1067 +++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 1161 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
--
2.45.2
* [RFC 1/8] net/ioring: introduce new driver
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add basic driver initialization, device creation, and documentation.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 9 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 +++++++
drivers/net/ioring/meson.build | 12 ++
drivers/net/ioring/rte_eth_ioring.c | 262 ++++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 351 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
new file mode 100644
index 0000000000..c4c57caaa4
--- /dev/null
+++ b/doc/guides/nics/features/ioring.ini
@@ -0,0 +1,9 @@
+;
+; Supported features of the 'ioring' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 50688d9f64..e4d243622e 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -41,6 +41,7 @@ Network Interface Controller Drivers
igc
intel_vf
ionic
+ ioring
ipn3ke
ixgbe
mana
diff --git a/doc/guides/nics/ioring.rst b/doc/guides/nics/ioring.rst
new file mode 100644
index 0000000000..7d37a6bb37
--- /dev/null
+++ b/doc/guides/nics/ioring.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+IORING Poll Mode Driver
+=======================
+
+The IORING Poll Mode Driver (PMD) is a simplified and improved version of the TAP PMD. It is a
+virtual device that uses Linux io_uring to inject packets into the Linux kernel.
+It is useful when writing DPDK applications that need to interact
+with the Linux TCP/IP stack for control plane or tunneling purposes.
+
+The IORING PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the IORING device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query information, retrieve statistics,
+and send/receive packets.
+
+Requirements
+------------
+
+The IORING PMD requires the io_uring library (liburing), which provides helper
+functions for managing io_uring with the kernel.
+
+For more info on io_uring, please see:
+
+https://kernel.dk/io_uring.pdf
+
+
+Arguments
+---------
+
+IORING devices are created with the ``--vdev=net_ioring0`` command line option.
+This option may be specified more than once by repeating it with a different ``net_ioringX`` device name.
+
+By default, the Linux interfaces are named ``enio0``, ``enio1``, etc.
+The interface name can be specified by adding the ``iface`` argument, for example::
+
+ --vdev=net_ioring0,iface=io0 --vdev=net_ioring1,iface=io1, ...
+
+The PMD inherits the MAC address assigned by the kernel, which will be
+a locally administered random Ethernet address.
+
+Normally, when the DPDK application exits, the IORING device is removed.
+This behavior can be overridden with the persist flag, for example::
+
+ --vdev=net_ioring0,iface=io0,persist ...
+
+
+Multi-process sharing
+---------------------
+
+The IORING device does not support secondary processes (yet).
+
+
+Limitations
+-----------
+
+- The driver requires io_uring support, which was added in Linux kernel version 5.1.
+  Also, io_uring may be disabled in some environments or by security policies.
+
+- Since the IORING device uses a file descriptor to talk to the kernel,
+  the same number of queues must be specified for receive and transmit.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
diff --git a/drivers/net/ioring/meson.build b/drivers/net/ioring/meson.build
new file mode 100644
index 0000000000..198063e51f
--- /dev/null
+++ b/drivers/net/ioring/meson.build
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+ext_deps += dependency('liburing', required:true)
+
+sources = files('rte_eth_ioring.c')
+require_iova_in_mbuf = false
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
new file mode 100644
index 0000000000..7b62c47f54
--- /dev/null
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) Stephen Hemminger
+ */
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define IORING_DEFAULT_IFNAME "enio%d"
+
+RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
+#define RTE_LOGTYPE_IORING ioring_logtype
+#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+
+#define IORING_IFACE_ARG "iface"
+#define IORING_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ IORING_IFACE_ARG,
+ IORING_PERSIST_ARG,
+ NULL
+};
+
+struct pmd_internals {
+ int keep_fd; /* keep alive file descriptor */
+ char ifname[IFNAMSIZ]; /* name assigned by kernel */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+/* Creates a new tap device, name returned in ifr */
+static int
+tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "Open %s failed: %s", tun_dev, strerror(errno));
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNGETFEATURES) %s", strerror(errno));
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ if ((features & flags) == 0) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, features & flags);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+ /* If kernel supports using NAPI enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+ * Sets the device name and packet format.
+ * Do not want the protocol information (PI)
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETIFF) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETPERSIST) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+eth_dev_close(struct rte_eth_dev *dev)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing %s", pmd->ifname);
+
+ if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+ return 0;
+
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+
+ return 0;
+}
+
+static const struct eth_dev_ops ops = {
+ .dev_close = eth_dev_close,
+};
+
+static int
+ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct pmd_internals *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+
+ data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ dev->dev_ops = &ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { };
+ pmd->keep_fd = tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ strlcpy(pmd->ifname, ifr.ifr_name, IFNAMSIZ);
+
+ /* Read the MAC address assigned by the kernel */
+ if (ioctl(pmd->keep_fd, SIOCGIFHWADDR, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to get MAC address for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+ memcpy(&pmd->eth_addr, &ifr.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to detach keep-alive queue for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+ return 0;
+
+error:
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
+static int
+parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+ /* must be a non-empty string that fits in IFNAMSIZ */
+ if (value == NULL || value[0] == '\0' ||
+ strnlen(value, IFNAMSIZ) == IFNAMSIZ)
+ return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+ioring_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+ return -1; /* TODO */
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, IORING_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, IORING_IFACE_ARG,
+ &parse_iface_arg, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, IORING_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ if (ioring_create(eth_dev, tap_name, persist) < 0)
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+
+error:
+ if (eth_dev != NULL)
+ rte_eth_dev_release_port(eth_dev);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+ioring_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ eth_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_ioring_drv = {
+ .probe = ioring_probe,
+ .remove = ioring_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ioring, pmd_ioring_drv);
+RTE_PMD_REGISTER_ALIAS(net_ioring, eth_ioring);
+RTE_PMD_REGISTER_PARAM_STRING(net_ioring, IORING_IFACE_ARG "=<string> ");
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index dafd637ba4..b68a55e916 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -33,6 +33,7 @@ drivers = [
'idpf',
'igc',
'ionic',
+ 'ioring',
'ipn3ke',
'ixgbe',
'mana',
--
2.45.2
* [RFC 2/8] net/ioring: implement link state
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add hooks to set kernel link up/down and report state.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 84 +++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index c4c57caaa4..d4bf70cb4f 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 7b62c47f54..fa3e748cda 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -47,6 +47,53 @@ struct pmd_internals {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+static int
+eth_dev_change_flags(struct rte_eth_dev *dev, uint16_t flags, uint16_t mask)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret < 0)
+ goto error;
+
+ /* NB: ifr.ifr_flags is type short */
+ ifr.ifr_flags &= mask;
+ ifr.ifr_flags |= flags;
+
+ ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
+error:
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+static int
+eth_dev_get_flags(struct rte_eth_dev *dev, short *flags)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret == 0)
+ *flags = ifr.ifr_flags;
+
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+
/* Creates a new tap device, name returned in ifr */
static int
tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
@@ -103,6 +150,39 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+
+static int
+eth_dev_set_link_up(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_UP, 0);
+}
+
+static int
+eth_dev_set_link_down(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_UP);
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rte_eth_link *eth_link = &dev->data->dev_link;
+ short flags = 0;
+
+ if (eth_dev_get_flags(dev, &flags) < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCGIFFLAGS): %s", strerror(errno));
+ return -1;
+ }
+
+ *eth_link = (struct rte_eth_link) {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_status = (flags & IFF_UP) ? RTE_ETH_LINK_UP : RTE_ETH_LINK_DOWN,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ };
+ return 0;
+};
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -126,8 +206,12 @@ eth_dev_close(struct rte_eth_dev *dev)
static const struct eth_dev_ops ops = {
.dev_close = eth_dev_close,
+ .link_update = eth_link_update,
+ .dev_set_link_up = eth_dev_set_link_up,
+ .dev_set_link_down = eth_dev_set_link_down,
};
+
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
--
2.45.2
* [RFC 3/8] net/ioring: implement control functions
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
These internal ops just force changes to the kernel-visible net device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 3 ++
drivers/net/ioring/rte_eth_ioring.c | 69 +++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index d4bf70cb4f..199c7cd31c 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -5,6 +5,9 @@
;
[Features]
Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index fa3e748cda..de10a4d83f 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -13,6 +13,7 @@
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if.h>
+#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <bus_vdev_driver.h>
@@ -163,6 +164,30 @@ eth_dev_set_link_down(struct rte_eth_dev *dev)
return eth_dev_change_flags(dev, 0, ~IFF_UP);
}
+static int
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_PROMISC, ~0);
+}
+
+static int
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_PROMISC);
+}
+
+static int
+eth_dev_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_ALLMULTI, ~0);
+}
+
+static int
+eth_dev_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_ALLMULTI);
+}
+
static int
eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
{
@@ -183,6 +208,44 @@ eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
return 0;
};
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { .ifr_mtu = mtu };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
+static int
+eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+ ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+ memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -209,6 +272,12 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
.dev_set_link_down = eth_dev_set_link_down,
+ .mac_addr_set = eth_dev_macaddr_set,
+ .mtu_set = eth_dev_mtu_set,
+ .promiscuous_enable = eth_dev_promiscuous_enable,
+ .promiscuous_disable = eth_dev_promiscuous_disable,
+ .allmulticast_enable = eth_dev_allmulticast_enable,
+ .allmulticast_disable = eth_dev_allmulticast_disable,
};
--
2.45.2
* [RFC 4/8] net/ioring: implement management functions
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add start, stop, configure and info functions.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 72 ++++++++++++++++++++++++++---
1 file changed, 66 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index de10a4d83f..b5d9c12bdf 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -212,12 +212,16 @@ static int
eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
{
struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
struct ifreq ifr = { .ifr_mtu = mtu };
- int ret;
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
- ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ int ret = ioctl(sock, SIOCSIFMTU, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
ret = -errno;
@@ -230,14 +234,17 @@ static int
eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
{
struct pmd_internals *pmd = dev->data->dev_private;
- struct ifreq ifr = { };
- int ret;
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
- ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ int ret = ioctl(sock, SIOCSIFHWADDR, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
ret = -errno;
@@ -246,6 +253,56 @@ eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
return ret;
}
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ eth_dev_set_link_up(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ eth_dev_set_link_down(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev)
+{
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ dev_info->if_index = if_nametoindex(pmd->ifname);
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -263,11 +320,14 @@ eth_dev_close(struct rte_eth_dev *dev)
close(pmd->keep_fd);
pmd->keep_fd = -1;
}
-
return 0;
}
static const struct eth_dev_ops ops = {
+ .dev_start = eth_dev_start,
+ .dev_stop = eth_dev_stop,
+ .dev_configure = eth_dev_configure,
+ .dev_infos_get = eth_dev_info,
.dev_close = eth_dev_close,
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
--
2.45.2
* [RFC 5/8] net/ioring: implement primary secondary fd passing
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for passing file descriptors from the primary to secondary processes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 136 +++++++++++++++++++++++++++-
2 files changed, 135 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index 199c7cd31c..da47062adb 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -8,6 +8,7 @@ Link status = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
+Multiprocess aware = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index b5d9c12bdf..ddef57adfb 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -28,6 +28,7 @@
#include <rte_log.h>
#define IORING_DEFAULT_IFNAME "enio%d"
+#define IORING_MP_KEY "ioring_mp_send_fds"
RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
@@ -400,6 +401,84 @@ parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_arg
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+ioring_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, IORING_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name);
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+ struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+ struct rte_mp_reply replies;
+ int ret = rte_mp_request_sync(&request, &replies, &timeout);
+ if (ret < 0 || replies.nb_received != 1) {
+ PMD_LOG(ERR, "Failed to request fds from primary: %s",
+ rte_strerror(rte_errno));
+ return -1;
+ }
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++)
+ fds[i] = reply->fds[i];
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+ioring_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (!dev) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the tap fd for each rx queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++)
+ reply.fds[reply.num_fds++] = fds[i];
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name);
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
ioring_probe(struct rte_vdev_device *vdev)
{
@@ -407,14 +486,43 @@ ioring_probe(struct rte_vdev_device *vdev)
const char *params = rte_vdev_device_args(vdev);
struct rte_kvargs *kvlist = NULL;
struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
uint8_t persist = 0;
int ret;
PMD_LOG(INFO, "Initializing %s", name);
- if (rte_eal_process_type() == RTE_PROC_SECONDARY)
- return -1; /* TODO */
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (!eth_dev) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &ops;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ return -1;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ return -1;
+ }
+
+ eth_dev->process_private = fds;
+
+ if (ioring_request_fds(name, eth_dev))
+ return -1;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
@@ -432,21 +540,45 @@ ioring_probe(struct rte_vdev_device *vdev)
persist = 1;
}
+ /* Per-queue tap fd's (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ return -1;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
if (eth_dev == NULL) {
PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
goto error;
}
+ eth_dev->data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->dev_ops = &ops;
+ eth_dev->process_private = fds;
+
if (ioring_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+ static unsigned int ioring_dev_count;
+ if (ioring_dev_count == 0) {
+ if (rte_mp_action_register(IORING_MP_KEY, ioring_mp_send_fds) < 0) {
+ PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+ rte_strerror(rte_errno));
+ goto error;
+ }
+ }
+ ++ioring_dev_count;
rte_eth_dev_probing_finish(eth_dev);
return 0;
error:
if (eth_dev != NULL)
rte_eth_dev_release_port(eth_dev);
+ free(fds);
rte_kvargs_free(kvlist);
return -1;
}
--
2.45.2
* [RFC 6/8] net/ioring: implement receive and transmit
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (4 preceding siblings ...)
2024-12-10 21:23 ` [RFC 5/8] net/ioring: implement primary secondary fd passing Stephen Hemminger
@ 2024-12-10 21:23 ` Stephen Hemminger
2024-12-10 21:23 ` [RFC 7/8] net/ioring: add VLAN support Stephen Hemminger
` (7 subsequent siblings)
13 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use io_uring to read and write packets from the TAP device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 364 +++++++++++++++++++++++++++-
1 file changed, 363 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index ddef57adfb..fa79bc5667 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -2,6 +2,7 @@
* Copyright (c) Stephen Hemminger
*/
+#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
@@ -9,8 +10,10 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <liburing.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
+#include <sys/uio.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_arp.h>
@@ -27,6 +30,13 @@
#include <rte_kvargs.h>
#include <rte_log.h>
+#define IORING_DEFAULT_BURST 64
+#define IORING_NUM_BUFFERS 1024
+#define IORING_MAX_QUEUES 128
+
+
+static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define IORING_DEFAULT_IFNAME "enio%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -34,6 +44,20 @@ RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
#define IORING_IFACE_ARG "iface"
#define IORING_PERSIST_ARG "persist"
@@ -43,6 +67,30 @@ static const char * const valid_arguments[] = {
NULL
};
+struct rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted reads */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_nombuf;
+ uint64_t rx_errors;
+};
+
+struct tx_queue {
+ struct io_uring io_ring;
+
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+};
+
struct pmd_internals {
int keep_fd; /* keep alive file descriptor */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
@@ -300,6 +348,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->if_index = if_nametoindex(pmd->ifname);
dev_info->max_mac_addrs = 1;
dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+ dev_info->max_rx_queues = IORING_MAX_QUEUES;
+ dev_info->max_tx_queues = IORING_MAX_QUEUES;
+ dev_info->min_rx_bufsize = 0;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = IORING_DEFAULT_BURST,
+ .ring_size = IORING_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
return 0;
}
@@ -311,6 +368,14 @@ eth_dev_close(struct rte_eth_dev *dev)
PMD_LOG(INFO, "Closing %s", pmd->ifname);
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ if (fds[i] == -1)
+ continue;
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
@@ -324,6 +389,296 @@ eth_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Setup another fd to TAP device for the queue */
+static int
+eth_queue_setup(struct rte_eth_dev *dev, const char *name, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+
+ if (fds[queue_id] != -1)
+ return 0; /* already setup */
+
+ struct ifreq ifr = { };
+ int tap_fd = tap_open(name, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+static int
+eth_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+
+ return fds[queue_id];
+}
+
+/* post a submit queue entry to read into an mbuf */
+static inline void
+eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+
+ if (unlikely(sqe == NULL)) {
+ PMD_LOG(DEBUG, "io_uring no rx sqe");
+ rxq->rx_errors++;
+ } else {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ size_t len = mb->buf_len;
+
+ io_uring_prep_read(sqe, fd, base, len, 0);
+ io_uring_sqe_set_data(sqe, mb);
+ }
+}
+
+static uint16_t
+eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ ssize_t len = cqe->res;
+
+ PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
+ num_cqe++;
+
+ if (unlikely(len < RTE_ETHER_HDR_LEN)) {
+ if (len < 0)
+ PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_LOG(ERR, "io_uring_read missing hdr");
+
+ rxq->rx_errors++;
+ goto resubmit;
+ }
+
+ struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(nmb == NULL)) {
+ PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ ++rxq->rx_nombuf;
+ goto resubmit;
+ }
+
+ mb->pkt_len = len;
+ mb->data_len = len;
+ mb->port = rxq->port_id;
+ __rte_mbuf_sanity_check(mb, 1);
+
+ num_bytes += len;
+ bufs[num_rx++] = mb;
+
+ mb = nmb;
+resubmit:
+ eth_rx_submit(rxq, fd, mb);
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+ return num_rx;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ /* open the per-queue tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ struct rte_mbuf *mbufs[nb_rx_desc];
+ if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
+ PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
+ return -1;
+ }
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ goto error;
+ }
+
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ struct rte_mbuf *mb = mbufs[i];
+
+ eth_rx_submit(rxq, fd, mb);
+ }
+
+ io_uring_submit(&rxq->io_ring);
+ return 0;
+
+error:
+ rte_pktmbuf_free_bulk(mbufs, nb_rx_desc);
+ return -1;
+}
+
+static void
+eth_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&rxq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&rxq->io_ring);
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ /* open the per-queue tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq), RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+eth_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&txq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&txq->io_ring);
+}
+
+static void
+eth_ioring_tx_cleanup(struct tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ unsigned int num_cqe = 0;
+ unsigned int tx_done = 0;
+ uint64_t tx_bytes = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ num_cqe++;
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ if (unlikely(cqe->res < 0)) {
+ ++txq->tx_errors;
+ } else {
+ ++tx_done;
+ tx_bytes += mb->pkt_len;
+ }
+
+ rte_pktmbuf_free(mb);
+ }
+ /* consume every completion seen, including errored ones */
+ io_uring_cq_advance(&txq->io_ring, num_cqe);
+
+ txq->tx_packets += tx_done;
+ txq->tx_bytes += tx_bytes;
+}
+
+static uint16_t
+eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct tx_queue *txq = queue;
+ uint16_t num_tx;
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
+ eth_ioring_tx_cleanup(txq);
+
+ int fd = eth_queue_fd(txq->port_id, txq->queue_id);
+
+ for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
+ struct rte_mbuf *mb = bufs[num_tx];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL)
+ break; /* submit ring is full */
+
+ io_uring_sqe_set_data(sqe, mb);
+
+ if (rte_mbuf_refcnt_read(mb) == 1 &&
+ RTE_MBUF_DIRECT(mb) && mb->nb_segs == 1) {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+
+ PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
+ } else {
+ PMD_LOG(ERR, "multi-segment or shared mbufs not supported yet");
+ ++txq->tx_errors;
+ continue;
+ }
+ }
+ if (num_tx > 0)
+ io_uring_submit(&txq->io_ring);
+
+ return num_tx;
+}
+
static const struct eth_dev_ops ops = {
.dev_start = eth_dev_start,
.dev_stop = eth_dev_stop,
@@ -339,9 +694,12 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .rx_queue_setup = eth_rx_queue_setup,
+ .rx_queue_release = eth_rx_queue_release,
+ .tx_queue_setup = eth_tx_queue_setup,
+ .tx_queue_release = eth_tx_queue_release,
};
-
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
@@ -379,6 +737,10 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
}
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+
+ dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_burst = eth_ioring_tx;
+
return 0;
error:
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC 7/8] net/ioring: add VLAN support
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (5 preceding siblings ...)
2024-12-10 21:23 ` [RFC 6/8] net/ioring: implement receive and transmit Stephen Hemminger
@ 2024-12-10 21:23 ` Stephen Hemminger
2024-12-10 21:23 ` [RFC 8/8] net/ioring: implement statistics Stephen Hemminger
` (6 subsequent siblings)
13 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for VLAN insertion and stripping.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 45 +++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index fa79bc5667..a2bfefec45 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -34,6 +34,8 @@
#define IORING_NUM_BUFFERS 1024
#define IORING_MAX_QUEUES 128
+#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_VLAN_INSERT
+#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_VLAN_STRIP
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
@@ -70,6 +72,7 @@ static const char * const valid_arguments[] = {
struct rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted read's */
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -81,6 +84,7 @@ struct rx_queue {
struct tx_queue {
struct io_uring io_ring;
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -471,6 +475,9 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
goto resubmit;
}
+ if (rxq->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
+ rte_vlan_strip(mb);
+
mb->pkt_len = len;
mb->data_len = len;
mb->port = rxq->port_id;
@@ -495,8 +502,7 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
static int
eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
- unsigned int socket_id,
- const struct rte_eth_rxconf *rx_conf __rte_unused,
+ unsigned int socket_id, const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool)
{
struct pmd_internals *pmd = dev->data->dev_private;
@@ -515,6 +521,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_de
return -1;
}
+ rxq->offloads = rx_conf->offloads;
rxq->mb_pool = mb_pool;
rxq->port_id = dev->data->port_id;
rxq->queue_id = queue_id;
@@ -582,6 +589,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
txq->port_id = dev->data->port_id;
txq->queue_id = queue_id;
+ txq->offloads = tx_conf->offloads;
txq->free_thresh = tx_conf->tx_free_thresh;
dev->data->tx_queues[queue_id] = txq;
@@ -636,6 +644,38 @@ eth_ioring_tx_cleanup(struct tx_queue *txq)
txq->tx_bytes += tx_bytes;
}
+static uint16_t
+eth_ioring_tx_prepare(void *tx_queue __rte_unused, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+ uint16_t nb_tx;
+ int error;
+
+ for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
+ struct rte_mbuf *m = tx_pkts[nb_tx];
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+ error = rte_validate_tx_offload(m);
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+#endif
+ /* Do VLAN tag insertion */
+ if (unlikely(m->ol_flags & RTE_MBUF_F_TX_VLAN)) {
+ error = rte_vlan_insert(&m);
+ /* rte_vlan_insert() may change pointer */
+ tx_pkts[nb_tx] = m;
+
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+ }
+ }
+
+ return nb_tx;
+}
+
static uint16_t
eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
@@ -739,6 +779,7 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_prepare = eth_ioring_tx_prepare;
dev->tx_pkt_burst = eth_ioring_tx;
return 0;
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [RFC 8/8] net/ioring: implement statistics
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (6 preceding siblings ...)
2024-12-10 21:23 ` [RFC 7/8] net/ioring: add VLAN support Stephen Hemminger
@ 2024-12-10 21:23 ` Stephen Hemminger
2024-12-11 11:34 ` [RFC 0/8] ioring: network driver Konstantin Ananyev
` (5 subsequent siblings)
13 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-10 21:23 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for basic statistics
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 57 +++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index a2bfefec45..f58740197d 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -365,6 +365,61 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
return 0;
}
+static int
+eth_dev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ const struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+ stats->rx_nombuf += rxq->rx_nombuf;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_ipackets[i] = rxq->rx_packets;
+ stats->q_ibytes[i] = rxq->rx_bytes;
+ }
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ const struct tx_queue *txq = dev->data->tx_queues[i];
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_opackets[i] = txq->tx_packets;
+ stats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stats_reset(struct rte_eth_dev *dev)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_nombuf = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct tx_queue *txq = dev->data->tx_queues[i];
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -734,6 +789,8 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .stats_get = eth_dev_stats_get,
+ .stats_reset = eth_dev_stats_reset,
.rx_queue_setup = eth_rx_queue_setup,
.rx_queue_release = eth_rx_queue_release,
.tx_queue_setup = eth_tx_queue_setup,
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* RE: [RFC 0/8] ioring: network driver
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (7 preceding siblings ...)
2024-12-10 21:23 ` [RFC 8/8] net/ioring: implement statistics Stephen Hemminger
@ 2024-12-11 11:34 ` Konstantin Ananyev
2024-12-11 15:03 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (4 subsequent siblings)
13 siblings, 1 reply; 72+ messages in thread
From: Konstantin Ananyev @ 2024-12-11 11:34 UTC (permalink / raw)
To: Stephen Hemminger, dev@dpdk.org
> This is first draft of new simplified TAP device that uses
> the Linux kernel ioring API to provide a read/write ring
> with kernel.
>
> This is split from tap device because there are so many
> unnecessary things in existing tap, and supporting ioring is
> better without ifdefs etc. The default name of the tap
> device is different that other uses in DPDK but the driver
> tries to keep the same relevant devargs as before.
>
> This driver will only provide features that match what kernel
> does, so no flow support etc. The next version will add checksum
> and multi-segment packets. Some of the doc files may need update
> as well.
Makes sense to me, though didn't properly look inside.
One thing - probably add a 'tap' into the name,
'tap_ioring' or so, otherwise 'ioring' is a bit too generic
and might be confusing.
> Stephen Hemminger (8):
> net/ioring: introduce new driver
> net/ioring: implement link state
> net/ioring: implement control functions
> net/ioring: implement management functions
> net/ioring: implement primary secondary fd passing
> net/ioring: implement receive and transmit
> net/ioring: add VLAN support
> net/ioring: implement statistics
>
> doc/guides/nics/features/ioring.ini | 14 +
> doc/guides/nics/index.rst | 1 +
> doc/guides/nics/ioring.rst | 66 ++
> drivers/net/ioring/meson.build | 12 +
> drivers/net/ioring/rte_eth_ioring.c | 1067 +++++++++++++++++++++++++++
> drivers/net/meson.build | 1 +
> 6 files changed, 1161 insertions(+)
> create mode 100644 doc/guides/nics/features/ioring.ini
> create mode 100644 doc/guides/nics/ioring.rst
> create mode 100644 drivers/net/ioring/meson.build
> create mode 100644 drivers/net/ioring/rte_eth_ioring.c
>
> --
> 2.45.2
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC 0/8] ioring: network driver
2024-12-11 11:34 ` [RFC 0/8] ioring: network driver Konstantin Ananyev
@ 2024-12-11 15:03 ` Stephen Hemminger
2024-12-12 19:06 ` Konstantin Ananyev
0 siblings, 1 reply; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 15:03 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: dev@dpdk.org
On Wed, 11 Dec 2024 11:34:39 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> > This is first draft of new simplified TAP device that uses
> > the Linux kernel ioring API to provide a read/write ring
> > with kernel.
> >
> > This is split from tap device because there are so many
> > unnecessary things in existing tap, and supporting ioring is
> > better without ifdefs etc. The default name of the tap
> > device is different that other uses in DPDK but the driver
> > tries to keep the same relevant devargs as before.
> >
> > This driver will only provide features that match what kernel
> > does, so no flow support etc. The next version will add checksum
> > and multi-segment packets. Some of the doc files may need update
> > as well.
>
> Makes sense to me, though didn't properly look inside.
> One thing - probably add a 'tap' into the name,
> 'tap_ioiring' or so, otherwise 'ioring' is a bit too generic
> and might be confusing.
There are some userspace tools that look for "e*" in the name for some setups.
But names are totally arbitrary.
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v2 0/8] ioring: network driver
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (8 preceding siblings ...)
2024-12-11 11:34 ` [RFC 0/8] ioring: network driver Konstantin Ananyev
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 1/8] net/ioring: introduce new driver Stephen Hemminger
` (7 more replies)
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (3 subsequent siblings)
13 siblings, 8 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This is initial work on a new simplified TAP device that uses
the Linux kernel io_uring API to provide a read/write ring
with the kernel.
This is split from the tap device because there are so many
unnecessary things in the existing tap driver, and supporting io_uring
is cleaner without ifdefs etc. The default name of the tap
device is different from that used elsewhere in DPDK, but the driver
tries to keep the same relevant devargs as before.
This driver will only provide features that match what the kernel
does, so no flow support etc. The next version will add checksum
and multi-segment packets. Some of the doc files may need update
as well.
Stephen Hemminger (8):
net/ioring: introduce new driver
net/ioring: implement link state
net/ioring: implement control functions
net/ioring: implement management functions
net/ioring: implement primary secondary fd passing
net/ioring: implement receive and transmit
net/ioring: add VLAN support
net/ioring: implement statistics
doc/guides/nics/features/ioring.ini | 16 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 ++
drivers/net/ioring/meson.build | 15 +
drivers/net/ioring/rte_eth_ioring.c | 1068 +++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 1167 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
--
2.45.2
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v2 1/8] net/ioring: introduce new driver
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-28 16:39 ` Morten Brørup
2024-12-11 16:28 ` [PATCH v2 2/8] net/ioring: implement link state Stephen Hemminger
` (6 subsequent siblings)
7 siblings, 1 reply; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add basic driver initialization, device creation,
and basic documentation.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 9 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 +++++++
drivers/net/ioring/meson.build | 15 ++
drivers/net/ioring/rte_eth_ioring.c | 262 ++++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 354 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
new file mode 100644
index 0000000000..c4c57caaa4
--- /dev/null
+++ b/doc/guides/nics/features/ioring.ini
@@ -0,0 +1,9 @@
+;
+; Supported features of the 'ioring' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 50688d9f64..e4d243622e 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -41,6 +41,7 @@ Network Interface Controller Drivers
igc
intel_vf
ionic
+ ioring
ipn3ke
ixgbe
mana
diff --git a/doc/guides/nics/ioring.rst b/doc/guides/nics/ioring.rst
new file mode 100644
index 0000000000..7d37a6bb37
--- /dev/null
+++ b/doc/guides/nics/ioring.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+IORING Poll Mode Driver
+=======================
+
+The IORING Poll Mode Driver (PMD) is a simplified and improved version of the TAP PMD.
+It is a virtual device that uses the Linux io_uring API to inject packets into the Linux kernel.
+It is useful for DPDK applications that need to interact
+with the Linux TCP/IP stack for control plane or tunneling purposes.
+
+The IORING PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the IORING device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query for information, statistics,
+and send/receive packets.
+
+Requirements
+------------
+
+The IORING PMD requires the io_uring library (liburing), which provides the
+helper functions used to manage io_uring rings with the kernel.
+
+For more info on io_uring, please see:
+
+https://kernel.dk/io_uring.pdf
+
+
+Arguments
+---------
+
+IORING devices are created with the command line ``--vdev=net_ioring0`` option.
+This option may be specified more than once by repeating with a different ``net_ioringX`` device.
+
+By default, the Linux interfaces are named ``enio0``, ``enio1``, etc.
+The interface name can be specified by adding the ``iface=foo0`` argument, for example::
+
+ --vdev=net_ioring0,iface=io0 --vdev=net_ioring1,iface=io1, ...
+
+The PMD inherits the MAC address assigned by the kernel which will be
+a locally assigned random Ethernet address.
+
+Normally, when the DPDK application exits, the IORING device is removed.
+But this behavior can be overridden with the ``persist`` flag, for example::
+
+ --vdev=net_ioring0,iface=io0,persist ...
+
+
+Multi-process sharing
+---------------------
+
+The IORING device does not support secondary processes (yet).
+
+
+Limitations
+-----------
+
+- This driver requires kernel io_uring support, which was added in Linux
+ kernel version 5.1. Also, io_uring may be disabled in some environments
+ or by security policies.
+
+- Since the IORING device uses a file descriptor to talk to the kernel,
+ the same number of queues must be specified for receive and transmit.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
diff --git a/drivers/net/ioring/meson.build b/drivers/net/ioring/meson.build
new file mode 100644
index 0000000000..264554d069
--- /dev/null
+++ b/drivers/net/ioring/meson.build
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+dep = dependency('liburing', required:false)
+reason = 'missing dependency, "liburing"'
+build = dep.found()
+ext_deps += dep
+
+sources = files('rte_eth_ioring.c')
+require_iova_in_mbuf = false
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
new file mode 100644
index 0000000000..7b62c47f54
--- /dev/null
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) Stephen Hemminger
+ */
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define IORING_DEFAULT_IFNAME "enio%d"
+
+RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
+#define RTE_LOGTYPE_IORING ioring_logtype
+#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+
+#define IORING_IFACE_ARG "iface"
+#define IORING_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ IORING_IFACE_ARG,
+ IORING_PERSIST_ARG,
+ NULL
+};
+
+struct pmd_internals {
+ int keep_fd; /* keep alive file descriptor */
+ char ifname[IFNAMSIZ]; /* name assigned by kernel */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+/* Creates a new tap device, name returned in ifr */
+static int
+tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "Open %s failed: %s", tun_dev, strerror(errno));
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNGETFEATURES) %s", strerror(errno));
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ if ((features & flags) != flags) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, flags & ~features);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+ /* If kernel supports using NAPI enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+ * Sets the device name and packet format.
+ * Do not want the protocol information (PI)
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETIFF) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETPERSIST) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+eth_dev_close(struct rte_eth_dev *dev)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing %s", pmd->ifname);
+
+ if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+ return 0;
+
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+
+ return 0;
+}
+
+static const struct eth_dev_ops ops = {
+ .dev_close = eth_dev_close,
+};
+
+static int
+ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct pmd_internals *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+
+ data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ dev->dev_ops = &ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { };
+ pmd->keep_fd = tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ strlcpy(pmd->ifname, ifr.ifr_name, IFNAMSIZ);
+
+ /* Read the MAC address assigned by the kernel */
+ if (ioctl(pmd->keep_fd, SIOCGIFHWADDR, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to get MAC address for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+ memcpy(&pmd->eth_addr, &ifr.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to detach keep-alive queue for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+ return 0;
+
+error:
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
+static int
+parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+ /* value must be a non-empty string shorter than IFNAMSIZ */
+ if (value == NULL || value[0] == '\0' ||
+ strnlen(value, IFNAMSIZ) == IFNAMSIZ)
+ return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+ioring_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+ return -1; /* TODO */
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, IORING_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, IORING_IFACE_ARG,
+ &parse_iface_arg, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, IORING_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ if (ioring_create(eth_dev, tap_name, persist) < 0)
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+
+error:
+ if (eth_dev != NULL)
+ rte_eth_dev_release_port(eth_dev);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+ioring_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ eth_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_ioring_drv = {
+ .probe = ioring_probe,
+ .remove = ioring_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ioring, pmd_ioring_drv);
+RTE_PMD_REGISTER_ALIAS(net_ioring, eth_ioring);
+RTE_PMD_REGISTER_PARAM_STRING(net_ioring, IORING_IFACE_ARG "=<string> ");
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index dafd637ba4..b68a55e916 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -33,6 +33,7 @@ drivers = [
'idpf',
'igc',
'ionic',
+ 'ioring',
'ipn3ke',
'ixgbe',
'mana',
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 2/8] net/ioring: implement link state
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 1/8] net/ioring: introduce new driver Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 3/8] net/ioring: implement control functions Stephen Hemminger
` (5 subsequent siblings)
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add hooks to set kernel link up/down and report state.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 84 +++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index c4c57caaa4..d4bf70cb4f 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 7b62c47f54..fa3e748cda 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -47,6 +47,53 @@ struct pmd_internals {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+static int
+eth_dev_change_flags(struct rte_eth_dev *dev, uint16_t flags, uint16_t mask)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret < 0)
+ goto error;
+
+ /* NB: ifr.ifr_flags is type short */
+ ifr.ifr_flags &= mask;
+ ifr.ifr_flags |= flags;
+
+ ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
+error:
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+static int
+eth_dev_get_flags(struct rte_eth_dev *dev, short *flags)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret == 0)
+ *flags = ifr.ifr_flags;
+
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
/* Creates a new tap device, name returned in ifr */
static int
tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
@@ -103,6 +150,39 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+
+static int
+eth_dev_set_link_up(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_UP, 0);
+}
+
+static int
+eth_dev_set_link_down(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_UP);
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rte_eth_link *eth_link = &dev->data->dev_link;
+ short flags = 0;
+
+ if (eth_dev_get_flags(dev, &flags) < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCGIFFLAGS): %s", strerror(errno));
+ return -1;
+ }
+
+ *eth_link = (struct rte_eth_link) {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_status = (flags & IFF_UP) ? RTE_ETH_LINK_UP : RTE_ETH_LINK_DOWN,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ };
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -126,8 +206,12 @@ eth_dev_close(struct rte_eth_dev *dev)
static const struct eth_dev_ops ops = {
.dev_close = eth_dev_close,
+ .link_update = eth_link_update,
+ .dev_set_link_up = eth_dev_set_link_up,
+ .dev_set_link_down = eth_dev_set_link_down,
};
+
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 3/8] net/ioring: implement control functions
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 1/8] net/ioring: introduce new driver Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 2/8] net/ioring: implement link state Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 4/8] net/ioring: implement management functions Stephen Hemminger
` (4 subsequent siblings)
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
These internal ops just force the corresponding changes on the kernel-visible net device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 3 ++
drivers/net/ioring/rte_eth_ioring.c | 69 +++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index d4bf70cb4f..199c7cd31c 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -5,6 +5,9 @@
;
[Features]
Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index fa3e748cda..de10a4d83f 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -13,6 +13,7 @@
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if.h>
+#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <bus_vdev_driver.h>
@@ -163,6 +164,30 @@ eth_dev_set_link_down(struct rte_eth_dev *dev)
return eth_dev_change_flags(dev, 0, ~IFF_UP);
}
+static int
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_PROMISC, ~0);
+}
+
+static int
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_PROMISC);
+}
+
+static int
+eth_dev_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_ALLMULTI, ~0);
+}
+
+static int
+eth_dev_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_ALLMULTI);
+}
+
static int
eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
{
@@ -183,6 +208,44 @@ eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
return 0;
 }
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { .ifr_mtu = mtu };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
+static int
+eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+ ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+ memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -209,6 +272,12 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
.dev_set_link_down = eth_dev_set_link_down,
+ .mac_addr_set = eth_dev_macaddr_set,
+ .mtu_set = eth_dev_mtu_set,
+ .promiscuous_enable = eth_dev_promiscuous_enable,
+ .promiscuous_disable = eth_dev_promiscuous_disable,
+ .allmulticast_enable = eth_dev_allmulticast_enable,
+ .allmulticast_disable = eth_dev_allmulticast_disable,
};
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 4/8] net/ioring: implement management functions
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (2 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 3/8] net/ioring: implement control functions Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 5/8] net/ioring: implement primary secondary fd passing Stephen Hemminger
` (3 subsequent siblings)
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add start, stop, configure and info functions.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 72 ++++++++++++++++++++++++++---
1 file changed, 66 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index de10a4d83f..b5d9c12bdf 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -212,12 +212,16 @@ static int
eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
{
struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
struct ifreq ifr = { .ifr_mtu = mtu };
- int ret;
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
- ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ int ret = ioctl(sock, SIOCSIFMTU, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
ret = -errno;
@@ -230,14 +234,17 @@ static int
eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
{
struct pmd_internals *pmd = dev->data->dev_private;
- struct ifreq ifr = { };
- int ret;
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
- ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ int ret = ioctl(sock, SIOCSIFHWADDR, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
ret = -errno;
@@ -246,6 +253,56 @@ eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
return ret;
}
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ eth_dev_set_link_up(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ eth_dev_set_link_down(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev)
+{
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ dev_info->if_index = if_nametoindex(pmd->ifname);
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -263,11 +320,14 @@ eth_dev_close(struct rte_eth_dev *dev)
close(pmd->keep_fd);
pmd->keep_fd = -1;
}
-
return 0;
}
static const struct eth_dev_ops ops = {
+ .dev_start = eth_dev_start,
+ .dev_stop = eth_dev_stop,
+ .dev_configure = eth_dev_configure,
+ .dev_infos_get = eth_dev_info,
.dev_close = eth_dev_close,
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 5/8] net/ioring: implement primary secondary fd passing
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (3 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 4/8] net/ioring: implement management functions Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 6/8] net/ioring: implement receive and transmit Stephen Hemminger
` (2 subsequent siblings)
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for communicating fds from the primary to secondary processes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 136 +++++++++++++++++++++++++++-
2 files changed, 135 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index 199c7cd31c..da47062adb 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -8,6 +8,7 @@ Link status = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
+Multiprocess aware = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index b5d9c12bdf..ddef57adfb 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -28,6 +28,7 @@
#include <rte_log.h>
#define IORING_DEFAULT_IFNAME "enio%d"
+#define IORING_MP_KEY "ioring_mp_send_fds"
RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
@@ -400,6 +401,84 @@ parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_arg
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+ioring_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, IORING_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name);
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+ struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+ struct rte_mp_reply replies;
+ int ret = rte_mp_request_sync(&request, &replies, &timeout);
+ if (ret < 0 || replies.nb_received != 1) {
+ PMD_LOG(ERR, "Failed to request fds from primary: %s",
+ rte_strerror(rte_errno));
+ return -1;
+ }
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++)
+ fds[i] = reply->fds[i];
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+ioring_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (!dev) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the tap fd for each queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++)
+ reply.fds[reply.num_fds++] = fds[i];
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name);
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
ioring_probe(struct rte_vdev_device *vdev)
{
@@ -407,14 +486,43 @@ ioring_probe(struct rte_vdev_device *vdev)
const char *params = rte_vdev_device_args(vdev);
struct rte_kvargs *kvlist = NULL;
struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
uint8_t persist = 0;
int ret;
PMD_LOG(INFO, "Initializing %s", name);
- if (rte_eal_process_type() == RTE_PROC_SECONDARY)
- return -1; /* TODO */
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (!eth_dev) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &ops;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ return -1;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ return -1;
+ }
+
+ eth_dev->process_private = fds;
+
+ if (ioring_request_fds(name, eth_dev))
+ return -1;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
@@ -432,21 +540,45 @@ ioring_probe(struct rte_vdev_device *vdev)
persist = 1;
}
+ /* Per-queue tap fds (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ return -1;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
if (eth_dev == NULL) {
PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
goto error;
}
+ eth_dev->data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->dev_ops = &ops;
+ eth_dev->process_private = fds;
+
if (ioring_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+ static unsigned int ioring_dev_count;
+ if (ioring_dev_count == 0) {
+ if (rte_mp_action_register(IORING_MP_KEY, ioring_mp_send_fds) < 0) {
+ PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+ rte_strerror(rte_errno));
+ goto error;
+ }
+ }
+ ++ioring_dev_count;
rte_eth_dev_probing_finish(eth_dev);
return 0;
error:
if (eth_dev != NULL)
rte_eth_dev_release_port(eth_dev);
+ free(fds);
rte_kvargs_free(kvlist);
return -1;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 6/8] net/ioring: implement receive and transmit
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (4 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 5/8] net/ioring: implement primary secondary fd passing Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 7/8] net/ioring: add VLAN support Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 8/8] net/ioring: implement statistics Stephen Hemminger
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use io_uring to read and write packets on the TAP device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 365 +++++++++++++++++++++++++++-
1 file changed, 364 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index ddef57adfb..8dd717cb9d 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -2,6 +2,7 @@
* Copyright (c) Stephen Hemminger
*/
+#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
@@ -9,8 +10,10 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <liburing.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
+#include <sys/uio.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_arp.h>
@@ -27,6 +30,13 @@
#include <rte_kvargs.h>
#include <rte_log.h>
+#define IORING_DEFAULT_BURST 64
+#define IORING_NUM_BUFFERS 1024
+#define IORING_MAX_QUEUES 128
+
+static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define IORING_DEFAULT_IFNAME "enio%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -34,6 +44,20 @@ RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
#define IORING_IFACE_ARG "iface"
#define IORING_PERSIST_ARG "persist"
@@ -43,6 +67,30 @@ static const char * const valid_arguments[] = {
NULL
};
+struct rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted read's */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_nombuf;
+ uint64_t rx_errors;
+};
+
+struct tx_queue {
+ struct io_uring io_ring;
+
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+};
+
struct pmd_internals {
int keep_fd; /* keep alive file descriptor */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
@@ -300,6 +348,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->if_index = if_nametoindex(pmd->ifname);
dev_info->max_mac_addrs = 1;
dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+ dev_info->max_rx_queues = IORING_MAX_QUEUES;
+ dev_info->max_tx_queues = IORING_MAX_QUEUES;
+ dev_info->min_rx_bufsize = 0;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = IORING_DEFAULT_BURST,
+ .ring_size = IORING_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
return 0;
}
@@ -311,6 +368,14 @@ eth_dev_close(struct rte_eth_dev *dev)
PMD_LOG(INFO, "Closing %s", pmd->ifname);
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ if (fds[i] == -1)
+ continue;
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
@@ -324,6 +389,297 @@ eth_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Setup another fd to TAP device for the queue */
+static int
+eth_queue_setup(struct rte_eth_dev *dev, const char *name, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+
+ if (fds[queue_id] != -1)
+ return 0; /* already setup */
+
+ struct ifreq ifr = { };
+ int tap_fd = tap_open(name, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+static int
+eth_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+
+ return fds[queue_id];
+}
+
+/* set up a submit queue entry to read into an mbuf */
+static inline void
+eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+
+ if (unlikely(sqe == NULL)) {
+ PMD_LOG(DEBUG, "io_uring no rx sqe");
+ rxq->rx_errors++;
+ } else {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ size_t len = mb->buf_len;
+
+ io_uring_prep_read(sqe, fd, base, len, 0);
+ io_uring_sqe_set_data(sqe, mb);
+ }
+}
+
+static uint16_t
+eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ ssize_t len = cqe->res;
+
+ PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
+ num_cqe++;
+
+ if (unlikely(len < RTE_ETHER_HDR_LEN)) {
+ if (len < 0)
+ PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_LOG(ERR, "io_uring_read missing hdr");
+
+ rxq->rx_errors++;
+ goto resubmit;
+ }
+
+ struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(nmb == NULL)) {
+ PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ ++rxq->rx_nombuf;
+ goto resubmit;
+ }
+
+ mb->pkt_len = len;
+ mb->data_len = len;
+ mb->port = rxq->port_id;
+ __rte_mbuf_sanity_check(mb, 1);
+
+ num_bytes += len;
+ bufs[num_rx++] = mb;
+
+ mb = nmb;
+resubmit:
+ eth_rx_submit(rxq, fd, mb);
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ /* push the resubmitted read requests to the kernel */
+ if (num_cqe > 0)
+ io_uring_submit(&rxq->io_ring);
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+ return num_rx;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ /* open the per-queue tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ struct rte_mbuf *mbufs[nb_rx_desc];
+ if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
+ PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
+ return -1;
+ }
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ goto error;
+ }
+
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ struct rte_mbuf *mb = mbufs[i];
+
+ eth_rx_submit(rxq, fd, mb);
+ }
+
+ io_uring_submit(&rxq->io_ring);
+ return 0;
+
+error:
+ rte_pktmbuf_free_bulk(mbufs, nb_rx_desc);
+ return -1;
+}
+
+static void
+eth_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&rxq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&rxq->io_ring);
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ /* open the per-queue tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+eth_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&txq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&txq->io_ring);
+}
+
+static void
+eth_ioring_tx_cleanup(struct tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head, completed = 0;
+ unsigned int tx_done = 0;
+ uint64_t tx_bytes = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ if (unlikely(cqe->res < 0)) {
+ ++txq->tx_errors;
+ } else {
+ ++tx_done;
+ tx_bytes += mb->pkt_len;
+ }
+ rte_pktmbuf_free(mb);
+ ++completed;
+ }
+ io_uring_cq_advance(&txq->io_ring, completed);
+
+ txq->tx_packets += tx_done;
+ txq->tx_bytes += tx_bytes;
+}
+
+static uint16_t
+eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct tx_queue *txq = queue;
+ uint16_t num_tx;
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
+ eth_ioring_tx_cleanup(txq);
+
+ int fd = eth_queue_fd(txq->port_id, txq->queue_id);
+
+ for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
+ struct rte_mbuf *mb = bufs[num_tx];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL)
+ break; /* submit ring is full */
+
+ io_uring_sqe_set_data(sqe, mb);
+
+ if (rte_mbuf_refcnt_read(mb) == 1 &&
+ RTE_MBUF_DIRECT(mb) && mb->nb_segs == 1) {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+
+ PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
+ } else {
+ PMD_LOG(ERR, "multi-segment or shared mbufs not supported yet");
+ ++txq->tx_errors;
+ continue;
+ }
+ }
+ if (num_tx > 0)
+ io_uring_submit(&txq->io_ring);
+
+ return num_tx;
+}
+
static const struct eth_dev_ops ops = {
.dev_start = eth_dev_start,
.dev_stop = eth_dev_stop,
@@ -339,9 +695,12 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .rx_queue_setup = eth_rx_queue_setup,
+ .rx_queue_release = eth_rx_queue_release,
+ .tx_queue_setup = eth_tx_queue_setup,
+ .tx_queue_release = eth_tx_queue_release,
};
-
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
@@ -379,6 +738,10 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
}
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+
+ dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_burst = eth_ioring_tx;
+
return 0;
error:
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 7/8] net/ioring: add VLAN support
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (5 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 6/8] net/ioring: implement receive and transmit Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 8/8] net/ioring: implement statistics Stephen Hemminger
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for VLAN tag insertion and stripping.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 45 +++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 8dd717cb9d..a7f38e0d5c 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -34,6 +34,8 @@
#define IORING_NUM_BUFFERS 1024
#define IORING_MAX_QUEUES 128
+#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_VLAN_INSERT
+#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_VLAN_STRIP
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
@@ -70,6 +72,7 @@ static const char * const valid_arguments[] = {
struct rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted read's */
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -81,6 +84,7 @@ struct rx_queue {
struct tx_queue {
struct io_uring io_ring;
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -471,6 +475,9 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
goto resubmit;
}
+ if (rxq->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
+ rte_vlan_strip(mb);
+
mb->pkt_len = len;
mb->data_len = len;
mb->port = rxq->port_id;
@@ -495,8 +502,7 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
static int
eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
- unsigned int socket_id,
- const struct rte_eth_rxconf *rx_conf __rte_unused,
+ unsigned int socket_id, const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool)
{
struct pmd_internals *pmd = dev->data->dev_private;
@@ -515,6 +521,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_de
return -1;
}
+ rxq->offloads = rx_conf->offloads;
rxq->mb_pool = mb_pool;
rxq->port_id = dev->data->port_id;
rxq->queue_id = queue_id;
@@ -583,6 +590,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
txq->port_id = dev->data->port_id;
txq->queue_id = queue_id;
+ txq->offloads = tx_conf->offloads;
txq->free_thresh = tx_conf->tx_free_thresh;
dev->data->tx_queues[queue_id] = txq;
@@ -637,6 +645,38 @@ eth_ioring_tx_cleanup(struct tx_queue *txq)
txq->tx_bytes += tx_bytes;
}
+static uint16_t
+eth_ioring_tx_prepare(void *tx_queue __rte_unused, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+ uint16_t nb_tx;
+ int error;
+
+ for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
+ struct rte_mbuf *m = tx_pkts[nb_tx];
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+ error = rte_validate_tx_offload(m);
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+#endif
+ /* Do VLAN tag insertion */
+ if (unlikely(m->ol_flags & RTE_MBUF_F_TX_VLAN)) {
+ error = rte_vlan_insert(&m);
+ /* rte_vlan_insert() may change pointer */
+ tx_pkts[nb_tx] = m;
+
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+ }
+ }
+
+ return nb_tx;
+}
+
static uint16_t
eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
@@ -740,6 +780,7 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_prepare = eth_ioring_tx_prepare;
dev->tx_pkt_burst = eth_ioring_tx;
return 0;
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 8/8] net/ioring: implement statistics
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
` (6 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 7/8] net/ioring: add VLAN support Stephen Hemminger
@ 2024-12-11 16:28 ` Stephen Hemminger
7 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-11 16:28 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for basic statistics
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 2 +
drivers/net/ioring/rte_eth_ioring.c | 57 +++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index da47062adb..c9a4582d0e 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -9,6 +9,8 @@ MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
Multiprocess aware = Y
+Basic stats = Y
+Stats per queue = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index a7f38e0d5c..97f7bc72d3 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -365,6 +365,61 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
return 0;
}
+static int
+eth_dev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ const struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+ stats->rx_nombuf += rxq->rx_nombuf;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_ipackets[i] = rxq->rx_packets;
+ stats->q_ibytes[i] = rxq->rx_bytes;
+ }
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ const struct tx_queue *txq = dev->data->tx_queues[i];
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_opackets[i] = txq->tx_packets;
+ stats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stats_reset(struct rte_eth_dev *dev)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_nombuf = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct tx_queue *txq = dev->data->tx_queues[i];
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -735,6 +790,8 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .stats_get = eth_dev_stats_get,
+ .stats_reset = eth_dev_stats_reset,
.rx_queue_setup = eth_rx_queue_setup,
.rx_queue_release = eth_rx_queue_release,
.tx_queue_setup = eth_tx_queue_setup,
--
2.45.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* RE: [RFC 0/8] ioring: network driver
2024-12-11 15:03 ` Stephen Hemminger
@ 2024-12-12 19:06 ` Konstantin Ananyev
2024-12-19 15:40 ` Morten Brørup
0 siblings, 1 reply; 72+ messages in thread
From: Konstantin Ananyev @ 2024-12-12 19:06 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev@dpdk.org
> > > This is first draft of new simplified TAP device that uses
> > > the Linux kernel ioring API to provide a read/write ring
> > > with kernel.
> > >
> > > This is split from tap device because there are so many
> > > unnecessary things in existing tap, and supporting ioring is
> > > better without ifdefs etc. The default name of the tap
> > > device is different that other uses in DPDK but the driver
> > > tries to keep the same relevant devargs as before.
> > >
> > > This driver will only provide features that match what kernel
> > > does, so no flow support etc. The next version will add checksum
> > > and multi-segment packets. Some of the doc files may need update
> > > as well.
> >
> > Makes sense to me, though didn't properly look inside.
> > One thing - probably add a 'tap' into the name,
> > 'tap_ioiring' or so, otherwise 'ioring' is a bit too generic
> > and might be confusing.
>
> There are some userspaces that look for "e*" in name for some setups.
Didn't get you here, pls try to re-phrase.
> But names are totally arbitrary
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [RFC 0/8] ioring: network driver
2024-12-12 19:06 ` Konstantin Ananyev
@ 2024-12-19 15:40 ` Morten Brørup
2024-12-20 14:34 ` Konstantin Ananyev
0 siblings, 1 reply; 72+ messages in thread
From: Morten Brørup @ 2024-12-19 15:40 UTC (permalink / raw)
To: Konstantin Ananyev, Stephen Hemminger; +Cc: dev
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>
> > > > This is first draft of new simplified TAP device that uses
> > > > the Linux kernel ioring API to provide a read/write ring
> > > > with kernel.
> > > >
> > > > This is split from tap device because there are so many
> > > > unnecessary things in existing tap, and supporting ioring is
> > > > better without ifdefs etc. The default name of the tap
> > > > device is different that other uses in DPDK but the driver
> > > > tries to keep the same relevant devargs as before.
> > > >
> > > > This driver will only provide features that match what kernel
> > > > does, so no flow support etc. The next version will add checksum
> > > > and multi-segment packets. Some of the doc files may need update
> > > > as well.
> > >
> > > Makes sense to me, though didn't properly look inside.
> > > One thing - probably add a 'tap' into the name,
> > > 'tap_ioiring' or so, otherwise 'ioring' is a bit too generic
> > > and might be confusing.
Konstantin is referring to the name of the driver and the source code file names, "net/ioring" -> "net/tap_ioring".
> >
> > There are some userspaces that look for "e*" in name for some setups.
Stephen is referring to the device name of an instantiated interface, e.g. "eth0".
And yes, assuming devices named "e*" are Ethernet devices is a common hack in Linux applications. I've done it myself. :-)
>
> Didn't get you here, pls try to re-phrase.
>
> > But names are totally arbitrary
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [RFC 0/8] ioring: network driver
2024-12-19 15:40 ` Morten Brørup
@ 2024-12-20 14:34 ` Konstantin Ananyev
2024-12-20 16:19 ` Stephen Hemminger
0 siblings, 1 reply; 72+ messages in thread
From: Konstantin Ananyev @ 2024-12-20 14:34 UTC (permalink / raw)
To: Morten Brørup, Stephen Hemminger; +Cc: dev@dpdk.org
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> >
> > > > > This is first draft of new simplified TAP device that uses
> > > > > the Linux kernel ioring API to provide a read/write ring
> > > > > with kernel.
> > > > >
> > > > > This is split from tap device because there are so many
> > > > > unnecessary things in existing tap, and supporting ioring is
> > > > > better without ifdefs etc. The default name of the tap
> > > > > device is different that other uses in DPDK but the driver
> > > > > tries to keep the same relevant devargs as before.
> > > > >
> > > > > This driver will only provide features that match what kernel
> > > > > does, so no flow support etc. The next version will add checksum
> > > > > and multi-segment packets. Some of the doc files may need update
> > > > > as well.
> > > >
> > > > Makes sense to me, though didn't properly look inside.
> > > > One thing - probably add a 'tap' into the name,
> > > > 'tap_ioiring' or so, otherwise 'ioring' is a bit too generic
> > > > and might be confusing.
>
> Konstantin is referring to the name of the driver and the source code file names, "net/ioring" -> "net/tap_ioring".
Yep, that's what I meant.
>
> > >
> > > There are some userspaces that look for "e*" in name for some setups.
>
> Stephen is referring to the device name of an instantiated interface, e.g. "eth0".
>
> And yes, assuming devices named "e*" are Ethernet devices is a common hack in Linux applications. I've done it myself. :-)
OK... and why should such a practice prevent us from naming the PMD itself in a way we think is appropriate?
> >
> > Didn't get you here, pls try to re-phrase.
> >
> > > But names are totally arbitrary
>
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [RFC 0/8] ioring: network driver
2024-12-20 14:34 ` Konstantin Ananyev
@ 2024-12-20 16:19 ` Stephen Hemminger
0 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2024-12-20 16:19 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: Morten Brørup, dev@dpdk.org
On Fri, 20 Dec 2024 14:34:27 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > >
> > > > > > This is first draft of new simplified TAP device that uses
> > > > > > the Linux kernel ioring API to provide a read/write ring
> > > > > > with kernel.
> > > > > >
> > > > > > This is split from tap device because there are so many
> > > > > > unnecessary things in existing tap, and supporting ioring is
> > > > > > better without ifdefs etc. The default name of the tap
> > > > > > device is different that other uses in DPDK but the driver
> > > > > > tries to keep the same relevant devargs as before.
> > > > > >
> > > > > > This driver will only provide features that match what kernel
> > > > > > does, so no flow support etc. The next version will add checksum
> > > > > > and multi-segment packets. Some of the doc files may need update
> > > > > > as well.
> > > > >
> > > > > Makes sense to me, though didn't properly look inside.
> > > > > One thing - probably add a 'tap' into the name,
> > > > > 'tap_ioiring' or so, otherwise 'ioring' is a bit too generic
> > > > > and might be confusing.
> >
> > Konstantin is referring to the name of the driver and the source code file names, "net/ioring" -> "net/tap_ioring".
>
> Yep, that what I meant.
My thought is that a shorter name is better and avoids confusion. There are already multiple
drivers that create tap devices: tap and virtio_user.
> >
> > > >
> > > > There are some userspaces that look for "e*" in name for some setups.
> >
> > Stephen is referring to the device name of an instantiated interface, e.g. "eth0".
> >
> > And yes, assuming devices named "e*" are Ethernet devices is a common hack in Linux applications. I've done it myself. :-)
>
> Ok... and why such practice should prevent us to name PMD itself in a way we think is appropriate?
>
I am leaning more towards not having a default name at all. Naming policy should be handled by Linux (udev),
not DPDK. If the user wants a specific name, they can set it via devargs.
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH v2 1/8] net/ioring: introduce new driver
2024-12-11 16:28 ` [PATCH v2 1/8] net/ioring: introduce new driver Stephen Hemminger
@ 2024-12-28 16:39 ` Morten Brørup
0 siblings, 0 replies; 72+ messages in thread
From: Morten Brørup @ 2024-12-28 16:39 UTC (permalink / raw)
To: Stephen Hemminger, dev
> + int features = 0;
> + if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
> + PMD_LOG(ERR, "ioctl(TUNGETFEATURES) %s", strerror(errno));
> + goto error;
> + }
> +
> + int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
> + if ((features & flags) == 0) {
Comparison will only fail if all three flags are missing. Should be:
if ((features & flags) != flags) {
> + PMD_LOG(ERR, "TUN features %#x missing support for %#x",
> + features, features & flags);
Should be:
features, (features & flags) ^ flags);
> + goto error;
> + }
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v3 0/9] ioring PMD device
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (9 preceding siblings ...)
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 1/9] net/ioring: introduce new driver Stephen Hemminger
` (8 more replies)
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (2 subsequent siblings)
13 siblings, 9 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This is a new simplified TAP device that uses the Linux kernel
io_uring API to provide read/write rings shared with the kernel.
This is split from the tap device because the existing tap driver
carries many unnecessary features, and supporting io_uring is
cleaner without ifdefs etc. The default name of the tap
device is different from other uses in DPDK, but the driver
tries to keep the same relevant devargs as before.
This driver will only provide features that match what the kernel
does, so no flow support etc. The next version will add checksum
offload and multi-segment packets. Some of the doc files may need
updates as well.
v3 - add multi-segment support
review feedback
Stephen Hemminger (9):
net/ioring: introduce new driver
net/ioring: implement link state
net/ioring: implement control functions
net/ioring: implement management functions
net/ioring: implement secondary process support
net/ioring: implement receive and transmit
net/ioring: add VLAN support
net/ioring: implement statistics
net/ioring: support multi-segment Rx and Tx
doc/guides/nics/features/ioring.ini | 16 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 ++
drivers/net/ioring/meson.build | 15 +
drivers/net/ioring/rte_eth_ioring.c | 1129 +++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 1228 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
--
2.47.2
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v3 1/9] net/ioring: introduce new driver
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 2/9] net/ioring: implement link state Stephen Hemminger
` (7 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add basic driver initialization, device creation, and documentation.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 9 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 +++++++
drivers/net/ioring/meson.build | 15 ++
drivers/net/ioring/rte_eth_ioring.c | 262 ++++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
6 files changed, 354 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
new file mode 100644
index 0000000000..c4c57caaa4
--- /dev/null
+++ b/doc/guides/nics/features/ioring.ini
@@ -0,0 +1,9 @@
+;
+; Supported features of the 'ioring' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 10a2eca3b0..afb6bf289b 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -41,6 +41,7 @@ Network Interface Controller Drivers
igc
intel_vf
ionic
+ ioring
ipn3ke
ixgbe
mana
diff --git a/doc/guides/nics/ioring.rst b/doc/guides/nics/ioring.rst
new file mode 100644
index 0000000000..7d37a6bb37
--- /dev/null
+++ b/doc/guides/nics/ioring.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+IORING Poll Mode Driver
+=======================
+
+The IORING Poll Mode Driver (PMD) is a simplified and improved version of the TAP PMD. It is a
+virtual device that uses the Linux io_uring API to inject packets into the Linux kernel.
+It is useful when writing DPDK applications that need to interact
+with the Linux TCP/IP stack for control plane or tunneling.
+
+The IORING PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the IORING device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query information and statistics,
+and to send and receive packets.
+
+Requirements
+------------
+
+The IORING PMD requires the io_uring library (liburing), which provides the helper
+functions for managing io_uring rings with the kernel.
+
+For more info on io_uring, please see:
+
+https://kernel.dk/io_uring.pdf
+
+
+Arguments
+---------
+
+IORING devices are created with the command line ``--vdev=net_ioring0`` option.
+This option may be specified more than once by repeating with a different ``net_ioringX`` device.
+
+By default, the Linux interfaces are named ``enio0``, ``enio1``, etc.
+The interface name can be specified by adding the ``iface`` argument, for example::
+
+ --vdev=net_ioring0,iface=io0 --vdev=net_ioring1,iface=io1, ...
+
+The PMD inherits the MAC address assigned by the kernel which will be
+a locally assigned random Ethernet address.
+
+Normally, when the DPDK application exits, the IORING device is removed.
+This behavior can be overridden by using the ``persist`` flag, for example::
+
+ --vdev=net_ioring0,iface=io0,persist ...
+
+
+Multi-process sharing
+---------------------
+
+The IORING device does not support secondary processes (yet).
+
+
+Limitations
+-----------
+
+- The driver requires kernel io_uring support, which was added in Linux kernel version 5.1.
+  Also, io_uring may be disabled in some environments or by security policies.
+
+- Since the IORING device uses a file descriptor to talk to the kernel,
+ the same number of queues must be specified for receive and transmit.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
diff --git a/drivers/net/ioring/meson.build b/drivers/net/ioring/meson.build
new file mode 100644
index 0000000000..264554d069
--- /dev/null
+++ b/drivers/net/ioring/meson.build
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+dep = dependency('liburing', required:false)
+reason = 'missing dependency, "liburing"'
+build = dep.found()
+ext_deps += dep
+
+sources = files('rte_eth_ioring.c')
+require_iova_in_mbuf = false
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
new file mode 100644
index 0000000000..4d5a5174db
--- /dev/null
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) Stephen Hemminger
+ */
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define IORING_DEFAULT_IFNAME "itap%d"
+
+RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
+#define RTE_LOGTYPE_IORING ioring_logtype
+#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+
+#define IORING_IFACE_ARG "iface"
+#define IORING_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ IORING_IFACE_ARG,
+ IORING_PERSIST_ARG,
+ NULL
+};
+
+struct pmd_internals {
+ int keep_fd; /* keep alive file descriptor */
+ char ifname[IFNAMSIZ]; /* name assigned by kernel */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+/* Creates a new tap device, name returned in ifr */
+static int
+tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "Open %s failed: %s", tun_dev, strerror(errno));
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNGETFEATURES) %s", strerror(errno));
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ if ((features & flags) != flags) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, features & flags);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+ /* If kernel supports using NAPI enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+ * Sets the device name and packet format.
+ * Do not want the protocol information (PI)
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETIFF) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETPERIST) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+eth_dev_close(struct rte_eth_dev *dev)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing %s", pmd->ifname);
+
+ if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+ return 0;
+
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+
+ return 0;
+}
+
+static const struct eth_dev_ops ops = {
+ .dev_close = eth_dev_close,
+};
+
+static int
+ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct pmd_internals *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+
+ data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ dev->dev_ops = &ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { };
+ pmd->keep_fd = tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ strlcpy(pmd->ifname, ifr.ifr_name, IFNAMSIZ);
+
+ /* Read the MAC address assigned by the kernel */
+ if (ioctl(pmd->keep_fd, SIOCGIFHWADDR, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to get MAC address for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+ memcpy(&pmd->eth_addr, &ifr.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to detach keep-alive queue for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+ return 0;
+
+error:
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
+static int
+parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+ /* must not be null string */
+ if (name == NULL || name[0] == '\0' ||
+ strnlen(name, IFNAMSIZ) == IFNAMSIZ)
+ return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+ioring_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+ return -1; /* TODO */
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, IORING_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, IORING_IFACE_ARG,
+ &parse_iface_arg, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, IORING_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ if (ioring_create(eth_dev, tap_name, persist) < 0)
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+
+error:
+ if (eth_dev != NULL)
+ rte_eth_dev_release_port(eth_dev);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+ioring_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ eth_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_ioring_drv = {
+ .probe = ioring_probe,
+ .remove = ioring_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ioring, pmd_ioring_drv);
+RTE_PMD_REGISTER_ALIAS(net_ioring, eth_ioring);
+RTE_PMD_REGISTER_PARAM_STRING(net_ioring, IORING_IFACE_ARG "=<string> ");
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index 460eb69e5b..2e39136a5b 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -34,6 +34,7 @@ drivers = [
'intel/ixgbe',
'intel/cpfl', # depends on idpf, so must come after it
'ionic',
+ 'ioring',
'mana',
'memif',
'mlx4',
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 2/9] net/ioring: implement link state
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 1/9] net/ioring: introduce new driver Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 3/9] net/ioring: implement control functions Stephen Hemminger
` (6 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add hooks to set kernel link up/down and report state.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 84 +++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index c4c57caaa4..d4bf70cb4f 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 4d5a5174db..8c89497b1a 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -47,6 +47,53 @@ struct pmd_internals {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+static int
+eth_dev_change_flags(struct rte_eth_dev *dev, uint16_t flags, uint16_t mask)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret < 0)
+ goto error;
+
+ /* NB: ifr.ifr_flags is type short */
+ ifr.ifr_flags &= mask;
+ ifr.ifr_flags |= flags;
+
+ ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
+error:
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+static int
+eth_dev_get_flags(struct rte_eth_dev *dev, short *flags)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret == 0)
+ *flags = ifr.ifr_flags;
+
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+
/* Creates a new tap device, name returned in ifr */
static int
tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
@@ -103,6 +150,39 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+
+static int
+eth_dev_set_link_up(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_UP, ~0);
+}
+
+static int
+eth_dev_set_link_down(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_UP);
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rte_eth_link *eth_link = &dev->data->dev_link;
+ short flags = 0;
+
+ if (eth_dev_get_flags(dev, &flags) < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCGIFFLAGS): %s", strerror(errno));
+ return -1;
+ }
+
+ *eth_link = (struct rte_eth_link) {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_status = (flags & IFF_UP) ? RTE_ETH_LINK_UP : RTE_ETH_LINK_DOWN,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ };
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -126,8 +206,12 @@ eth_dev_close(struct rte_eth_dev *dev)
static const struct eth_dev_ops ops = {
.dev_close = eth_dev_close,
+ .link_update = eth_link_update,
+ .dev_set_link_up = eth_dev_set_link_up,
+ .dev_set_link_down = eth_dev_set_link_down,
};
+
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 3/9] net/ioring: implement control functions
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 1/9] net/ioring: introduce new driver Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 2/9] net/ioring: implement link state Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 4/9] net/ioring: implement management functions Stephen Hemminger
` (5 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
These internal ops just force changes to the kernel-visible net device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 3 ++
drivers/net/ioring/rte_eth_ioring.c | 69 +++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index d4bf70cb4f..199c7cd31c 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -5,6 +5,9 @@
;
[Features]
Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 8c89497b1a..fe3c72098c 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -13,6 +13,7 @@
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if.h>
+#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <bus_vdev_driver.h>
@@ -163,6 +164,30 @@ eth_dev_set_link_down(struct rte_eth_dev *dev)
return eth_dev_change_flags(dev, 0, ~IFF_UP);
}
+static int
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_PROMISC, ~0);
+}
+
+static int
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_PROMISC);
+}
+
+static int
+eth_dev_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_ALLMULTI, ~0);
+}
+
+static int
+eth_dev_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_ALLMULTI);
+}
+
static int
eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
{
@@ -183,6 +208,44 @@ eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
return 0;
}
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { .ifr_mtu = mtu };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
+static int
+eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+ ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+ memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -209,6 +272,12 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
.dev_set_link_down = eth_dev_set_link_down,
+ .mac_addr_set = eth_dev_macaddr_set,
+ .mtu_set = eth_dev_mtu_set,
+ .promiscuous_enable = eth_dev_promiscuous_enable,
+ .promiscuous_disable = eth_dev_promiscuous_disable,
+ .allmulticast_enable = eth_dev_allmulticast_enable,
+ .allmulticast_disable = eth_dev_allmulticast_disable,
};
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 4/9] net/ioring: implement management functions
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (2 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 3/9] net/ioring: implement control functions Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 5/9] net/ioring: implement secondary process support Stephen Hemminger
` (4 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add start, stop, configure and info functions.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 72 ++++++++++++++++++++++++++---
1 file changed, 66 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index fe3c72098c..b5b5ffdee3 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -212,12 +212,16 @@ static int
eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
{
struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
struct ifreq ifr = { .ifr_mtu = mtu };
- int ret;
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
- ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ int ret = ioctl(sock, SIOCSIFMTU, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
ret = -errno;
@@ -230,14 +234,17 @@ static int
eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
{
struct pmd_internals *pmd = dev->data->dev_private;
- struct ifreq ifr = { };
- int ret;
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
- ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ int ret = ioctl(sock, SIOCSIFHWADDR, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
ret = -errno;
@@ -246,6 +253,56 @@ eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
return ret;
}
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ eth_dev_set_link_up(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ eth_dev_set_link_down(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev)
+{
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ dev_info->if_index = if_nametoindex(pmd->ifname);
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -263,11 +320,14 @@ eth_dev_close(struct rte_eth_dev *dev)
close(pmd->keep_fd);
pmd->keep_fd = -1;
}
-
return 0;
}
static const struct eth_dev_ops ops = {
+ .dev_start = eth_dev_start,
+ .dev_stop = eth_dev_stop,
+ .dev_configure = eth_dev_configure,
+ .dev_infos_get = eth_dev_info,
.dev_close = eth_dev_close,
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 5/9] net/ioring: implement secondary process support
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (3 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 4/9] net/ioring: implement management functions Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 6/9] net/ioring: implement receive and transmit Stephen Hemminger
` (3 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for passing fds from the primary process to secondary processes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 136 +++++++++++++++++++++++++++-
2 files changed, 135 insertions(+), 2 deletions(-)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index 199c7cd31c..da47062adb 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -8,6 +8,7 @@ Link status = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
+Multiprocess aware = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index b5b5ffdee3..f01db960a7 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -28,6 +28,7 @@
#include <rte_log.h>
#define IORING_DEFAULT_IFNAME "itap%d"
+#define IORING_MP_KEY "ioring_mp_send_fds"
RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
@@ -400,6 +401,84 @@ parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_arg
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+ioring_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, IORING_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name);
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+ struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+ struct rte_mp_reply replies;
+ int ret = rte_mp_request_sync(&request, &replies, &timeout);
+ if (ret < 0 || replies.nb_received != 1) {
+ PMD_LOG(ERR, "Failed to request fds from primary: %s",
+ rte_strerror(rte_errno));
+ return -1;
+ }
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ free(reply);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++)
+ fds[i] = reply->fds[i];
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+ioring_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (!dev) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the tap fd for each queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++)
+ reply.fds[reply.num_fds++] = fds[i];
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name);
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
ioring_probe(struct rte_vdev_device *vdev)
{
@@ -407,14 +486,43 @@ ioring_probe(struct rte_vdev_device *vdev)
const char *params = rte_vdev_device_args(vdev);
struct rte_kvargs *kvlist = NULL;
struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
uint8_t persist = 0;
int ret;
PMD_LOG(INFO, "Initializing %s", name);
- if (rte_eal_process_type() == RTE_PROC_SECONDARY)
- return -1; /* TODO */
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (!eth_dev) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &ops;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ return -1;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ return -1;
+ }
+
+ eth_dev->process_private = fds;
+
+ if (ioring_request_fds(name, eth_dev))
+ return -1;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
@@ -432,21 +540,45 @@ ioring_probe(struct rte_vdev_device *vdev)
persist = 1;
}
+ /* Per-queue tap fd's (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ return -1;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
if (eth_dev == NULL) {
PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
goto error;
}
+ eth_dev->data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->dev_ops = &ops;
+ eth_dev->process_private = fds;
+
if (ioring_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+ static unsigned int ioring_dev_count;
+ if (ioring_dev_count == 0) {
+ if (rte_mp_action_register(IORING_MP_KEY, ioring_mp_send_fds) < 0) {
+ PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+ rte_strerror(rte_errno));
+ goto error;
+ }
+ }
+ ++ioring_dev_count;
rte_eth_dev_probing_finish(eth_dev);
return 0;
error:
if (eth_dev != NULL)
rte_eth_dev_release_port(eth_dev);
+ free(fds);
rte_kvargs_free(kvlist);
return -1;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 6/9] net/ioring: implement receive and transmit
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (4 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 5/9] net/ioring: implement secondary process support Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 7/9] net/ioring: add VLAN support Stephen Hemminger
` (2 subsequent siblings)
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use io_uring to read from and write to the TAP device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 362 +++++++++++++++++++++++++++-
1 file changed, 361 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index f01db960a7..4d064c2c22 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -2,6 +2,7 @@
* Copyright (c) Stephen Hemminger
*/
+#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
@@ -9,8 +10,10 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <liburing.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
+#include <sys/uio.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_arp.h>
@@ -27,6 +30,12 @@
#include <rte_kvargs.h>
#include <rte_log.h>
+#define IORING_DEFAULT_BURST 64
+#define IORING_NUM_BUFFERS 1024
+#define IORING_MAX_QUEUES 128
+
+static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -34,6 +43,20 @@ RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
#define IORING_IFACE_ARG "iface"
#define IORING_PERSIST_ARG "persist"
@@ -43,6 +66,30 @@ static const char * const valid_arguments[] = {
NULL
};
+struct rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted read's */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_nombuf;
+ uint64_t rx_errors;
+};
+
+struct tx_queue {
+ struct io_uring io_ring;
+
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+};
+
struct pmd_internals {
int keep_fd; /* keep alive file descriptor */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
@@ -300,6 +347,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->if_index = if_nametoindex(pmd->ifname);
dev_info->max_mac_addrs = 1;
dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+ dev_info->max_rx_queues = IORING_MAX_QUEUES;
+ dev_info->max_tx_queues = IORING_MAX_QUEUES;
+ dev_info->min_rx_bufsize = 0;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = IORING_DEFAULT_BURST,
+ .ring_size = IORING_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
return 0;
}
@@ -311,6 +367,14 @@ eth_dev_close(struct rte_eth_dev *dev)
PMD_LOG(INFO, "Closing %s", pmd->ifname);
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ if (fds[i] == -1)
+ continue;
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
@@ -324,6 +388,295 @@ eth_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Open another fd to the TAP device for this queue */
+static int
+eth_queue_setup(struct rte_eth_dev *dev, const char *name, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+
+ if (fds[queue_id] != -1)
+ return 0; /* already setup */
+
+ struct ifreq ifr = { };
+ int tap_fd = tap_open(name, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+static int
+eth_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+
+ return fds[queue_id];
+}
+
+/* post a submit queue entry to read into an mbuf */
+static inline void
+eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+
+ if (unlikely(sqe == NULL)) {
+ PMD_LOG(DEBUG, "io_uring no rx sqe");
+ rxq->rx_errors++;
+ rte_pktmbuf_free(mb);
+ } else {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ size_t len = mb->buf_len;
+
+ io_uring_prep_read(sqe, fd, base, len, 0);
+ io_uring_sqe_set_data(sqe, mb);
+ }
+}
+
+static uint16_t
+eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ ssize_t len = cqe->res;
+
+ PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
+ num_cqe++;
+
+ if (unlikely(len < RTE_ETHER_HDR_LEN)) {
+ if (len < 0)
+ PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_LOG(ERR, "io_uring_read missing hdr");
+
+ rxq->rx_errors++;
+ goto resubmit;
+ }
+
+ struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(nmb == NULL)) {
+ PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ ++rxq->rx_nombuf;
+ goto resubmit;
+ }
+
+ mb->pkt_len = len;
+ mb->data_len = len;
+ mb->port = rxq->port_id;
+ __rte_mbuf_sanity_check(mb, 1);
+
+ num_bytes += len;
+ bufs[num_rx++] = mb;
+
+ mb = nmb;
+resubmit:
+ eth_rx_submit(rxq, fd, mb);
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+ return num_rx;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ /* open the tap fd shared by this rx/tx queue pair; may already be set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ struct rte_mbuf **mbufs = alloca(nb_rx_desc * sizeof(struct rte_mbuf *));
+ if (mbufs == NULL) {
+ PMD_LOG(ERR, "alloca for %u failed", nb_rx_desc);
+ return -1;
+ }
+
+ if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
+ PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
+ return -1;
+ }
+
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+ for (uint16_t i = 0; i < nb_rx_desc; i++)
+ eth_rx_submit(rxq, fd, mbufs[i]);
+
+ io_uring_submit(&rxq->io_ring);
+ return 0;
+}
+
+static void
+eth_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&rxq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&rxq->io_ring);
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ /* open the tap fd shared by this rx/tx queue pair; may already be set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+eth_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&txq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&txq->io_ring);
+}
+
+static void
+eth_ioring_tx_cleanup(struct tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ unsigned int num_cqe = 0, tx_done = 0;
+ uint64_t tx_bytes = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ num_cqe++;
+ if (unlikely(cqe->res < 0)) {
+ ++txq->tx_errors;
+ } else {
+ ++tx_done;
+ tx_bytes += mb->pkt_len;
+ }
+
+ rte_pktmbuf_free(mb);
+ }
+ io_uring_cq_advance(&txq->io_ring, num_cqe);
+
+ txq->tx_packets += tx_done;
+ txq->tx_bytes += tx_bytes;
+}
+
+static uint16_t
+eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct tx_queue *txq = queue;
+ uint16_t num_tx;
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
+ eth_ioring_tx_cleanup(txq);
+
+ int fd = eth_queue_fd(txq->port_id, txq->queue_id);
+
+ for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
+ struct rte_mbuf *mb = bufs[num_tx];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL)
+ break; /* submit ring is full */
+
+ io_uring_sqe_set_data(sqe, mb);
+
+ if (rte_mbuf_refcnt_read(mb) == 1 &&
+ RTE_MBUF_DIRECT(mb) && mb->nb_segs == 1) {
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+
+ PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
+ } else {
+ PMD_LOG(ERR, "Can't do mbuf without space yet!");
+ ++txq->tx_errors;
+ continue;
+ }
+ }
+ if (num_tx > 0)
+ io_uring_submit(&txq->io_ring);
+
+ return num_tx;
+}
+
static const struct eth_dev_ops ops = {
.dev_start = eth_dev_start,
.dev_stop = eth_dev_stop,
@@ -339,9 +692,12 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .rx_queue_setup = eth_rx_queue_setup,
+ .rx_queue_release = eth_rx_queue_release,
+ .tx_queue_setup = eth_tx_queue_setup,
+ .tx_queue_release = eth_tx_queue_release,
};
-
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
@@ -379,6 +735,10 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
}
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+
+ dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_burst = eth_ioring_tx;
+
return 0;
error:
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 7/9] net/ioring: add VLAN support
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (5 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 6/9] net/ioring: implement receive and transmit Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 8/9] net/ioring: implement statistics Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 9/9] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for VLAN insertion and stripping.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 47 +++++++++++++++++++++++++++--
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 4d064c2c22..d473c3fcbb 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -33,9 +33,11 @@
#define IORING_DEFAULT_BURST 64
#define IORING_NUM_BUFFERS 1024
#define IORING_MAX_QUEUES 128
-
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_VLAN_INSERT
+#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_VLAN_STRIP
+
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -69,6 +71,7 @@ static const char * const valid_arguments[] = {
struct rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted read's */
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -80,6 +83,7 @@ struct rx_queue {
struct tx_queue {
struct io_uring io_ring;
+ uint64_t offloads;
uint16_t port_id;
uint16_t queue_id;
@@ -471,6 +475,9 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
goto resubmit;
}
+ if (rxq->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
+ rte_vlan_strip(mb);
+
mb->pkt_len = len;
mb->data_len = len;
mb->port = rxq->port_id;
@@ -495,8 +502,7 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
static int
eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
- unsigned int socket_id,
- const struct rte_eth_rxconf *rx_conf __rte_unused,
+ unsigned int socket_id, const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool)
{
struct pmd_internals *pmd = dev->data->dev_private;
@@ -515,6 +521,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_de
return -1;
}
+ rxq->offloads = rx_conf->offloads;
rxq->mb_pool = mb_pool;
rxq->port_id = dev->data->port_id;
rxq->queue_id = queue_id;
@@ -580,6 +587,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
txq->port_id = dev->data->port_id;
txq->queue_id = queue_id;
+ txq->offloads = tx_conf->offloads;
txq->free_thresh = tx_conf->tx_free_thresh;
dev->data->tx_queues[queue_id] = txq;
@@ -634,6 +642,38 @@ eth_ioring_tx_cleanup(struct tx_queue *txq)
txq->tx_bytes += tx_bytes;
}
+static uint16_t
+eth_ioring_tx_prepare(void *tx_queue __rte_unused, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+ uint16_t nb_tx;
+ int error;
+
+ for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
+ struct rte_mbuf *m = tx_pkts[nb_tx];
+
+#ifdef RTE_LIBRTE_ETHDEV_DEBUG
+ error = rte_validate_tx_offload(m);
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+#endif
+ /* Do VLAN tag insertion */
+ if (unlikely(m->ol_flags & RTE_MBUF_F_TX_VLAN)) {
+ error = rte_vlan_insert(&m);
+ /* rte_vlan_insert() may change pointer */
+ tx_pkts[nb_tx] = m;
+
+ if (unlikely(error)) {
+ rte_errno = -error;
+ break;
+ }
+ }
+ }
+
+ return nb_tx;
+}
+
static uint16_t
eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
@@ -737,6 +777,7 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_prepare = eth_ioring_tx_prepare;
dev->tx_pkt_burst = eth_ioring_tx;
return 0;
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 8/9] net/ioring: implement statistics
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (6 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 7/9] net/ioring: add VLAN support Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 9/9] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for basic statistics.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 2 +
drivers/net/ioring/rte_eth_ioring.c | 57 +++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index da47062adb..c9a4582d0e 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -9,6 +9,8 @@ MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
Multiprocess aware = Y
+Basic stats = Y
+Stats per queue = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index d473c3fcbb..83446dc660 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -364,6 +364,61 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
return 0;
}
+static int
+eth_dev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ const struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+ stats->rx_nombuf += rxq->rx_nombuf;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_ipackets[i] = rxq->rx_packets;
+ stats->q_ibytes[i] = rxq->rx_bytes;
+ }
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ const struct tx_queue *txq = dev->data->tx_queues[i];
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_opackets[i] = txq->tx_packets;
+ stats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stats_reset(struct rte_eth_dev *dev)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_nombuf = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct tx_queue *txq = dev->data->tx_queues[i];
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -732,6 +787,8 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .stats_get = eth_dev_stats_get,
+ .stats_reset = eth_dev_stats_reset,
.rx_queue_setup = eth_rx_queue_setup,
.rx_queue_release = eth_rx_queue_release,
.tx_queue_setup = eth_tx_queue_setup,
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 9/9] net/ioring: support multi-segment Rx and Tx
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
` (7 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 8/9] net/ioring: implement statistics Stephen Hemminger
@ 2025-03-11 23:51 ` Stephen Hemminger
8 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-11 23:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use readv/writev to handle multi-segment transmit and receive.
Account for the virtio header that will be used for offloads later.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 140 ++++++++++++++++++++--------
1 file changed, 102 insertions(+), 38 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 83446dc660..a803a9820b 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -18,6 +18,7 @@
#include <linux/if.h>
#include <linux/if_arp.h>
#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
#include <bus_vdev_driver.h>
#include <ethdev_driver.h>
@@ -35,8 +36,11 @@
#define IORING_MAX_QUEUES 128
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
-#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_VLAN_INSERT
-#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_VLAN_STRIP
+#define IORING_TX_OFFLOAD (RTE_ETH_TX_OFFLOAD_VLAN_INSERT | \
+ RTE_ETH_TX_OFFLOAD_MULTI_SEGS)
+
+#define IORING_RX_OFFLOAD (RTE_ETH_RX_OFFLOAD_VLAN_STRIP | \
+ RTE_ETH_RX_OFFLOAD_SCATTER)
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -166,7 +170,7 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
goto error;
}
- int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI | IFF_VNET_HDR;
if ((features & flags) != flags) {
PMD_LOG(ERR, "TUN features %#x missing support for %#x",
features, features & flags);
@@ -354,6 +358,8 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->max_rx_queues = IORING_MAX_QUEUES;
dev_info->max_tx_queues = IORING_MAX_QUEUES;
dev_info->min_rx_bufsize = 0;
+ dev_info->tx_queue_offload_capa = IORING_TX_OFFLOAD;
+ dev_info->tx_offload_capa = dev_info->tx_queue_offload_capa;
dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
.burst_size = IORING_DEFAULT_BURST,
@@ -487,13 +493,44 @@ eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
PMD_LOG(DEBUG, "io_uring no rx sqe");
rxq->rx_errors++;
rte_pktmbuf_free(mb);
- } else {
- void *base = rte_pktmbuf_mtod(mb, void *);
- size_t len = mb->buf_len;
+ return;
+ }
- io_uring_prep_read(sqe, fd, base, len, 0);
- io_uring_sqe_set_data(sqe, mb);
+	RTE_VERIFY(mb->nb_segs < IOV_MAX);
+	struct iovec iovs[IOV_MAX];
+	struct rte_mbuf *seg = mb;
+	for (uint16_t i = 0; i < mb->nb_segs; i++) {
+		iovs[i].iov_base = rte_pktmbuf_mtod(seg, void *);
+		iovs[i].iov_len = rte_pktmbuf_tailroom(seg);
+		seg = seg->next;
}
+ io_uring_sqe_set_data(sqe, mb);
+ io_uring_prep_readv(sqe, fd, iovs, mb->nb_segs, 0);
+}
+
+static struct rte_mbuf *
+eth_ioring_rx_alloc(struct rx_queue *rxq)
+{
+ const struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+ int buf_size = dev->data->mtu + sizeof(struct virtio_net_hdr);
+ struct rte_mbuf *m = NULL;
+ struct rte_mbuf **tail = &m;
+
+ do {
+ struct rte_mbuf *seg = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(seg == NULL)) {
+ rte_pktmbuf_free(m);
+ return NULL;
+ }
+ *tail = seg;
+ tail = &seg->next;
+ if (seg != m)
+ ++m->nb_segs;
+
+ buf_size -= rte_pktmbuf_tailroom(seg);
+ } while (buf_size > 0);
+
+ return m;
}
static uint16_t
@@ -513,7 +550,8 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
num_cqe++;
- if (unlikely(len < RTE_ETHER_HDR_LEN)) {
+ struct virtio_net_hdr *hdr;
+ if (unlikely(len < (ssize_t)(sizeof(*hdr) + RTE_ETHER_HDR_LEN))) {
if (len < 0)
PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
else
@@ -523,19 +561,31 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
goto resubmit;
}
- struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
- if (unlikely(nmb == 0)) {
- PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ hdr = rte_pktmbuf_mtod(mb, struct virtio_net_hdr *);
+
+ struct rte_mbuf *nmb = eth_ioring_rx_alloc(rxq);
+ if (!nmb) {
++rxq->rx_nombuf;
goto resubmit;
}
- if (rxq->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
- rte_vlan_strip(mb);
+ len -= sizeof(*hdr);
+ mb->data_off += sizeof(*hdr);
mb->pkt_len = len;
- mb->data_len = len;
mb->port = rxq->port_id;
+ struct rte_mbuf *seg = mb;
+		do {
+			seg->data_len = RTE_MIN(len, seg->buf_len);
+			len -= seg->data_len;
+			seg = seg->next;
+		} while (len > 0 && seg != NULL);
+
+ RTE_VERIFY(!(seg == NULL && len > 0));
+
+ if (rxq->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP)
+ rte_vlan_strip(mb);
+
__rte_mbuf_sanity_check(mb, 1);
num_bytes += len;
@@ -555,6 +605,7 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
return num_rx;
}
+
static int
eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
unsigned int socket_id, const struct rte_eth_rxconf *rx_conf,
@@ -587,20 +638,17 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_de
return -1;
}
- struct rte_mbuf **mbufs = alloca(nb_rx_desc * sizeof(struct rte_mbuf *));
- if (mbufs == NULL) {
- PMD_LOG(ERR, "alloca for %u failed", nb_rx_desc);
- return -1;
- }
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
- if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
- PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
- return -1;
- }
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ struct rte_mbuf *mb = eth_ioring_rx_alloc(rxq);
+ if (mb == NULL) {
+ PMD_LOG(ERR, "Rx mbuf alloc buf failed");
+ return -1;
+ }
- int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
- for (uint16_t i = 0; i < nb_rx_desc; i++)
- eth_rx_submit(rxq, fd, mbufs[i]);
+ eth_rx_submit(rxq, fd, mb);
+ }
io_uring_submit(&rxq->io_ring);
return 0;
@@ -740,31 +788,47 @@ eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
- if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
+ if (likely(io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh))
eth_ioring_tx_cleanup(txq);
int fd = eth_queue_fd(txq->port_id, txq->queue_id);
for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
struct rte_mbuf *mb = bufs[num_tx];
+ struct virtio_net_hdr *hdr;
struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
if (sqe == NULL)
break; /* submit ring is full */
- io_uring_sqe_set_data(sqe, mb);
+ if (rte_mbuf_refcnt_read(mb) == 1 && RTE_MBUF_DIRECT(mb) &&
+ rte_pktmbuf_headroom(mb) >= sizeof(*hdr)) {
+ hdr = (struct virtio_net_hdr *)rte_pktmbuf_prepend(mb, sizeof(*hdr));
+ } else {
+ struct rte_mbuf *mh = rte_pktmbuf_alloc(mb->pool);
+ if (unlikely(mh == NULL))
+ break;
+ hdr = (struct virtio_net_hdr *)rte_pktmbuf_append(mh, sizeof(*hdr));
- if (rte_mbuf_refcnt_read(mb) == 1 &&
- RTE_MBUF_DIRECT(mb) && mb->nb_segs == 1) {
- void *base = rte_pktmbuf_mtod(mb, void *);
- io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+ mh->next = mb;
+ mh->nb_segs = mb->nb_segs + 1;
+ mh->pkt_len += mb->pkt_len;
+ mh->ol_flags = mb->ol_flags & RTE_MBUF_F_TX_OFFLOAD_MASK;
+ mb = mh;
+ }
+ memset(hdr, 0, sizeof(*hdr));
- PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
- } else {
- PMD_LOG(ERR, "Can't do mbuf without space yet!");
- ++txq->tx_errors;
- continue;
+ io_uring_sqe_set_data(sqe, mb);
+
+ struct iovec iovs[RTE_MBUF_MAX_NB_SEGS + 1];
+ unsigned int niov = mb->nb_segs;
+ for (unsigned int i = 0; i < niov; i++) {
+ iovs[i].iov_base = rte_pktmbuf_mtod(mb, char *);
+ iovs[i].iov_len = mb->data_len;
+ mb = mb->next;
}
+
+ io_uring_prep_writev(sqe, fd, iovs, niov, 0);
}
if (num_tx > 0)
io_uring_submit(&txq->io_ring);
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v4 00/10] new ioring PMD
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (10 preceding siblings ...)
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 01/10] net/ioring: introduce new driver Stephen Hemminger
` (9 more replies)
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
13 siblings, 10 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This is a new simplified TAP device that uses the Linux kernel
io_uring API to provide read/write rings shared with the kernel.
It is split from the existing tap device because that driver
carries many unnecessary features, and supporting io_uring is
cleaner without ifdefs. The default name of the tap device
differs from other uses in DPDK, but the driver tries to keep
the same relevant devargs as before.
This driver only provides features that match what the kernel
does, so there is no flow support etc. This version adds checksum
offload and multi-segment packets. Some of the doc files may
still need updates.
v4 - more testing and offload support
Stephen Hemminger (10):
net/ioring: introduce new driver
net/ioring: implement link state
net/ioring: implement control functions
net/ioring: implement management functions
net/ioring: implement secondary process support
net/ioring: implement receive and transmit
net/ioring: implement statistics
net/ioring: support multi-segment Rx and Tx
net/ioring: support Tx checksum and segment offload
net/ioring: add support for Rx offload
MAINTAINERS | 6 +
doc/guides/nics/features/ioring.ini | 18 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 60 ++
drivers/net/ioring/meson.build | 15 +
drivers/net/ioring/rte_eth_ioring.c | 1288 +++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
7 files changed, 1389 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
--
2.47.2
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v4 01/10] net/ioring: introduce new driver
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 02/10] net/ioring: implement link state Stephen Hemminger
` (8 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add basic driver initialization, device creation,
and documentation.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
MAINTAINERS | 6 +
doc/guides/nics/features/ioring.ini | 9 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/ioring.rst | 66 +++++++
drivers/net/ioring/meson.build | 15 ++
drivers/net/ioring/rte_eth_ioring.c | 262 ++++++++++++++++++++++++++++
drivers/net/meson.build | 1 +
7 files changed, 360 insertions(+)
create mode 100644 doc/guides/nics/features/ioring.ini
create mode 100644 doc/guides/nics/ioring.rst
create mode 100644 drivers/net/ioring/meson.build
create mode 100644 drivers/net/ioring/rte_eth_ioring.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 82f6e2f917..78bf70c5a0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -853,6 +853,12 @@ F: drivers/net/intel/ipn3ke/
F: doc/guides/nics/ipn3ke.rst
F: doc/guides/nics/features/ipn3ke.ini
+Ioring - EXPERIMENTAL
+M: Stephen Hemminger <stephen@networkplumber.org>
+F: drivers/net/ioring/
+F: doc/guides/nics/ioring.rst
+F: doc/guides/nics/features/ioring.ini
+
Marvell cnxk
M: Nithin Dabilpuram <ndabilpuram@marvell.com>
M: Kiran Kumar K <kirankumark@marvell.com>
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
new file mode 100644
index 0000000000..c4c57caaa4
--- /dev/null
+++ b/doc/guides/nics/features/ioring.ini
@@ -0,0 +1,9 @@
+;
+; Supported features of the 'ioring' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index 10a2eca3b0..afb6bf289b 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -41,6 +41,7 @@ Network Interface Controller Drivers
igc
intel_vf
ionic
+ ioring
ipn3ke
ixgbe
mana
diff --git a/doc/guides/nics/ioring.rst b/doc/guides/nics/ioring.rst
new file mode 100644
index 0000000000..7d37a6bb37
--- /dev/null
+++ b/doc/guides/nics/ioring.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+IORING Poll Mode Driver
+=======================
+
+The IORING Poll Mode Driver (PMD) is a simplified and improved version of the TAP PMD.
+It is a virtual device that uses the Linux io_uring API to exchange packets with the Linux kernel.
+It is useful for DPDK applications that need to interact
+with the Linux TCP/IP stack for control plane traffic or tunneling.
+
+The IORING PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the IORING device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query information and statistics,
+and to send and receive packets.
+
+Requirements
+------------
+
+The IORING PMD requires the io_uring library (liburing), which provides the
+helper functions used to manage io_uring rings with the kernel.
+
+For more info on io_uring, please see:
+
+https://kernel.dk/io_uring.pdf
+
+
+Arguments
+---------
+
+IORING devices are created with the command line ``--vdev=net_ioring0`` option.
+This option may be specified more than once by repeating with a different ``net_ioringX`` device.
+
+By default, the Linux interfaces are named ``itap0``, ``itap1``, etc.
+The interface name can be specified by adding the ``iface=<name>`` argument, for example::
+
 --vdev=net_ioring0,iface=io0 --vdev=net_ioring1,iface=io1 ...
+
+The PMD inherits the MAC address assigned by the kernel, which will be
+a locally administered random Ethernet address.
+
+Normally, when the DPDK application exits, the IORING device is removed.
+This behavior can be overridden by using the ``persist`` flag, for example::
+
+ --vdev=net_ioring0,iface=io0,persist ...
+
+
+Multi-process sharing
+---------------------
+
+The IORING device does not support secondary processes (yet).
+
+
+Limitations
+-----------
+
+- The driver requires io_uring support, which was added in Linux kernel version 5.1.
+  Also, io_uring may be disabled in some environments or by security policies.
+
+- Since the IORING device uses a file descriptor to talk to the kernel,
+ the same number of queues must be specified for receive and transmit.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
diff --git a/drivers/net/ioring/meson.build b/drivers/net/ioring/meson.build
new file mode 100644
index 0000000000..264554d069
--- /dev/null
+++ b/drivers/net/ioring/meson.build
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+dep = dependency('liburing', required:false)
+reason = 'missing dependency, "liburing"'
+build = dep.found()
+ext_deps += dep
+
+sources = files('rte_eth_ioring.c')
+require_iova_in_mbuf = false
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
new file mode 100644
index 0000000000..4d5a5174db
--- /dev/null
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) Stephen Hemminger
+ */
+
+#include <ctype.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if.h>
+#include <linux/if_tun.h>
+
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#define IORING_DEFAULT_IFNAME "itap%d"
+
+RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
+#define RTE_LOGTYPE_IORING ioring_logtype
+#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+
+#define IORING_IFACE_ARG "iface"
+#define IORING_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ IORING_IFACE_ARG,
+ IORING_PERSIST_ARG,
+ NULL
+};
+
+struct pmd_internals {
+ int keep_fd; /* keep alive file descriptor */
+ char ifname[IFNAMSIZ]; /* name assigned by kernel */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+/* Creates a new tap device, name returned in ifr */
+static int
+tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "Open %s failed: %s", tun_dev, strerror(errno));
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNGETFEATURES) %s", strerror(errno));
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ if ((features & flags) != flags) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, features & flags);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+ /* If kernel supports using NAPI enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+ * Sets the device name and packet format.
+ * Do not want the protocol information (PI)
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETIFF) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+		PMD_LOG(ERR, "ioctl(TUNSETPERSIST) %s: %s",
+ ifr->ifr_name, strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+eth_dev_close(struct rte_eth_dev *dev)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing %s", pmd->ifname);
+
+ if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+ return 0;
+
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+
+ return 0;
+}
+
+static const struct eth_dev_ops ops = {
+ .dev_close = eth_dev_close,
+};
+
+static int
+ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct pmd_internals *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+
+ data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ dev->dev_ops = &ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { };
+ pmd->keep_fd = tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ strlcpy(pmd->ifname, ifr.ifr_name, IFNAMSIZ);
+
+ /* Read the MAC address assigned by the kernel */
+ if (ioctl(pmd->keep_fd, SIOCGIFHWADDR, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to get MAC address for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+ memcpy(&pmd->eth_addr, &ifr.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG(ERR, "Unable to detach keep-alive queue for %s: %s",
+ ifr.ifr_name, strerror(errno));
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+ return 0;
+
+error:
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
+static int
+parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+	/* value must not be empty or too long for an ifname */
+	if (value == NULL || value[0] == '\0' ||
+	    strnlen(value, IFNAMSIZ) == IFNAMSIZ)
+		return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+ioring_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+ return -1; /* TODO */
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, IORING_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, IORING_IFACE_ARG,
+ &parse_iface_arg, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, IORING_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ if (ioring_create(eth_dev, tap_name, persist) < 0)
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+
+error:
+ if (eth_dev != NULL)
+ rte_eth_dev_release_port(eth_dev);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+ioring_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ eth_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_ioring_drv = {
+ .probe = ioring_probe,
+ .remove = ioring_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_ioring, pmd_ioring_drv);
+RTE_PMD_REGISTER_ALIAS(net_ioring, eth_ioring);
+RTE_PMD_REGISTER_PARAM_STRING(net_ioring, IORING_IFACE_ARG "=<string> ");
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index 460eb69e5b..2e39136a5b 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -34,6 +34,7 @@ drivers = [
'intel/ixgbe',
'intel/cpfl', # depends on idpf, so must come after it
'ionic',
+ 'ioring',
'mana',
'memif',
'mlx4',
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v4 02/10] net/ioring: implement link state
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 01/10] net/ioring: introduce new driver Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 03/10] net/ioring: implement control functions Stephen Hemminger
` (7 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add hooks to set kernel link up/down and report state.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
drivers/net/ioring/rte_eth_ioring.c | 84 +++++++++++++++++++++++++++++
2 files changed, 85 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index c4c57caaa4..d4bf70cb4f 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 4d5a5174db..8c89497b1a 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -47,6 +47,53 @@ struct pmd_internals {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+static int
+eth_dev_change_flags(struct rte_eth_dev *dev, uint16_t flags, uint16_t mask)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret < 0)
+ goto error;
+
+ /* NB: ifr.ifr_flags is type short */
+ ifr.ifr_flags &= mask;
+ ifr.ifr_flags |= flags;
+
+ ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
+error:
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+static int
+eth_dev_get_flags(struct rte_eth_dev *dev, short *flags)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret == 0)
+ *flags = ifr.ifr_flags;
+
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+
/* Creates a new tap device, name returned in ifr */
static int
tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
@@ -103,6 +150,39 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+
+static int
+eth_dev_set_link_up(struct rte_eth_dev *dev)
+{
+	return eth_dev_change_flags(dev, IFF_UP, ~0);
+}
+
+static int
+eth_dev_set_link_down(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_UP);
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rte_eth_link *eth_link = &dev->data->dev_link;
+ short flags = 0;
+
+ if (eth_dev_get_flags(dev, &flags) < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCGIFFLAGS): %s", strerror(errno));
+ return -1;
+ }
+
+ *eth_link = (struct rte_eth_link) {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_status = (flags & IFF_UP) ? RTE_ETH_LINK_UP : RTE_ETH_LINK_DOWN,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ };
+ return 0;
+};
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -126,8 +206,12 @@ eth_dev_close(struct rte_eth_dev *dev)
static const struct eth_dev_ops ops = {
.dev_close = eth_dev_close,
+ .link_update = eth_link_update,
+ .dev_set_link_up = eth_dev_set_link_up,
+ .dev_set_link_down = eth_dev_set_link_down,
};
+
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v4 03/10] net/ioring: implement control functions
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 01/10] net/ioring: introduce new driver Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 02/10] net/ioring: implement link state Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 04/10] net/ioring: implement management functions Stephen Hemminger
` (6 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
These internal ops just apply the requested changes to the kernel-visible net device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 3 ++
drivers/net/ioring/rte_eth_ioring.c | 69 +++++++++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index d4bf70cb4f..199c7cd31c 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -5,6 +5,9 @@
;
[Features]
Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 8c89497b1a..fe3c72098c 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -13,6 +13,7 @@
#include <sys/socket.h>
#include <net/if.h>
#include <linux/if.h>
+#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <bus_vdev_driver.h>
@@ -163,6 +164,30 @@ eth_dev_set_link_down(struct rte_eth_dev *dev)
return eth_dev_change_flags(dev, 0, ~IFF_UP);
}
+static int
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_PROMISC, ~0);
+}
+
+static int
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_PROMISC);
+}
+
+static int
+eth_dev_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, IFF_ALLMULTI, ~0);
+}
+
+static int
+eth_dev_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ return eth_dev_change_flags(dev, 0, ~IFF_ALLMULTI);
+}
+
static int
eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
{
@@ -183,6 +208,44 @@ eth_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
return 0;
};
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { .ifr_mtu = mtu };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
+static int
+eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+ struct ifreq ifr = { };
+ int ret;
+
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+ ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+ memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
+
+ ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+
+ return ret;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -209,6 +272,12 @@ static const struct eth_dev_ops ops = {
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
.dev_set_link_down = eth_dev_set_link_down,
+ .mac_addr_set = eth_dev_macaddr_set,
+ .mtu_set = eth_dev_mtu_set,
+ .promiscuous_enable = eth_dev_promiscuous_enable,
+ .promiscuous_disable = eth_dev_promiscuous_disable,
+ .allmulticast_enable = eth_dev_allmulticast_enable,
+ .allmulticast_disable = eth_dev_allmulticast_disable,
};
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
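The MAC-address op above works by filling a `struct ifreq` and issuing `SIOCSIFHWADDR`. A standalone sketch of just the request construction, with no ioctl issued (the `make_hwaddr_req` helper is hypothetical; the driver uses `strlcpy` where plain `strncpy` into a zeroed struct is used here):

```c
#include <assert.h>
#include <string.h>
#include <net/if.h>        /* struct ifreq, IFNAMSIZ */
#include <linux/if_arp.h>  /* ARPHRD_ETHER */

/* Build an ifreq for SIOCSIFHWADDR the way eth_dev_macaddr_set() does. */
static struct ifreq make_hwaddr_req(const char *ifname, const unsigned char mac[6])
{
	struct ifreq ifr = { };

	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;   /* kernel checks the family */
	memcpy(ifr.ifr_hwaddr.sa_data, mac, 6);
	return ifr;
}
```

The actual driver then passes the struct to `ioctl(sock, SIOCSIFHWADDR, &ifr)` on an `AF_INET` datagram socket.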
* [PATCH v4 04/10] net/ioring: implement management functions
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (2 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 03/10] net/ioring: implement control functions Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 05/10] net/ioring: implement secondary process support Stephen Hemminger
` (5 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add start, stop, configure, and info functions.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 72 ++++++++++++++++++++++++++---
1 file changed, 66 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index fe3c72098c..b5b5ffdee3 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -212,12 +212,16 @@ static int
eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
{
struct pmd_internals *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
struct ifreq ifr = { .ifr_mtu = mtu };
- int ret;
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
- ret = ioctl(pmd->ctl_sock, SIOCSIFMTU, &ifr);
+ int ret = ioctl(sock, SIOCSIFMTU, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
ret = -errno;
@@ -230,14 +234,17 @@ static int
eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
{
struct pmd_internals *pmd = dev->data->dev_private;
- struct ifreq ifr = { };
- int ret;
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { };
strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
- ret = ioctl(pmd->ctl_sock, SIOCSIFHWADDR, &ifr);
+ int ret = ioctl(sock, SIOCSIFHWADDR, &ifr);
if (ret < 0) {
PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
ret = -errno;
@@ -246,6 +253,56 @@ eth_dev_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
return ret;
}
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ eth_dev_set_link_up(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ eth_dev_set_link_down(dev);
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev)
+{
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ dev_info->if_index = if_nametoindex(pmd->ifname);
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -263,11 +320,14 @@ eth_dev_close(struct rte_eth_dev *dev)
close(pmd->keep_fd);
pmd->keep_fd = -1;
}
-
return 0;
}
static const struct eth_dev_ops ops = {
+ .dev_start = eth_dev_start,
+ .dev_stop = eth_dev_stop,
+ .dev_configure = eth_dev_configure,
+ .dev_infos_get = eth_dev_info,
.dev_close = eth_dev_close,
.link_update = eth_link_update,
.dev_set_link_up = eth_dev_set_link_up,
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
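The configure op enforces one invariant: the driver shares one tap fd per queue pair, so rx and tx queue counts must match. A trivial standalone model of that check (the `configure_check` name is hypothetical; the return convention mirrors `eth_dev_configure()`):

```c
#include <assert.h>
#include <errno.h>

/* Model of eth_dev_configure(): rx/tx queues must be paired because each
 * queue index maps to a single shared tap file descriptor. */
static int configure_check(unsigned int nb_rx_queues, unsigned int nb_tx_queues)
{
	if (nb_rx_queues != nb_tx_queues)
		return -EINVAL;
	return 0;
}
```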
* [PATCH v4 05/10] net/ioring: implement secondary process support
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (3 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 04/10] net/ioring: implement management functions Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 06/10] net/ioring: implement receive and transmit Stephen Hemminger
` (4 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for passing fds from the primary to secondary processes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 1 +
doc/guides/nics/ioring.rst | 6 --
drivers/net/ioring/rte_eth_ioring.c | 136 +++++++++++++++++++++++++++-
3 files changed, 135 insertions(+), 8 deletions(-)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index 199c7cd31c..da47062adb 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -8,6 +8,7 @@ Link status = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
+Multiprocess aware = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/doc/guides/nics/ioring.rst b/doc/guides/nics/ioring.rst
index 7d37a6bb37..69102a5b38 100644
--- a/doc/guides/nics/ioring.rst
+++ b/doc/guides/nics/ioring.rst
@@ -46,12 +46,6 @@ But this behavior can be overridden by the use of the persist flag, example::
--vdev=net_ioring0,iface=io0,persist ...
-Multi-process sharing
----------------------
-
-The IORING device does not support secondary process (yet).
-
-
Limitations
-----------
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index b5b5ffdee3..f01db960a7 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -28,6 +28,7 @@
#include <rte_log.h>
#define IORING_DEFAULT_IFNAME "itap%d"
+#define IORING_MP_KEY "ioring_mp_send_fds"
RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
@@ -400,6 +401,84 @@ parse_iface_arg(const char *key __rte_unused, const char *value, void *extra_arg
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+ioring_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, IORING_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name);
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+ struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+ struct rte_mp_reply replies;
+ int ret = rte_mp_request_sync(&request, &replies, &timeout);
+ if (ret < 0 || replies.nb_received != 1) {
+ PMD_LOG(ERR, "Failed to request fds from primary: %s",
+ rte_strerror(rte_errno));
+ return -1;
+ }
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++)
+ fds[i] = reply->fds[i];
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+ioring_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (!dev) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the tap fd for each queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++)
+ reply.fds[reply.num_fds++] = fds[i];
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name);
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
ioring_probe(struct rte_vdev_device *vdev)
{
@@ -407,14 +486,43 @@ ioring_probe(struct rte_vdev_device *vdev)
const char *params = rte_vdev_device_args(vdev);
struct rte_kvargs *kvlist = NULL;
struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
char tap_name[IFNAMSIZ] = IORING_DEFAULT_IFNAME;
uint8_t persist = 0;
int ret;
PMD_LOG(INFO, "Initializing %s", name);
- if (rte_eal_process_type() == RTE_PROC_SECONDARY)
- return -1; /* TODO */
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (!eth_dev) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &ops;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ return -1;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ return -1;
+ }
+
+ eth_dev->process_private = fds;
+
+ if (ioring_request_fds(name, eth_dev))
+ return -1;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
@@ -432,21 +540,45 @@ ioring_probe(struct rte_vdev_device *vdev)
persist = 1;
}
+ /* Per-queue tap fd's (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ return -1;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct pmd_internals));
if (eth_dev == NULL) {
PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
goto error;
}
+ eth_dev->data->dev_flags = RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->dev_ops = &ops;
+ eth_dev->process_private = fds;
+
if (ioring_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+ static unsigned int ioring_dev_count;
+ if (ioring_dev_count == 0) {
+ if (rte_mp_action_register(IORING_MP_KEY, ioring_mp_send_fds) < 0) {
+ PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+ rte_strerror(rte_errno));
+ goto error;
+ }
+ }
+ ++ioring_dev_count;
rte_eth_dev_probing_finish(eth_dev);
return 0;
error:
if (eth_dev != NULL)
rte_eth_dev_release_port(eth_dev);
+ free(fds);
rte_kvargs_free(kvlist);
return -1;
}
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
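The `rte_mp_request_sync()`/`rte_mp_reply()` exchange above carries up to `RTE_MP_MAX_FD_NUM` descriptors per message; under the hood DPDK's IPC moves fds over a Unix-domain socket with `SCM_RIGHTS` ancillary data. A minimal, DPDK-independent sketch of that underlying mechanism, using a socketpair instead of DPDK's IPC socket (helper names are hypothetical):

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

/* Send one fd over a Unix socket as SCM_RIGHTS ancillary data. */
static int send_one_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Receive the fd sent by send_one_fd(); the kernel installs a fresh
 * descriptor in the receiver, so the returned number may differ. */
static int recv_one_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	int fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg != NULL && cmsg->cmsg_level == SOL_SOCKET &&
	    cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

This is why a secondary process can read and write the primary's tap queues: the received fd refers to the same open file description.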
* [PATCH v4 06/10] net/ioring: implement receive and transmit
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (4 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 05/10] net/ioring: implement secondary process support Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 07/10] net/ioring: implement statistics Stephen Hemminger
` (3 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use io_uring to read from and write to the TAP device.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 366 +++++++++++++++++++++++++++-
1 file changed, 365 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index f01db960a7..2f049e4c4f 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -2,6 +2,7 @@
* Copyright (c) Stephen Hemminger
*/
+#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <fcntl.h>
@@ -9,8 +10,10 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <liburing.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
+#include <sys/uio.h>
#include <net/if.h>
#include <linux/if.h>
#include <linux/if_arp.h>
@@ -27,6 +30,12 @@
#include <rte_kvargs.h>
#include <rte_log.h>
+#define IORING_DEFAULT_BURST 64
+#define IORING_NUM_BUFFERS 1024
+#define IORING_MAX_QUEUES 128
+
+static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -34,6 +43,20 @@ RTE_LOG_REGISTER_DEFAULT(ioring_logtype, NOTICE);
#define RTE_LOGTYPE_IORING ioring_logtype
#define PMD_LOG(level, ...) RTE_LOG_LINE_PREFIX(level, IORING, "%s(): ", __func__, __VA_ARGS__)
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, IORING, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
#define IORING_IFACE_ARG "iface"
#define IORING_PERSIST_ARG "persist"
@@ -43,6 +66,30 @@ static const char * const valid_arguments[] = {
NULL
};
+struct rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted read's */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_nombuf;
+ uint64_t rx_errors;
+};
+
+struct tx_queue {
+ struct io_uring io_ring;
+
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+};
+
struct pmd_internals {
int keep_fd; /* keep alive file descriptor */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
@@ -300,6 +347,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->if_index = if_nametoindex(pmd->ifname);
dev_info->max_mac_addrs = 1;
dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+ dev_info->max_rx_queues = IORING_MAX_QUEUES;
+ dev_info->max_tx_queues = IORING_MAX_QUEUES;
+ dev_info->min_rx_bufsize = 0;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = IORING_DEFAULT_BURST,
+ .ring_size = IORING_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
return 0;
}
@@ -311,6 +367,14 @@ eth_dev_close(struct rte_eth_dev *dev)
PMD_LOG(INFO, "Closing %s", pmd->ifname);
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ if (fds[i] == -1)
+ continue;
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() != RTE_PROC_PRIMARY)
return 0;
@@ -324,6 +388,299 @@ eth_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Setup another fd to TAP device for the queue */
+static int
+eth_queue_setup(struct rte_eth_dev *dev, const char *name, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+
+ if (fds[queue_id] != -1)
+ return 0; /* already setup */
+
+ struct ifreq ifr = { };
+ int tap_fd = tap_open(name, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+static int
+eth_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+
+ return fds[queue_id];
+}
+
+/* set up a submit queue entry to read into an mbuf */
+static inline void
+eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+
+ if (unlikely(sqe == NULL)) {
+ PMD_LOG(DEBUG, "io_uring no rx sqe");
+ rxq->rx_errors++;
+ rte_pktmbuf_free(mb);
+ return;
+ }
+ io_uring_sqe_set_data(sqe, mb);
+
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, 0);
+ unsigned int nbytes = rte_pktmbuf_tailroom(mb);
+
+ io_uring_prep_read(sqe, fd, buf, nbytes, 0);
+}
+
+static uint16_t
+eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ ssize_t len = cqe->res;
+
+ PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
+ num_cqe++;
+
+ if (unlikely(len < RTE_ETHER_HDR_LEN)) {
+ if (len < 0)
+ PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_LOG(ERR, "io_uring_read missing hdr");
+
+ rxq->rx_errors++;
+ goto resubmit;
+ }
+
+ struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(nmb == NULL)) {
+ PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ ++rxq->rx_nombuf;
+ goto resubmit;
+ }
+
+ mb->pkt_len = len;
+ mb->data_len = len;
+ mb->port = rxq->port_id;
+ __rte_mbuf_sanity_check(mb, 1);
+
+ num_bytes += len;
+ bufs[num_rx++] = mb;
+
+ mb = nmb;
+resubmit:
+ eth_rx_submit(rxq, fd, mb);
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+ return num_rx;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ /* open the shared tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ struct rte_mbuf **mbufs = alloca(nb_rx_desc * sizeof(struct rte_mbuf *));
+ if (mbufs == NULL) {
+ PMD_LOG(ERR, "alloca for %u failed", nb_rx_desc);
+ return -1;
+ }
+
+ if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
+ PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
+ return -1;
+ }
+
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
+ for (uint16_t i = 0; i < nb_rx_desc; i++)
+ eth_rx_submit(rxq, fd, mbufs[i]);
+
+ io_uring_submit(&rxq->io_ring);
+ return 0;
+}
+
+static void
+eth_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&rxq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&rxq->io_ring);
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
+ /* open the shared tap fd if not already set up */
+ if (eth_queue_setup(dev, pmd->ifname, queue_id) < 0)
+ return -1;
+
+ struct tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+eth_ioring_tx_cleanup(struct tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0;
+ unsigned int tx_done = 0;
+ uint64_t tx_bytes = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ num_cqe++;
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ if (unlikely(cqe->res < 0)) {
+ ++txq->tx_errors;
+ } else {
+ ++tx_done;
+ tx_bytes += mb->pkt_len;
+ }
+ rte_pktmbuf_free(mb);
+ }
+ io_uring_cq_advance(&txq->io_ring, num_cqe);
+
+ txq->tx_packets += tx_done;
+ txq->tx_bytes += tx_bytes;
+}
+
+static void
+eth_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ eth_ioring_tx_cleanup(txq);
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ PMD_LOG(ERR, "io_uring_get_sqe failed: %s", strerror(errno));
+ } else {
+ io_uring_prep_cancel(sqe, NULL, IORING_ASYNC_CANCEL_ANY);
+ io_uring_submit_and_wait(&txq->io_ring, 1);
+ }
+
+ io_uring_queue_exit(&txq->io_ring);
+}
+
+static uint16_t
+eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct tx_queue *txq = queue;
+ uint16_t num_tx;
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
+ eth_ioring_tx_cleanup(txq);
+
+ int fd = eth_queue_fd(txq->port_id, txq->queue_id);
+
+ for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
+ struct rte_mbuf *mb = bufs[num_tx];
+
+ /* only contiguous, unshared, single-segment mbufs for now */
+ if (rte_mbuf_refcnt_read(mb) != 1 ||
+ !RTE_MBUF_DIRECT(mb) || mb->nb_segs != 1) {
+ PMD_LOG(ERR, "Can't do mbuf without space yet!");
+ ++txq->tx_errors;
+ rte_pktmbuf_free(mb);
+ continue;
+ }
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL)
+ break; /* submit ring is full */
+
+ void *base = rte_pktmbuf_mtod(mb, void *);
+ io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+ io_uring_sqe_set_data(sqe, mb);
+ PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
+ }
+
+ if (likely(num_tx > 0))
+ io_uring_submit(&txq->io_ring);
+
+ return num_tx;
+}
+
static const struct eth_dev_ops ops = {
.dev_start = eth_dev_start,
.dev_stop = eth_dev_stop,
@@ -339,9 +696,12 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .rx_queue_setup = eth_rx_queue_setup,
+ .rx_queue_release = eth_rx_queue_release,
+ .tx_queue_setup = eth_tx_queue_setup,
+ .tx_queue_release = eth_tx_queue_release,
};
-
static int
ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
{
@@ -379,6 +739,10 @@ ioring_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
}
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+
+ dev->rx_pkt_burst = eth_ioring_rx;
+ dev->tx_pkt_burst = eth_ioring_tx;
+
return 0;
error:
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
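The receive path above applies one policy per completion: a result shorter than an Ethernet header (which covers negative `-errno` results from io_uring) is counted as an error and the same mbuf is resubmitted; otherwise the mbuf is delivered and a fresh one takes its place. A standalone model of just that classification (names and the `ETHER_HDR_LEN` stand-in are hypothetical; the threshold mirrors the `len < RTE_ETHER_HDR_LEN` test in `eth_ioring_rx()`):

```c
#include <assert.h>

#define ETHER_HDR_LEN 14  /* stand-in for RTE_ETHER_HDR_LEN */

struct rx_counters { unsigned int delivered, errors; };

/* Classify a batch of cqe->res values the way eth_ioring_rx() does:
 * anything below a full Ethernet header (including negative errno
 * results) is an error whose buffer gets resubmitted. */
static struct rx_counters classify_completions(const int *res, int n)
{
	struct rx_counters c = { 0, 0 };

	for (int i = 0; i < n; i++) {
		if (res[i] < ETHER_HDR_LEN)
			c.errors++;
		else
			c.delivered++;
	}
	return c;
}
```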
* [PATCH v4 07/10] net/ioring: implement statistics
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (5 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 06/10] net/ioring: implement receive and transmit Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 08/10] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
` (2 subsequent siblings)
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for basic statistics.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 2 +
drivers/net/ioring/rte_eth_ioring.c | 57 +++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index da47062adb..c9a4582d0e 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -9,6 +9,8 @@ MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
Multiprocess aware = Y
+Basic stats = Y
+Stats per queue = Y
Linux = Y
x86-64 = Y
Usage doc = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 2f049e4c4f..18546f0137 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -360,6 +360,61 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
return 0;
}
+static int
+eth_dev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ const struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+ stats->rx_nombuf += rxq->rx_nombuf;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_ipackets[i] = rxq->rx_packets;
+ stats->q_ibytes[i] = rxq->rx_bytes;
+ }
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ const struct tx_queue *txq = dev->data->tx_queues[i];
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ stats->q_opackets[i] = txq->tx_packets;
+ stats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ return 0;
+}
+
+static int
+eth_dev_stats_reset(struct rte_eth_dev *dev)
+{
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rx_queue *rxq = dev->data->rx_queues[i];
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_nombuf = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (uint16_t i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct tx_queue *txq = dev->data->tx_queues[i];
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+ return 0;
+}
+
static int
eth_dev_close(struct rte_eth_dev *dev)
{
@@ -696,6 +751,8 @@ static const struct eth_dev_ops ops = {
.promiscuous_disable = eth_dev_promiscuous_disable,
.allmulticast_enable = eth_dev_allmulticast_enable,
.allmulticast_disable = eth_dev_allmulticast_disable,
+ .stats_get = eth_dev_stats_get,
+ .stats_reset = eth_dev_stats_reset,
.rx_queue_setup = eth_rx_queue_setup,
.rx_queue_release = eth_rx_queue_release,
.tx_queue_setup = eth_tx_queue_setup,
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
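The stats op sums the per-queue software counters into the device totals, but fills the per-queue `q_*` arrays only for the first `RTE_ETHDEV_QUEUE_STAT_CNTRS` queues, since ethdev caps those arrays. A standalone model of the rx side of that aggregation (struct and helper names are hypothetical; the cutoff mirrors `eth_dev_stats_get()`):

```c
#include <assert.h>

#define QUEUE_STAT_CNTRS 16  /* stand-in for RTE_ETHDEV_QUEUE_STAT_CNTRS */

struct rx_totals {
	unsigned long ipackets;
	unsigned long q_ipackets[QUEUE_STAT_CNTRS];
};

/* Sum per-queue packet counters into device totals; per-queue slots
 * exist only for the first QUEUE_STAT_CNTRS queues. */
static void sum_rx_stats(struct rx_totals *t, const unsigned long *pkts,
			 unsigned int nb_queues)
{
	for (unsigned int i = 0; i < nb_queues; i++) {
		t->ipackets += pkts[i];
		if (i < QUEUE_STAT_CNTRS)
			t->q_ipackets[i] = pkts[i];
	}
}
```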
* [PATCH v4 08/10] net/ioring: support multi-segment Rx and Tx
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (6 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 07/10] net/ioring: implement statistics Stephen Hemminger
@ 2025-03-13 21:50 ` Stephen Hemminger
2025-03-13 21:51 ` [PATCH v4 09/10] net/ioring: support Tx checksum and segment offload Stephen Hemminger
2025-03-13 21:51 ` [PATCH v4 10/10] net/ioring: add support for Rx offload Stephen Hemminger
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:50 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Use readv/writev to handle multi-segment transmit and receive.
Account for the virtio header that will be used for offloads later.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 203 ++++++++++++++++++++++------
1 file changed, 160 insertions(+), 43 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 18546f0137..633bfc21c2 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -18,6 +18,7 @@
#include <linux/if.h>
#include <linux/if_arp.h>
#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
#include <bus_vdev_driver.h>
#include <ethdev_driver.h>
@@ -30,12 +31,18 @@
#include <rte_kvargs.h>
#include <rte_log.h>
+static_assert(RTE_PKTMBUF_HEADROOM >= sizeof(struct virtio_net_hdr), "headroom too small");
+
#define IORING_DEFAULT_BURST 64
#define IORING_NUM_BUFFERS 1024
#define IORING_MAX_QUEUES 128
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_MULTI_SEGS
+
+#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_SCATTER
+
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -162,7 +169,7 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
goto error;
}
- int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI;
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI | IFF_VNET_HDR;
if ((features & flags) != flags) {
PMD_LOG(ERR, "TUN features %#x missing support for %#x",
features, features & flags);
@@ -193,6 +200,13 @@ tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
goto error;
}
+
+ int hdr_size = sizeof(struct virtio_net_hdr);
+ if (ioctl(tap_fd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETVNETHDRSZ) %s", strerror(errno));
+ goto error;
+ }
+
return tap_fd;
error:
close(tap_fd);
@@ -350,6 +364,8 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->max_rx_queues = IORING_MAX_QUEUES;
dev_info->max_tx_queues = IORING_MAX_QUEUES;
dev_info->min_rx_bufsize = 0;
+ dev_info->tx_queue_offload_capa = IORING_TX_OFFLOAD;
+ dev_info->tx_offload_capa = dev_info->tx_queue_offload_capa;
dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
.burst_size = IORING_DEFAULT_BURST,
@@ -478,7 +494,6 @@ static inline void
eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
{
struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
-
if (unlikely(sqe == NULL)) {
PMD_LOG(DEBUG, "io_uring no rx sqe");
rxq->rx_errors++;
@@ -487,10 +502,81 @@ eth_rx_submit(struct rx_queue *rxq, int fd, struct rte_mbuf *mb)
}
io_uring_sqe_set_data(sqe, mb);
- void *buf = rte_pktmbuf_mtod_offset(mb, void *, 0);
- unsigned int nbytes = rte_pktmbuf_tailroom(mb);
+ RTE_ASSERT(rte_pktmbuf_headroom(mb) >= sizeof(struct virtio_net_hdr));
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(struct virtio_net_hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + rte_pktmbuf_tailroom(mb);
+
+ /* optimize for the case where packet fits in one mbuf */
+ if (mb->nb_segs == 1) {
+ io_uring_prep_read(sqe, fd, buf, nbytes, 0);
+ } else {
+ uint16_t nsegs = mb->nb_segs;
+ RTE_ASSERT(nsegs > 0 && nsegs < IOV_MAX);
+ struct iovec iovs[RTE_MBUF_MAX_NB_SEGS];
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (uint16_t i = 1; i < nsegs; i++) {
+ mb = mb->next;
+ iovs[i].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[i].iov_len = rte_pktmbuf_tailroom(mb);
+ }
+ io_uring_prep_readv(sqe, fd, iovs, nsegs, 0);
+ }
+
+}
+
+
+/* Allocates one or more mbuf's to be used for reading packets */
+static struct rte_mbuf *
+eth_ioring_rx_alloc(struct rx_queue *rxq)
+{
+ const struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+ int buf_size = dev->data->mtu;
+ struct rte_mbuf *m = NULL;
+ struct rte_mbuf **tail = &m;
+
+ do {
+ struct rte_mbuf *seg = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(seg == NULL)) {
+ rte_pktmbuf_free(m);
+ return NULL;
+ }
+ *tail = seg;
+ tail = &seg->next;
+ if (seg != m)
+ ++m->nb_segs;
+
+ buf_size -= rte_pktmbuf_tailroom(seg);
+ } while (buf_size > 0);
+
+ __rte_mbuf_sanity_check(m, 1);
+ return m;
+}
+
+
+/* set length of received mbuf segments */
+static inline void
+eth_ioring_rx_adjust(struct rte_mbuf *mb, size_t len)
+{
+ struct rte_mbuf *seg;
+ unsigned int nsegs = 0;
+
+	for (seg = mb; ; seg = seg->next) {
+		uint16_t seg_len = RTE_MIN(len, rte_pktmbuf_tailroom(seg));
+
+		seg->data_len = seg_len;
+		len -= seg_len;
+		++nsegs;
+
+		if (len == 0 || seg->next == NULL)
+			break;
+	}
-	io_uring_prep_read(sqe, fd, buf, nbytes, 0);
+	mb->nb_segs = nsegs;
+	if (seg->next != NULL) {
+		/* free any residual segments beyond the received data */
+		rte_pktmbuf_free(seg->next);
+		seg->next = NULL;
+	}
}
static uint16_t
@@ -505,37 +591,42 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
- ssize_t len = cqe->res;
+ int32_t len = cqe->res;
- PMD_RX_LOG(DEBUG, "cqe %u len %zd", num_cqe, len);
- num_cqe++;
+ PMD_RX_LOG(DEBUG, "complete m=%p len=%d", mb, len);
- if (unlikely(len < RTE_ETHER_HDR_LEN)) {
- if (len < 0)
- PMD_LOG(ERR, "io_uring_read: %s", strerror(-len));
- else
- PMD_LOG(ERR, "io_uring_read missing hdr");
+ num_cqe++;
+ struct virtio_net_hdr *hdr;
+ if (unlikely(len < (ssize_t)(sizeof(*hdr) + RTE_ETHER_HDR_LEN))) {
+ PMD_LOG(ERR, "io_uring_read result = %d", len);
rxq->rx_errors++;
goto resubmit;
}
- struct rte_mbuf *nmb = rte_pktmbuf_alloc(rxq->mb_pool);
- if (unlikely(nmb == 0)) {
- PMD_LOG(DEBUG, "Rx mbuf alloc failed");
+ /* virtio header is before packet data */
+ hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ len -= sizeof(*hdr);
+
+ struct rte_mbuf *nmb = eth_ioring_rx_alloc(rxq);
+ if (!nmb) {
+ PMD_RX_LOG(NOTICE, "alloc failed");
++rxq->rx_nombuf;
goto resubmit;
}
- mb->pkt_len = len;
- mb->data_len = len;
mb->port = rxq->port_id;
- __rte_mbuf_sanity_check(mb, 1);
+ mb->pkt_len = len;
+
+ if (mb->nb_segs == 1)
+ mb->data_len = len;
+ else
+ eth_ioring_rx_adjust(mb, len);
- num_bytes += len;
+ num_bytes += mb->pkt_len;
bufs[num_rx++] = mb;
- mb = nmb;
+ mb = nmb; /* use the new buffer when resubmitting */
resubmit:
eth_rx_submit(rxq, fd, mb);
@@ -581,20 +672,17 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_de
return -1;
}
- struct rte_mbuf **mbufs = alloca(nb_rx_desc * sizeof(struct rte_mbuf *));
- if (mbufs == NULL) {
- PMD_LOG(ERR, "alloca for %u failed", nb_rx_desc);
- return -1;
- }
+ int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
- if (rte_pktmbuf_alloc_bulk(mb_pool, mbufs, nb_rx_desc) < 0) {
- PMD_LOG(ERR, "Rx mbuf alloc %u bufs failed", nb_rx_desc);
- return -1;
- }
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ struct rte_mbuf *mb = eth_ioring_rx_alloc(rxq);
+ if (mb == NULL) {
+ PMD_LOG(ERR, "Rx mbuf alloc buf failed");
+ return -1;
+ }
- int fd = eth_queue_fd(rxq->port_id, rxq->queue_id);
- for (uint16_t i = 0; i < nb_rx_desc; i++)
- eth_rx_submit(rxq, fd, mbufs[i]);
+ eth_rx_submit(rxq, fd, mb);
+ }
io_uring_submit(&rxq->io_ring);
return 0;
@@ -701,8 +789,6 @@ eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
if (unlikely(nb_pkts == 0))
return 0;
- PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
-
if (io_uring_sq_space_left(&txq->io_ring) < txq->free_thresh)
eth_ioring_tx_cleanup(txq);
@@ -710,23 +796,54 @@ eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
for (num_tx = 0; num_tx < nb_pkts; num_tx++) {
struct rte_mbuf *mb = bufs[num_tx];
+ struct virtio_net_hdr *hdr;
struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
if (sqe == NULL)
break; /* submit ring is full */
+ if (rte_mbuf_refcnt_read(mb) == 1 &&
+ RTE_MBUF_DIRECT(mb) &&
+ rte_pktmbuf_headroom(mb) >= sizeof(*hdr)) {
+			hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ } else {
+ struct rte_mbuf *mh = rte_pktmbuf_alloc(mb->pool);
+ if (unlikely(mh == NULL)) {
+ ++txq->tx_errors;
+ rte_pktmbuf_free(mb);
+ continue;
+ }
+
+			hdr = rte_pktmbuf_mtod_offset(mh, struct virtio_net_hdr *, -sizeof(*hdr));
+ mh->next = mb;
+ mh->nb_segs = mb->nb_segs + 1;
+ mh->pkt_len = mb->pkt_len;
+ mh->ol_flags = mb->ol_flags & RTE_MBUF_F_TX_OFFLOAD_MASK;
+ mb = mh;
+ }
+
io_uring_sqe_set_data(sqe, mb);
- if (rte_mbuf_refcnt_read(mb) == 1 &&
- RTE_MBUF_DIRECT(mb) && mb->nb_segs == 1) {
- void *base = rte_pktmbuf_mtod(mb, void *);
- io_uring_prep_write(sqe, fd, base, mb->pkt_len, 0);
+ PMD_TX_LOG(DEBUG, "write m=%p segs=%u", mb, mb->nb_segs);
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(*hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + mb->data_len;
- PMD_TX_LOG(DEBUG, "tx mbuf: %p submit", mb);
+ if (mb->nb_segs == 1) {
+ io_uring_prep_write(sqe, fd, buf, nbytes, 0);
} else {
- PMD_LOG(ERR, "Can't do mbuf without space yet!");
- ++txq->tx_errors;
- continue;
+ struct iovec iovs[RTE_MBUF_MAX_NB_SEGS + 1];
+ unsigned int niov = mb->nb_segs;
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (unsigned int i = 1; i < niov; i++) {
+ mb = mb->next;
+ iovs[i].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[i].iov_len = mb->data_len;
+ }
+
+ io_uring_prep_writev(sqe, fd, iovs, niov, 0);
}
}
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v4 09/10] net/ioring: support Tx checksum and segment offload
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (7 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 08/10] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
@ 2025-03-13 21:51 ` Stephen Hemminger
2025-03-13 21:51 ` [PATCH v4 10/10] net/ioring: add support for Rx offload Stephen Hemminger
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
The transmit flag mapping is adapted from the virtio driver.
The TAP device provides no way to query which virtio net header
features the kernel supports, so the driver assumes checksum offload
is always available.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/ioring.ini | 2 ++
drivers/net/ioring/rte_eth_ioring.c | 51 ++++++++++++++++++++++++++++-
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/features/ioring.ini b/doc/guides/nics/features/ioring.ini
index c9a4582d0e..17497e1fd3 100644
--- a/doc/guides/nics/features/ioring.ini
+++ b/doc/guides/nics/features/ioring.ini
@@ -10,6 +10,8 @@ Promiscuous mode = Y
Allmulticast mode = Y
Multiprocess aware = Y
Basic stats = Y
+L3 checksum offload = Y
+L4 checksum offload = Y
Stats per queue = Y
Linux = Y
x86-64 = Y
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 633bfc21c2..704b887d36 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -39,7 +39,10 @@ static_assert(RTE_PKTMBUF_HEADROOM >= sizeof(struct virtio_net_hdr));
static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
-#define IORING_TX_OFFLOAD RTE_ETH_TX_OFFLOAD_MULTI_SEGS
+#define IORING_TX_OFFLOAD (RTE_ETH_TX_OFFLOAD_MULTI_SEGS | \
+ RTE_ETH_TX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_TSO)
#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_SCATTER
@@ -780,6 +783,51 @@ eth_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
io_uring_queue_exit(&txq->io_ring);
}
+/* Convert mbuf offload flags to virtio net header */
+static void
+eth_ioring_tx_offload(struct virtio_net_hdr *hdr, const struct rte_mbuf *m)
+{
+ uint64_t csum_l4 = m->ol_flags & RTE_MBUF_F_TX_L4_MASK;
+ uint16_t o_l23_len = (m->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK) ?
+ m->outer_l2_len + m->outer_l3_len : 0;
+
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ csum_l4 |= RTE_MBUF_F_TX_TCP_CKSUM;
+
+ switch (csum_l4) {
+ case RTE_MBUF_F_TX_UDP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_udp_hdr, dgram_cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+
+ case RTE_MBUF_F_TX_TCP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+
+ default:
+ hdr->csum_start = 0;
+ hdr->csum_offset = 0;
+ hdr->flags = 0;
+ break;
+ }
+
+ /* TCP Segmentation Offload */
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ hdr->gso_type = (m->ol_flags & RTE_MBUF_F_TX_IPV6) ?
+ VIRTIO_NET_HDR_GSO_TCPV6 :
+ VIRTIO_NET_HDR_GSO_TCPV4;
+ hdr->gso_size = m->tso_segsz;
+ hdr->hdr_len = o_l23_len + m->l2_len + m->l3_len + m->l4_len;
+ } else {
+ hdr->gso_type = 0;
+ hdr->gso_size = 0;
+ hdr->hdr_len = 0;
+ }
+}
+
static uint16_t
eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
@@ -823,6 +871,7 @@ eth_ioring_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
}
io_uring_sqe_set_data(sqe, mb);
+ eth_ioring_tx_offload(hdr, mb);
PMD_TX_LOG(DEBUG, "write m=%p segs=%u", mb, mb->nb_segs);
void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(*hdr));
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v4 10/10] net/ioring: add support for Rx offload
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
` (8 preceding siblings ...)
2025-03-13 21:51 ` [PATCH v4 09/10] net/ioring: support Tx checksum and segment offload Stephen Hemminger
@ 2025-03-13 21:51 ` Stephen Hemminger
9 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2025-03-13 21:51 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
The TAP device supports receive offloads. Use the virtio net header
supplied by the kernel to set checksum and LRO flags on received mbufs.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/ioring/rte_eth_ioring.c | 98 ++++++++++++++++++++++++++++-
1 file changed, 96 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ioring/rte_eth_ioring.c b/drivers/net/ioring/rte_eth_ioring.c
index 704b887d36..7c60d49b23 100644
--- a/drivers/net/ioring/rte_eth_ioring.c
+++ b/drivers/net/ioring/rte_eth_ioring.c
@@ -30,6 +30,7 @@
#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>
+#include <rte_net.h>
static_assert(RTE_PKTMBUF_HEADROOM >= sizeof(struct virtio_net_hdr));
@@ -44,7 +45,10 @@ static_assert(IORING_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd
RTE_ETH_TX_OFFLOAD_TCP_CKSUM | \
RTE_ETH_TX_OFFLOAD_TCP_TSO)
-#define IORING_RX_OFFLOAD RTE_ETH_RX_OFFLOAD_SCATTER
+#define IORING_RX_OFFLOAD (RTE_ETH_RX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_LRO | \
+ RTE_ETH_RX_OFFLOAD_SCATTER)
#define IORING_DEFAULT_IFNAME "itap%d"
#define IORING_MP_KEY "ioring_mp_send_fds"
@@ -349,10 +353,31 @@ eth_dev_stop(struct rte_eth_dev *dev)
static int
eth_dev_configure(struct rte_eth_dev *dev)
{
+ struct pmd_internals *pmd = dev->data->dev_private;
+
/* rx/tx must be paired */
if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
return -EINVAL;
+ /*
+ * Set offload flags visible on the kernel network interface.
+ * This controls whether kernel will use checksum offload etc.
+ * Note: kernel transmit is DPDK receive.
+ */
+ const struct rte_eth_rxmode *rx_mode = &dev->data->dev_conf.rxmode;
+ unsigned int offload = 0;
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
+ offload |= TUN_F_CSUM;
+
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_TCP_LRO)
+ offload |= TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_TSO_ECN;
+ }
+
+ if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, offload) != 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETOFFLOAD) failed: %s", strerror(errno));
+ return -1;
+ }
+
return 0;
}
@@ -558,7 +583,6 @@ eth_ioring_rx_alloc(struct rx_queue *rxq)
return m;
}
-
/* set length of received mbuf segments */
static inline void
eth_ioring_rx_adjust(struct rte_mbuf *mb, size_t len)
@@ -582,6 +606,69 @@ eth_ioring_rx_adjust(struct rte_mbuf *mb, size_t len)
}
}
+static int
+eth_ioring_rx_offload(struct rte_mbuf *m, const struct virtio_net_hdr *hdr)
+{
+ uint32_t ptype;
+ bool l4_supported = false;
+ struct rte_net_hdr_lens hdr_lens;
+
+ /* nothing to do */
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return 0;
+
+ m->ol_flags |= RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN;
+
+ ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+ m->packet_type = ptype;
+ if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+ l4_supported = true;
+
+ if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ uint32_t hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+ if (hdr->csum_start <= hdrlen && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ } else {
+ /* Unknown proto or tunnel, do sw cksum. */
+ uint16_t csum = 0, off;
+
+ if (rte_raw_cksum_mbuf(m, hdr->csum_start,
+ rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+ &csum) < 0)
+ return -EINVAL;
+ if (likely(csum != 0xffff))
+ csum = ~csum;
+ off = hdr->csum_offset + hdr->csum_start;
+ if (rte_pktmbuf_data_len(m) >= off + 1)
+ *rte_pktmbuf_mtod_offset(m, uint16_t *, off) = csum;
+ }
+ } else if ((hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID) && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_GOOD;
+ }
+
+ /* GSO request, save required information in mbuf */
+ if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ /* Check unsupported modes */
+ if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) || hdr->gso_size == 0)
+ return -EINVAL;
+
+ /* Update mss lengths in mbuf */
+ m->tso_segsz = hdr->gso_size;
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ m->ol_flags |= RTE_MBUF_F_RX_LRO | RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
static uint16_t
eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
@@ -626,6 +713,13 @@ eth_ioring_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
else
eth_ioring_rx_adjust(mb, len);
+ if (unlikely(eth_ioring_rx_offload(mb, hdr) < 0)) {
+ PMD_RX_LOG(ERR, "invalid rx offload");
+ ++rxq->rx_errors;
+ goto resubmit;
+ }
+
+ __rte_mbuf_sanity_check(mb, 1);
num_bytes += mb->pkt_len;
bufs[num_rx++] = mb;
--
2.47.2
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 00/10] net/rtap: add io_uring based TAP driver
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (11 preceding siblings ...)
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
@ 2026-02-09 18:38 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 01/10] net/rtap: add driver skeleton and documentation Stephen Hemminger
` (10 more replies)
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
13 siblings, 11 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:38 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This series adds net_rtap, an experimental poll mode driver that uses
Linux io_uring for asynchronous packet I/O with kernel TAP interfaces.
Like net_tap, net_rtap creates a kernel network interface visible to
standard tools (ip, ethtool) and the Linux TCP/IP stack. From DPDK
it is an ordinary ethdev.
Motivation
----------
This driver started as an experiment to determine whether Linux
io_uring could deliver better packet I/O performance than the
traditional read()/write() system calls used by net_tap. By posting
batches of I/O requests asynchronously, io_uring amortizes system
call overhead across multiple packets.
The project also served as a testbed for using AI tooling to help
build a comprehensive test suite, refactor code, and improve
documentation. The result is intended as an example for other PMD
authors: the driver has thorough unit tests covering data path,
offloads, multi-queue, fd lifecycle, and more, along with detailed
code comments explaining design choices.
Why not extend net_tap?
-----------------------
The existing net_tap driver was designed to provide feature parity
with mlx5 when used behind the failsafe PMD. That goal led to
significant complexity: rte_flow support emulated via eBPF programs,
software GSO implementation, and other features that duplicate in
user space what the kernel already does.
net_rtap takes the opposite approach: use the kernel efficiently
and let it do what it does well. There is no rte_flow support;
receive queue selection is left to the kernel's native RSS/steering.
There is no software GSO; the driver passes segmentation requests
to the kernel via the virtio-net header and lets the kernel handle
it. The result is a much simpler driver that is easier to maintain
and reason about.
Given these fundamentally different design goals, a clean
implementation was more practical than refactoring net_tap.
Acknowledgement
---------------
Parts of the test suite, code review, and refactoring were done
with the assistance of Anthropic Claude (AI). All generated code
was reviewed and tested by the author.
Requirements:
- Kernel headers with IORING_ASYNC_CANCEL_ALL (upstream since 5.19)
- liburing >= 2.0
Known working distributions: Debian 12+, Ubuntu 24.04+,
Fedora 37+, SLES 15 SP6+ / openSUSE Tumbleweed.
RHEL 9 is not supported (io_uring is disabled by default).
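A quick host check for these requirements might look like the following (it assumes pkg-config is installed; the `kernel.io_uring_disabled` sysctl only exists on newer kernels, so its absence simply means io_uring is not administratively disabled):

```shell
# liburing must be >= 2.0
pkg-config --atleast-version=2.0 liburing \
    && echo "liburing OK" || echo "liburing missing or too old"

# 0 (or sysctl absent) = io_uring allowed; 2 = fully disabled (e.g. RHEL 9 default)
sysctl -n kernel.io_uring_disabled 2>/dev/null || echo "0 (sysctl not present)"
```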
v5 - revised, renamed and expanded from the v4 ioring PMD
- more complete testing and dependency handling
Stephen Hemminger (10):
net/rtap: add driver skeleton and documentation
net/rtap: add TAP device creation and queue management
net/rtap: add Rx/Tx with scatter/gather support
net/rtap: add statistics and device info
net/rtap: add link and device management operations
net/rtap: add checksum and TSO offload support
net/rtap: add link state change interrupt
net/rtap: add multi-process support
net/rtap: add Rx interrupt support
test: add unit tests for rtap PMD
MAINTAINERS | 7 +
app/test/meson.build | 1 +
app/test/test_pmd_rtap.c | 2044 ++++++++++++++++++++++++
doc/guides/nics/features/rtap.ini | 25 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/rtap.rst | 101 ++
doc/guides/rel_notes/release_26_03.rst | 6 +
drivers/net/meson.build | 1 +
drivers/net/rtap/meson.build | 28 +
drivers/net/rtap/rtap.h | 100 ++
drivers/net/rtap/rtap_ethdev.c | 908 +++++++++++
drivers/net/rtap/rtap_intr.c | 267 ++++
drivers/net/rtap/rtap_rxtx.c | 784 +++++++++
13 files changed, 4273 insertions(+)
create mode 100644 app/test/test_pmd_rtap.c
create mode 100644 doc/guides/nics/features/rtap.ini
create mode 100644 doc/guides/nics/rtap.rst
create mode 100644 drivers/net/rtap/meson.build
create mode 100644 drivers/net/rtap/rtap.h
create mode 100644 drivers/net/rtap/rtap_ethdev.c
create mode 100644 drivers/net/rtap/rtap_intr.c
create mode 100644 drivers/net/rtap/rtap_rxtx.c
--
2.51.0
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v5 01/10] net/rtap: add driver skeleton and documentation
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 02/10] net/rtap: add TAP device creation and queue management Stephen Hemminger
` (9 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Thomas Monjalon, Anatoly Burakov
Add the initial skeleton for the rtap poll mode driver, a virtual
ethernet device that uses Linux io_uring for packet I/O with kernel
TAP devices.
This patch includes:
- MAINTAINERS entry
- Driver documentation (doc/guides/nics/rtap.rst)
- Feature matrix (doc/guides/nics/features/rtap.ini)
- Release notes update
- Meson build integration with liburing dependency
- Header file with shared data structures and declarations
- Stub probe/remove handlers that register the vdev driver
- Empty dev_ops with only dev_close implemented
The driver registers as net_rtap and is Linux-only.
It requires the liburing library version 2.0 or later.
Earlier versions have known security and build issues.
The library is available in all currently supported distributions
(Debian 12+, Ubuntu 22.04+, RHEL 9+, Fedora 35+).
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
MAINTAINERS | 7 +
doc/guides/nics/features/rtap.ini | 13 ++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/rtap.rst | 101 +++++++++++++++
doc/guides/rel_notes/release_26_03.rst | 6 +
drivers/net/meson.build | 1 +
drivers/net/rtap/meson.build | 26 ++++
drivers/net/rtap/rtap.h | 69 ++++++++++
drivers/net/rtap/rtap_ethdev.c | 172 +++++++++++++++++++++++++
9 files changed, 396 insertions(+)
create mode 100644 doc/guides/nics/features/rtap.ini
create mode 100644 doc/guides/nics/rtap.rst
create mode 100644 drivers/net/rtap/meson.build
create mode 100644 drivers/net/rtap/rtap.h
create mode 100644 drivers/net/rtap/rtap_ethdev.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 5683b87e4a..3d0877fdc7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1135,6 +1135,13 @@ F: doc/guides/nics/pcap_ring.rst
F: app/test/test_pmd_ring.c
F: app/test/test_pmd_ring_perf.c
+Rtap PMD - EXPERIMENTAL
+M: Stephen Hemminger <stephen@networkplumber.org>
+F: drivers/net/rtap/
+F: app/test/test_pmd_rtap.c
+F: doc/guides/nics/rtap.rst
+F: doc/guides/nics/features/rtap.ini
+
Null Networking PMD
M: Tetsuya Mukawa <mtetsuyah@gmail.com>
F: drivers/net/null/
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
new file mode 100644
index 0000000000..ed7c638029
--- /dev/null
+++ b/doc/guides/nics/features/rtap.ini
@@ -0,0 +1,13 @@
+;
+; Supported features of the 'rtap' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index b00ed998c5..274575fe70 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -65,6 +65,7 @@ Network Interface Controller Drivers
qede
r8169
rnp
+ rtap
sfc_efx
softnic
tap
diff --git a/doc/guides/nics/rtap.rst b/doc/guides/nics/rtap.rst
new file mode 100644
index 0000000000..1c1cb8dd58
--- /dev/null
+++ b/doc/guides/nics/rtap.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+RTAP Poll Mode Driver
+=======================
+
+The RTAP Poll Mode Driver (PMD) is similar to the TAP PMD. It is a
+virtual device that uses Linux io_uring for efficient packet I/O with
+the Linux kernel.
+It is useful when writing DPDK applications that need to support interaction
+with the Linux TCP/IP stack for control plane or tunneling.
+
+The RTAP PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the RTAP device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query for information, statistics,
+and send/receive packets.
+
+Features
+--------
+
+- Uses io_uring for asynchronous packet I/O via read/write and readv/writev
+- TX offloads: multi-segment, UDP checksum, TCP checksum, TCP segmentation (TSO)
+- RX offloads: UDP checksum, TCP checksum, TCP LRO, scatter
+- Virtio net header support for offload negotiation with the kernel
+- Multi-queue support (up to 128 queues)
+- Multi-process support (secondary processes receive queue fds from primary)
+- Link state change notification via netlink
+- Rx interrupt support for power-aware applications (eventfd per queue)
+- Promiscuous and allmulticast mode
+- MAC address configuration
+- MTU update
+- Link up/down control
+- Basic and per-queue statistics
+
+Requirements
+------------
+
+- **liburing >= 2.0**. Earlier versions have known security and build issues.
+
+- The kernel must support ``IORING_ASYNC_CANCEL_ALL`` (upstream since 5.19).
+ The meson build checks for this symbol and will not build the driver
+ if the installed kernel headers do not provide it. Because enterprise
+ distributions backport features independently of version numbers,
+ the driver avoids hard-coding a kernel version check.
+
+Known working distributions:
+
+- Debian 12 (Bookworm) or later
+- Ubuntu 24.04 (Noble) or later (22.04 with HWE kernel)
+- Fedora 37 or later
+- SUSE Linux Enterprise 15 SP6 or later / openSUSE Tumbleweed
+
+RHEL 9 ships io_uring only as a Technology Preview (disabled by default)
+and is not supported.
+
+For more info on io_uring, please see:
+
+- `io_uring on Wikipedia <https://en.wikipedia.org/wiki/Io_uring>`_
+- `liburing on GitHub <https://github.com/axboe/liburing>`_
+
+
+Arguments
+---------
+
+RTAP devices are created with the ``--vdev=net_rtap0`` command line option.
+Multiple devices can be created by repeating the option with different device names
+(``net_rtap1``, ``net_rtap2``, etc.).
+
+By default, the Linux interfaces are named ``rtap0``, ``rtap1``, etc.
+The interface name can be specified by adding the ``iface=foo0``, for example::
+
+ --vdev=net_rtap0,iface=io0 --vdev=net_rtap1,iface=io1 ...
+
+The PMD inherits the MAC address assigned by the kernel, which will be
+a locally administered random Ethernet address.
+
+Normally, the RTAP device is removed when the DPDK application exits.
+This behavior can be overridden with the ``persist`` flag, which causes
+the kernel network interface to survive application exit. Example::
+
+ --vdev=net_rtap0,iface=io0,persist ...
+
+
+Limitations
+-----------
+
+- The kernel must have io_uring support with ``IORING_ASYNC_CANCEL_ALL``
+ (upstream since 5.19, but may be backported by distributions).
+ io_uring support may also be disabled in some environments or by security policies
+ (for example, Docker disables io_uring in its default seccomp profile,
+ and RHEL 9 disables it via ``kernel.io_uring_disabled`` sysctl).
+
+- Since the RTAP device uses a kernel file descriptor for each queue,
+  the same number of queues must be specified for receive and transmit.
+
+- The maximum number of queues is 128.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst
index 031eaa657e..db5c61a15c 100644
--- a/doc/guides/rel_notes/release_26_03.rst
+++ b/doc/guides/rel_notes/release_26_03.rst
@@ -63,6 +63,12 @@ New Features
* Added support for pre and post VF reset callbacks.
+* **Added rtap virtual ethernet driver.**
+
+ Added a new experimental virtual device driver that uses Linux io_uring
+ for packet injection into the kernel network stack.
+ It requires Linux kernel 5.19 or later and liburing 2.0 or later.
+
Removed Items
-------------
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index c7dae4ad27..ef1ee68385 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -56,6 +56,7 @@ drivers = [
'r8169',
'ring',
'rnp',
+ 'rtap',
'sfc',
'softnic',
'tap',
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
new file mode 100644
index 0000000000..7bd7806ef3
--- /dev/null
+++ b/drivers/net/rtap/meson.build
@@ -0,0 +1,26 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+liburing = dependency('liburing', version: '>= 2.0', required: false)
+if not liburing.found()
+ build = false
+ reason = 'missing dependency, "liburing"'
+endif
+
+if build and not cc.has_header_symbol('linux/io_uring.h', 'IORING_ASYNC_CANCEL_ALL')
+ build = false
+ reason = 'kernel headers missing IORING_ASYNC_CANCEL_ALL (need kernel >= 5.19 headers)'
+endif
+
+sources = files(
+ 'rtap_ethdev.c',
+)
+
+ext_deps += liburing
+
+require_iova_in_mbuf = false
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
new file mode 100644
index 0000000000..507ab000f3
--- /dev/null
+++ b/drivers/net/rtap/rtap.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#ifndef _RTAP_H_
+#define _RTAP_H_
+
+#include <assert.h>
+#include <unistd.h>
+#include <net/if.h>
+#include <liburing.h>
+#include <linux/virtio_net.h>
+
+#include <ethdev_driver.h>
+#include <rte_ether.h>
+#include <rte_log.h>
+
+
+extern int rtap_logtype;
+#define RTE_LOGTYPE_RTAP rtap_logtype
+#define PMD_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s(): ", __func__, __VA_ARGS__)
+
+#define PMD_LOG_ERRNO(level, fmt, ...) \
+ RTE_LOG_LINE(level, RTAP, "%s(): " fmt ": %s", __func__, ## __VA_ARGS__, strerror(errno))
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
+struct rtap_rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted reads */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_errors;
+} __rte_cache_aligned;
+
+struct rtap_tx_queue {
+ struct io_uring io_ring;
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+} __rte_cache_aligned;
+
+struct rtap_pmd {
+ int keep_fd; /* keep alive file descriptor */
+ char ifname[IFNAMSIZ]; /* name assigned by kernel */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
new file mode 100644
index 0000000000..ee5b5bad1b
--- /dev/null
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -0,0 +1,172 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+
+#include "rtap.h"
+
+#define RTAP_DEFAULT_IFNAME "rtap%d"
+
+#define RTAP_IFACE_ARG "iface"
+#define RTAP_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ RTAP_IFACE_ARG,
+ RTAP_PERSIST_ARG,
+ NULL
+};
+
+static int
+rtap_dev_close(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing %s", pmd->ifname);
+
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ PMD_LOG(DEBUG, "Closing keep_fd %d", pmd->keep_fd);
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+ }
+
+ free(dev->process_private);
+ dev->process_private = NULL;
+
+ return 0;
+}
+
+static const struct eth_dev_ops rtap_ops = {
+ .dev_close = rtap_dev_close,
+};
+
+static int
+rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+ /* must not be null string */
+ if (value == NULL || value[0] == '\0' || strnlen(value, IFNAMSIZ) == IFNAMSIZ)
+ return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+rtap_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
+ char tap_name[IFNAMSIZ] = RTAP_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, RTAP_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, RTAP_IFACE_ARG,
+ &rtap_parse_iface, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, RTAP_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ /* Per-queue tap fds (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ goto error;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct rtap_pmd));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ eth_dev->dev_ops = &rtap_ops;
+ eth_dev->process_private = fds;
+ eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+
+ RTE_SET_USED(persist); /* used in later patches */
+
+ rte_eth_dev_probing_finish(eth_dev);
+ rte_kvargs_free(kvlist);
+ return 0;
+
+error:
+ if (eth_dev != NULL) {
+ eth_dev->process_private = NULL;
+ rte_eth_dev_release_port(eth_dev);
+ }
+ free(fds);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+rtap_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ rtap_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_rtap_drv = {
+ .probe = rtap_probe,
+ .remove = rtap_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_rtap, pmd_rtap_drv);
+RTE_PMD_REGISTER_ALIAS(net_rtap, eth_rtap);
+RTE_PMD_REGISTER_PARAM_STRING(net_rtap,
+ RTAP_IFACE_ARG "=<string> "
+ RTAP_PERSIST_ARG);
+RTE_LOG_REGISTER_DEFAULT(rtap_logtype, NOTICE);
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
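For readers wanting to try the driver, the devargs registered at the end of the
patch (iface and persist) map onto a vdev argument. A command-line fragment,
with the application name and interface name purely illustrative:

```shell
# Illustrative fragment: instantiate the rtap vdev with a fixed TAP
# interface name, keeping the device after the application exits.
dpdk-testpmd --vdev=net_rtap0,iface=rtap0,persist -- -i
```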
* [PATCH v5 02/10] net/rtap: add TAP device creation and queue management
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 01/10] net/rtap: add driver skeleton and documentation Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 03/10] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
` (8 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Anatoly Burakov
Add TAP device creation using the Linux TUN/TAP interface with
IFF_MULTI_QUEUE, IFF_NO_PI, and IFF_VNET_HDR flags. Enable NAPI
mode when the kernel supports it.
The driver maintains a keep-alive file descriptor to the TAP device
and opens additional per-queue file descriptors for data path I/O.
This mirrors the multi-queue TAP architecture where each queue pair
(rx + tx) shares a single TAP fd.
Add the rtap_create() function that:
- Opens the TAP device with configurable interface name
- Configures the virtio-net header size
- Reads the kernel-assigned MAC address
- Supports the 'persist' option to keep the interface after exit
- Detaches the keep-alive queue from data traffic
Add rtap_queue_open() and rtap_queue_close() for per-queue fd
management used during queue setup and teardown.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/rtap/rtap.h | 4 +
drivers/net/rtap/rtap_ethdev.c | 216 ++++++++++++++++++++++++++++++++-
2 files changed, 219 insertions(+), 1 deletion(-)
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 507ab000f3..39a3188a7b 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -66,4 +66,8 @@ struct rtap_pmd {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+/* rtap_ethdev.c */
+int rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id);
+void rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index ee5b5bad1b..4e7847ff8d 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -39,13 +39,140 @@ static const char * const valid_arguments[] = {
NULL
};
+/* Create a new tap device; its name is returned in ifr */
+static int
+rtap_tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG_ERRNO(ERR, "Open %s failed", tun_dev);
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNGETFEATURES): %s", tun_dev);
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI | IFF_VNET_HDR;
+ if ((features & flags) != flags) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, features & flags);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+ /* If kernel supports using NAPI enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+ * Set the device name and packet format;
+ * we do not want the protocol information (PI) header.
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNSETIFF) %s", ifr->ifr_name);
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNSETPERSIST) %s", ifr->ifr_name);
+ goto error;
+ }
+
+ int hdr_size = sizeof(struct virtio_net_hdr);
+ if (ioctl(tap_fd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETVNETHDRSZ) %s", strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+rtap_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+rtap_dev_stop(struct rte_eth_dev *dev)
+{
+ int *fds = dev->process_private;
+
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++) {
+ if (fds[i] == -1)
+ continue;
+
+ close(fds[i]);
+ fds[i] = -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+rtap_dev_configure(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
+ return -EINVAL;
+
+ if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, 0) != 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETOFFLOAD) failed: %s", strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
static int
rtap_dev_close(struct rte_eth_dev *dev)
{
struct rtap_pmd *pmd = dev->data->dev_private;
+ int *fds = dev->process_private;
PMD_LOG(INFO, "Closing %s", pmd->ifname);
+ /* Release all io_uring queues (calls rx/tx_queue_release for each) */
+ rte_eth_dev_internal_reset(dev);
+
+ /* Close any remaining queue fds (each process owns its own set) */
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++) {
+ if (fds[i] == -1)
+ continue;
+ PMD_LOG(DEBUG, "Closed queue %u fd %d", i, fds[i]);
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
/* mac_addrs must not be freed alone because part of dev_private */
dev->data->mac_addrs = NULL;
@@ -63,10 +190,96 @@ rtap_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Setup another fd to TAP device for the queue */
+int
+rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ int *fds = dev->process_private;
+
+ if (fds[queue_id] != -1) {
+ PMD_LOG(DEBUG, "queue %u already has fd %d", queue_id, fds[queue_id]);
+ return 0; /* already setup */
+ }
+
+ struct ifreq ifr = { 0 };
+ int tap_fd = rtap_tap_open(pmd->ifname, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "Opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+void
+rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+ int tap_fd = fds[queue_id];
+
+ if (tap_fd == -1)
+ return; /* already closed */
+ PMD_LOG(DEBUG, "Closed queue %u fd %d", queue_id, tap_fd);
+ close(tap_fd);
+ fds[queue_id] = -1;
+}
+
static const struct eth_dev_ops rtap_ops = {
+ .dev_start = rtap_dev_start,
+ .dev_stop = rtap_dev_stop,
+ .dev_configure = rtap_dev_configure,
.dev_close = rtap_dev_close,
};
+static int
+rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct rtap_pmd *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+
+ dev->dev_ops = &rtap_ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { 0 };
+ pmd->keep_fd = rtap_tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ PMD_LOG(DEBUG, "Created %s keep_fd %d", ifr.ifr_name, pmd->keep_fd);
+
+ /* Use the name returned by the kernel, e.g. if tap_name is rtap%d this will be rtap0 */
+ strlcpy(pmd->ifname, ifr.ifr_name, IFNAMSIZ);
+
+ /* Read the MAC address assigned by the kernel */
+ if (ioctl(pmd->keep_fd, SIOCGIFHWADDR, &ifr) < 0) {
+ PMD_LOG_ERRNO(ERR, "Unable to get MAC address for %s", ifr.ifr_name);
+ goto error;
+ }
+ memcpy(&pmd->eth_addr, &ifr.ifr_hwaddr.sa_data, RTE_ETHER_ADDR_LEN);
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG_ERRNO(ERR, "Unable to detach keep-alive queue for %s", ifr.ifr_name);
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+
+ return 0;
+
+error:
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
static int
rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_args)
{
@@ -129,7 +342,8 @@ rtap_probe(struct rte_vdev_device *vdev)
eth_dev->process_private = fds;
eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
- RTE_SET_USED(persist); /* used in later patches */
+ if (rtap_create(eth_dev, tap_name, persist) < 0)
+ goto error;
rte_eth_dev_probing_finish(eth_dev);
rte_kvargs_free(kvlist);
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 03/10] net/rtap: add Rx/Tx with scatter/gather support
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 01/10] net/rtap: add driver skeleton and documentation Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 02/10] net/rtap: add TAP device creation and queue management Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 04/10] net/rtap: add statistics and device info Stephen Hemminger
` (7 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Implement packet receive and transmit using io_uring asynchronous I/O,
with full support for both single-segment and multi-segment mbufs.
Rx path:
- rtap_rx_alloc() chains multiple mbufs when the MTU exceeds a
single mbuf's tailroom capacity
- Pre-post read/readv requests to the io_uring submission queue,
each backed by a pre-allocated (possibly chained) mbuf
- On rx_burst, harvest completed CQEs and replace each consumed
mbuf with a freshly allocated one
- rtap_rx_adjust() distributes received data across segments and
frees unused trailing segments
- Parse the prepended virtio-net header (offload fields are
ignored until the offload patch)
Tx path:
- For single-segment mbufs, use io_uring write and batch submits
- For multi-segment mbufs, use writev via io_uring with immediate
submit (iovec is stack-allocated)
- When the mbuf headroom is not writable (shared or indirect),
chain a new header mbuf for the virtio-net header
- Prepend a zeroed virtio-net header (offload population deferred)
- Clean completed tx CQEs to free transmitted mbufs
Add io_uring cancel-all logic using IORING_ASYNC_CANCEL_ALL for
clean queue teardown, draining all pending CQEs and freeing mbufs.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 13 +
drivers/net/rtap/rtap_ethdev.c | 7 +
drivers/net/rtap/rtap_rxtx.c | 755 ++++++++++++++++++++++++++++++
5 files changed, 777 insertions(+)
create mode 100644 drivers/net/rtap/rtap_rxtx.c
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index ed7c638029..c064e1e0b9 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Scattered Rx = P
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index 7bd7806ef3..8e2b15f382 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -19,6 +19,7 @@ endif
sources = files(
'rtap_ethdev.c',
+ 'rtap_rxtx.c',
)
ext_deps += liburing
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 39a3188a7b..a0bbb1a8a0 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -70,4 +70,17 @@ struct rtap_pmd {
int rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id);
void rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id);
+/* rtap_rxtx.c */
+uint16_t rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts);
+uint16_t rtap_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts);
+int rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_rx_desc, unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf,
+ struct rte_mempool *mb_pool);
+void rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf);
+void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 4e7847ff8d..a65a8b77ad 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -232,6 +232,10 @@ static const struct eth_dev_ops rtap_ops = {
.dev_stop = rtap_dev_stop,
.dev_configure = rtap_dev_configure,
.dev_close = rtap_dev_close,
+ .rx_queue_setup = rtap_rx_queue_setup,
+ .rx_queue_release = rtap_rx_queue_release,
+ .tx_queue_setup = rtap_tx_queue_setup,
+ .tx_queue_release = rtap_tx_queue_release,
};
static int
@@ -272,6 +276,9 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
PMD_LOG(DEBUG, "%s setup", ifr.ifr_name);
+ dev->rx_pkt_burst = rtap_rx_burst;
+ dev->tx_pkt_burst = rtap_tx_burst;
+
return 0;
error:
diff --git a/drivers/net/rtap/rtap_rxtx.c b/drivers/net/rtap/rtap_rxtx.c
new file mode 100644
index 0000000000..c972ab4ca0
--- /dev/null
+++ b/drivers/net/rtap/rtap_rxtx.c
@@ -0,0 +1,755 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <liburing.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include <rte_common.h>
+#include <rte_ethdev.h>
+#include <rte_ip.h>
+#include <rte_mbuf.h>
+#include <rte_net.h>
+#include <rte_malloc.h>
+
+#include "rtap.h"
+
+/*
+ * Since the virtio net header is prepended to the mbuf data,
+ * the DPDK configuration must ensure that mbuf pools are
+ * created with enough headroom for it.
+ */
+static_assert(RTE_PKTMBUF_HEADROOM >= sizeof(struct virtio_net_hdr),
+ "Pktmbuf headroom not big enough for virtio header");
+
+
+/* Get the per-process file descriptor used for transmit and receive */
+static inline int
+rtap_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+ int fd = fds[queue_id];
+
+ RTE_ASSERT(fd != -1);
+ return fd;
+}
+
+/*
+ * Add to submit queue a read of mbuf data.
+ * For multi-segment mbuf's requires readv().
+ * Return:
+ * -ENOSPC : no submit queue element available.
+ * 1 : readv was used and no io_uring_submit was done.
+ * 0 : regular read submitted, caller should call io_uring_submit
+ * later to batch.
+ */
+static inline int
+rtap_rx_submit(struct rtap_rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ struct iovec iovs[RTE_MBUF_MAX_NB_SEGS];
+
+ if (unlikely(sqe == NULL))
+ return -ENOSPC;
+
+ io_uring_sqe_set_data(sqe, mb);
+
+ RTE_ASSERT(rte_pktmbuf_headroom(mb) >= sizeof(struct virtio_net_hdr));
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(struct virtio_net_hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + rte_pktmbuf_tailroom(mb);
+
+ /* optimize for the case where packet fits in one mbuf */
+ if (mb->nb_segs == 1) {
+ io_uring_prep_read(sqe, fd, buf, nbytes, 0);
+ /* caller will submit as batch */
+ return 0;
+ } else {
+ uint16_t nsegs = mb->nb_segs;
+ RTE_ASSERT(nsegs > 0 && nsegs < IOV_MAX);
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (uint16_t i = 1; i < nsegs; i++) {
+ mb = mb->next;
+ iovs[i].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[i].iov_len = rte_pktmbuf_tailroom(mb);
+ }
+ io_uring_prep_readv(sqe, fd, iovs, nsegs, 0);
+
+ /*
+ * For readv, need to submit now since iovs[] must be
+ * valid until submitted.
+ * io_uring_submit(3) returns the number of submitted submission
+ * queue entries (on failure returns -errno).
+ */
+ return io_uring_submit(&rxq->io_ring);
+ }
+}
+
+/* Allocate one or more mbufs to be used for reading packets */
+static struct rte_mbuf *
+rtap_rx_alloc(struct rtap_rx_queue *rxq)
+{
+ const struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+ int buf_size = dev->data->mtu + RTE_ETHER_HDR_LEN;
+ struct rte_mbuf *m = NULL;
+ struct rte_mbuf **tail = &m;
+
+ do {
+ struct rte_mbuf *seg = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(seg == NULL)) {
+ rte_pktmbuf_free(m);
+ return NULL;
+ }
+ *tail = seg;
+ tail = &seg->next;
+ if (seg != m)
+ ++m->nb_segs;
+
+ buf_size -= rte_pktmbuf_tailroom(seg);
+ } while (buf_size > 0);
+
+ __rte_mbuf_sanity_check(m, 1);
+ return m;
+}
+
+/*
+ * When receiving into a multi-segment mbuf, the length of
+ * each segment must be adjusted to match the received data.
+ */
+static inline int
+rtap_rx_adjust(struct rte_mbuf *mb, uint32_t len)
+{
+ struct rte_mbuf *seg;
+ uint16_t count = 0;
+
+ mb->pkt_len = len;
+
+ /* Walk through mbuf chain and update the length of each segment */
+ for (seg = mb; seg != NULL && len > 0; seg = seg->next) {
+ uint16_t seg_len = RTE_MIN(len, rte_pktmbuf_tailroom(seg));
+
+ seg->data_len = seg_len;
+ count++;
+ len -= seg_len;
+
+ /* If length is zero, this is end of packet */
+ if (len == 0) {
+ /* Drop unused tail segments */
+ if (seg->next != NULL) {
+ struct rte_mbuf *tail = seg->next;
+ seg->next = NULL;
+
+ /* Free segments one by one to avoid nb_segs issues */
+ while (tail != NULL) {
+ struct rte_mbuf *next = tail->next;
+ rte_pktmbuf_free_seg(tail);
+ tail = next;
+ }
+ }
+
+ mb->nb_segs = count;
+ return 0;
+ }
+ }
+
+ /* Packet was truncated - not enough mbuf space */
+ return -1;
+}
+
+/*
+ * Set the receive offload flags of received mbuf
+ * based on the bits in the virtio network header
+ */
+static int
+rtap_rx_offload(struct rte_mbuf *m, const struct virtio_net_hdr *hdr)
+{
+ uint32_t ptype;
+ bool l4_supported = false;
+ struct rte_net_hdr_lens hdr_lens;
+
+ /* nothing to do */
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return 0;
+
+ m->ol_flags |= RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN;
+
+ ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+ m->packet_type = ptype;
+ if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+ l4_supported = true;
+
+ if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ uint32_t hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+ if (hdr->csum_start <= hdrlen && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ } else {
+ /* Unknown proto or tunnel, do sw cksum. */
+ uint16_t csum = 0;
+
+ if (rte_raw_cksum_mbuf(m, hdr->csum_start,
+ rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+ &csum) < 0)
+ return -EINVAL;
+ if (likely(csum != 0xffff))
+ csum = ~csum;
+
+ uint32_t off = (uint32_t)hdr->csum_offset + hdr->csum_start;
+ if (rte_pktmbuf_data_len(m) >= off + sizeof(uint16_t))
+ *rte_pktmbuf_mtod_offset(m, uint16_t *, off) = csum;
+ }
+ } else if ((hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID) && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_GOOD;
+ }
+
+ /* GSO request, save required information in mbuf */
+ if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ /* Check unsupported modes */
+ if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) || hdr->gso_size == 0)
+ return -EINVAL;
+
+ /* Update mss lengths in mbuf */
+ m->tso_segsz = hdr->gso_size;
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ m->ol_flags |= RTE_MBUF_F_RX_LRO | RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+uint16_t
+rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rtap_rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0, num_sqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = rtap_queue_fd(rxq->port_id, rxq->queue_id);
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ struct rte_mbuf *nmb = NULL;
+ struct virtio_net_hdr *hdr = NULL;
+ ssize_t len = cqe->res;
+
+ PMD_RX_LOG(DEBUG, "complete m=%p len=%zd", mb, len);
+
+ num_cqe++;
+
+ if (unlikely(len < (ssize_t)(sizeof(*hdr) + RTE_ETHER_HDR_LEN))) {
+ if (len < 0)
+ PMD_RX_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_RX_LOG(ERR, "io_uring_read len %zd", len);
+ rxq->rx_errors++;
+ nmb = mb; /* reuse the original mbuf for resubmit */
+ goto resubmit;
+ }
+
+ /* virtio header is before packet data */
+ hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ len -= sizeof(*hdr);
+
+ /* Replacement mbuf for resubmitting */
+ nmb = rtap_rx_alloc(rxq);
+ if (unlikely(nmb == NULL)) {
+ struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+
+ PMD_RX_LOG(ERR, "Rx mbuf alloc failed");
+ dev->data->rx_mbuf_alloc_failed++;
+
+ nmb = mb; /* Reuse original */
+ goto resubmit;
+ }
+
+ if (mb->nb_segs == 1) {
+ mb->data_len = len;
+ mb->pkt_len = len;
+ } else {
+ if (unlikely(rtap_rx_adjust(mb, len) < 0)) {
+ PMD_RX_LOG(ERR, "packet truncated: pkt_len=%u exceeds mbuf capacity",
+ mb->pkt_len);
+ ++rxq->rx_errors;
+ rte_pktmbuf_free(mb);
+ goto resubmit;
+ }
+ }
+
+ if (unlikely(rtap_rx_offload(mb, hdr) < 0)) {
+ PMD_RX_LOG(ERR, "invalid rx offload");
+ ++rxq->rx_errors;
+ rte_pktmbuf_free(mb);
+ goto resubmit;
+ }
+
+ mb->port = rxq->port_id;
+
+ __rte_mbuf_sanity_check(mb, 1);
+ num_bytes += mb->pkt_len;
+ bufs[num_rx++] = mb;
+
+resubmit:
+ /* Submit the replacement mbuf */
+ int n = rtap_rx_submit(rxq, fd, nmb);
+ if (unlikely(n < 0)) {
+ /* Hope that later Rx can recover */
+ PMD_RX_LOG(ERR, "io_uring no Rx sqe: %s", strerror(-n));
+ rxq->rx_errors++;
+ rte_pktmbuf_free(nmb);
+ break;
+ }
+
+ /* If using readv() then n > 0 and all sqe's have been queued. */
+ if (n > 0)
+ num_sqe = 0;
+ else
+ ++num_sqe;
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ if (num_cqe > 0)
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ if (num_sqe > 0) {
+ int n = io_uring_submit(&rxq->io_ring);
+ if (unlikely(n < 0)) {
+ PMD_LOG(ERR, "Rx io_uring submit failed: %s", strerror(-n));
+ } else if (unlikely(n != (int)num_sqe)) {
+ PMD_RX_LOG(NOTICE, "Rx io_uring %d of %u resubmitted", n, num_sqe);
+ }
+ }
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+
+ return num_rx;
+}
+
+int
+rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct rte_mbuf **mbufs = NULL;
+ unsigned int nsqe = 0;
+ int fd = -1;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ struct rtap_rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ goto error_rxq_free;
+ }
+
+ mbufs = calloc(nb_rx_desc, sizeof(struct rte_mbuf *));
+ if (mbufs == NULL) {
+ PMD_LOG(ERR, "Rx mbuf pointer alloc failed");
+ goto error_iouring_exit;
+ }
+
+ /* open the shared tap fd; it may already be set up */
+ if (rtap_queue_open(dev, queue_id) < 0)
+ goto error_bulk_free;
+
+ fd = rtap_queue_fd(rxq->port_id, rxq->queue_id);
+
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ mbufs[i] = rtap_rx_alloc(rxq);
+ if (mbufs[i] == NULL) {
+ PMD_LOG(ERR, "Rx mbuf alloc buf failed");
+ goto error_bulk_free;
+ }
+
+ int n = rtap_rx_submit(rxq, fd, mbufs[i]);
+ if (n < 0) {
+ PMD_LOG(ERR, "rtap_rx_submit failed: %s", strerror(-n));
+ goto error_bulk_free;
+ }
+
+ /* If using readv() then n > 0 and all sqe's have been queued. */
+ if (n > 0)
+ nsqe = 0;
+ else
+ ++nsqe;
+ }
+
+ if (nsqe > 0) {
+ int n = io_uring_submit(&rxq->io_ring);
+ if (n < 0) {
+ PMD_LOG(ERR, "Rx io_uring submit failed: %s", strerror(-n));
+ goto error_bulk_free;
+ }
+ if (n < (int)nsqe)
+ PMD_LOG(NOTICE, "Rx io_uring partial submit %d of %u", n, nb_rx_desc);
+ }
+
+ free(mbufs);
+ return 0;
+
+error_bulk_free:
+ /* can't use bulk free here because some of mbufs[] may be NULL */
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ if (mbufs[i] != NULL)
+ rte_pktmbuf_free(mbufs[i]);
+ }
+ rtap_queue_close(dev, queue_id);
+ free(mbufs);
+error_iouring_exit:
+ io_uring_queue_exit(&rxq->io_ring);
+error_rxq_free:
+ rte_free(rxq);
+ return -1;
+}
+
+/*
+ * Cancel all pending io_uring operations and drain completions.
+ * Uses IORING_ASYNC_CANCEL_ALL to cancel all operations at once.
+ * Returns the number of mbufs freed.
+ */
+static unsigned int
+rtap_cancel_all(struct io_uring *ring)
+{
+ struct io_uring_cqe *cqe;
+ struct io_uring_sqe *sqe;
+ unsigned int head, num_freed = 0;
+ unsigned int ready;
+ int ret;
+
+ /* Cancel all pending operations using CANCEL_ALL flag */
+ sqe = io_uring_get_sqe(ring);
+ if (sqe != NULL) {
+ /* IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_ANY cancels all ops */
+ io_uring_prep_cancel(sqe, NULL,
+ IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_ANY);
+ io_uring_sqe_set_data(sqe, NULL);
+ ret = io_uring_submit(ring);
+ if (ret < 0)
+ PMD_LOG(ERR, "cancel submit failed: %s", strerror(-ret));
+ }
+
+ /*
+ * One blocking wait to let the kernel deliver the cancel CQE
+ * and the CQEs for all cancelled operations.
+ */
+ io_uring_submit_and_wait(ring, 1);
+
+ /*
+ * Drain all CQEs non-blocking. Cancellation of many pending
+ * operations may produce CQEs in waves; keep polling until the
+ * CQ is empty.
+ */
+ for (unsigned int retries = 0; retries < 10; retries++) {
+ ready = io_uring_cq_ready(ring);
+ if (ready == 0)
+ break;
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ /* Skip the cancel operation's own CQE (user_data = NULL) */
+ if (mb != NULL) {
+ rte_pktmbuf_free(mb);
+ ++num_freed;
+ }
+ }
+
+ /* Advance past all processed CQEs */
+ io_uring_cq_advance(ring, ready);
+ }
+
+ return num_freed;
+}
+
+void
+rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ if (rxq == NULL)
+ return;
+
+ rtap_cancel_all(&rxq->io_ring);
+ io_uring_queue_exit(&rxq->io_ring);
+
+ rte_free(rxq);
+
+ /* Close the shared TAP fd if the tx queue is already gone */
+ if (queue_id >= dev->data->nb_tx_queues ||
+ dev->data->tx_queues[queue_id] == NULL)
+ rtap_queue_close(dev, queue_id);
+}
+
+int
+rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+ /* Open the shared tap fd; it may already be set up */
+ if (rtap_queue_open(dev, queue_id) < 0)
+ return -1;
+
+ struct rtap_tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ rte_free(txq);
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+rtap_tx_cleanup(struct rtap_tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ unsigned int num_cqe = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ ++num_cqe;
+
+ /* Skip CQEs with NULL user_data (e.g., cancel operations) */
+ if (mb == NULL)
+ continue;
+
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ txq->tx_errors += (cqe->res < 0);
+ rte_pktmbuf_free(mb);
+ }
+ io_uring_cq_advance(&txq->io_ring, num_cqe);
+}
+
+void
+rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ if (txq == NULL)
+ return;
+
+ /* First drain any completed TX operations */
+ rtap_tx_cleanup(txq);
+
+ /* Cancel all remaining pending operations and free mbufs */
+ rtap_cancel_all(&txq->io_ring);
+ io_uring_queue_exit(&txq->io_ring);
+
+ rte_free(txq);
+
+ /* Close the shared TAP fd if the rx queue is already gone */
+ if (queue_id >= dev->data->nb_rx_queues ||
+ dev->data->rx_queues[queue_id] == NULL)
+ rtap_queue_close(dev, queue_id);
+}
+
+/* Convert mbuf offload flags to virtio net header */
+static void
+rtap_tx_offload(struct virtio_net_hdr *hdr, const struct rte_mbuf *m)
+{
+ uint64_t csum_l4 = m->ol_flags & RTE_MBUF_F_TX_L4_MASK;
+ uint16_t o_l23_len = (m->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK) ?
+ m->outer_l2_len + m->outer_l3_len : 0;
+
+ memset(hdr, 0, sizeof(*hdr));
+
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ csum_l4 |= RTE_MBUF_F_TX_TCP_CKSUM;
+
+ switch (csum_l4) {
+ case RTE_MBUF_F_TX_UDP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_udp_hdr, dgram_cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+
+ case RTE_MBUF_F_TX_TCP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+ }
+
+ /* TCP Segmentation Offload */
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ hdr->gso_type = (m->ol_flags & RTE_MBUF_F_TX_IPV6) ?
+ VIRTIO_NET_HDR_GSO_TCPV6 :
+ VIRTIO_NET_HDR_GSO_TCPV4;
+ hdr->gso_size = m->tso_segsz;
+ hdr->hdr_len = o_l23_len + m->l2_len + m->l3_len + m->l4_len;
+ }
+}
+
+/*
+ * Transmit burst posts mbufs to the io_uring TAP file descriptor
+ * by creating submission queue elements with a write operation.
+ *
+ * The driver mimics the behavior of a real hardware NIC.
+ *
+ * If there is no space left in the io_uring, the driver returns the number of
+ * mbufs that were processed up to that point. The application can then decide
+ * to retry later or drop the unsent packets in case of backpressure.
+ *
+ * The transmit process puts the virtio header before the data. In some cases, a
+ * new mbuf is required from the same pool as the original; if that fails, the
+ * packet is not sent and is silently dropped. This avoids transmit getting
+ * stuck when the pool is so small that its resources run out.
+ */
+uint16_t
+rtap_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rtap_tx_queue *txq = queue;
+ uint16_t i, num_tx = 0;
+ uint32_t num_tx_bytes = 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ if (io_uring_sq_space_left(&txq->io_ring) < RTE_MAX(txq->free_thresh, nb_pkts))
+ rtap_tx_cleanup(txq);
+
+ int fd = rtap_queue_fd(txq->port_id, txq->queue_id);
+
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *mb = bufs[i];
+ struct virtio_net_hdr *hdr;
+
+ /* Use packet headroom for the virtio header (if possible) */
+ if (rte_mbuf_refcnt_read(mb) == 1 && RTE_MBUF_DIRECT(mb) &&
+ rte_pktmbuf_headroom(mb) >= sizeof(*hdr)) {
+ hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ } else {
+ /* Need to chain a new mbuf to make room for virtio header */
+ struct rte_mbuf *mh = rte_pktmbuf_alloc(mb->pool);
+ if (unlikely(mh == NULL)) {
+ PMD_TX_LOG(DEBUG, "mbuf pool exhausted on transmit");
+ rte_pktmbuf_free(mb);
+ ++txq->tx_errors;
+ continue;
+ }
+
+ /* The packet headroom should be available in the newly allocated mbuf */
+ RTE_ASSERT(rte_pktmbuf_headroom(mh) >= sizeof(*hdr));
+
+ hdr = rte_pktmbuf_mtod_offset(mh, struct virtio_net_hdr *, -sizeof(*hdr));
+ mh->next = mb;
+ mh->nb_segs = mb->nb_segs + 1;
+ mh->pkt_len = mb->pkt_len;
+ mh->ol_flags = mb->ol_flags & RTE_MBUF_F_TX_OFFLOAD_MASK;
+ mb = mh;
+ }
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ /* Drop header mbuf if it was used */
+ if (mb != bufs[i])
+ rte_pktmbuf_free_seg(mb);
+ break; /* submit ring is full */
+ }
+
+ /* Note: the transmit byte count does not include the virtio header */
+ num_tx_bytes += mb->pkt_len;
+
+ io_uring_sqe_set_data(sqe, mb);
+ rtap_tx_offload(hdr, mb);
+
+ PMD_TX_LOG(DEBUG, "write m=%p segs=%u", mb, mb->nb_segs);
+
+ /* Start of data written to kernel includes virtio net header */
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(*hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + mb->data_len;
+
+ if (mb->nb_segs == 1) {
+ /* A single-segment mbuf goes as a plain write and can be batched */
+ io_uring_prep_write(sqe, fd, buf, nbytes, 0);
+ ++num_tx;
+ } else {
+ /* Multi-segment mbuf needs scatter/gather */
+ struct iovec iovs[RTE_MBUF_MAX_NB_SEGS + 1];
+ unsigned int niov = mb->nb_segs;
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (unsigned int v = 1; v < niov; v++) {
+ mb = mb->next;
+ iovs[v].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[v].iov_len = mb->data_len;
+ }
+
+ io_uring_prep_writev(sqe, fd, iovs, niov, 0);
+
+ /*
+ * For writev, submit now since iovs[] is on the stack
+ * and must remain valid until submitted.
+ * This also submits any previously batched single-seg writes.
+ */
+ int err = io_uring_submit(&txq->io_ring);
+ if (unlikely(err < 0)) {
+ PMD_TX_LOG(ERR, "Tx io_uring submit failed: %s", strerror(-err));
+ ++txq->tx_errors;
+ }
+
+ num_tx = 0;
+ }
+ }
+
+ if (likely(num_tx > 0)) {
+ int err = io_uring_submit(&txq->io_ring);
+ if (unlikely(err < 0)) {
+ PMD_LOG(ERR, "Tx io_uring submit failed: %s", strerror(-err));
+ ++txq->tx_errors;
+ }
+ }
+
+ txq->tx_packets += i;
+ txq->tx_bytes += num_tx_bytes;
+
+ return i;
+}
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 04/10] net/rtap: add statistics and device info
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (2 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 03/10] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 05/10] net/rtap: add link and device management operations Stephen Hemminger
` (6 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Implement basic and per-queue statistics collection and the device
info query.
Stats tracked per rx queue: packets, bytes, errors, and missed.
Stats tracked per tx queue: packets, bytes, errors.
Device info reports:
- Interface index from the kernel TAP device
- Maximum queue counts
- Default burst size and ring size
Enable AUTOFILL_QUEUE_XSTATS for automatic per-queue extended
statistics.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 2 +
drivers/net/rtap/rtap.h | 2 +
drivers/net/rtap/rtap_ethdev.c | 139 ++++++++++++++++++++++++++++++
3 files changed, 143 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index c064e1e0b9..9bef9e341d 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -5,6 +5,8 @@
;
[Features]
Scattered Rx = P
+Basic stats = Y
+Stats per queue = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index a0bbb1a8a0..a93d7dae17 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -64,6 +64,8 @@ struct rtap_pmd {
int keep_fd; /* keep alive file descriptor */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
+
+ uint64_t rx_drop_base; /* value of rx_dropped when reset */
};
/* rtap_ethdev.c */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index a65a8b77ad..a8481d9864 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -7,6 +7,7 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <inttypes.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
@@ -30,6 +31,14 @@
#define RTAP_DEFAULT_IFNAME "rtap%d"
+#define RTAP_DEFAULT_BURST 64
+#define RTAP_NUM_BUFFERS 1024
+#define RTAP_MAX_QUEUES 128
+#define RTAP_MIN_RX_BUFSIZE RTE_ETHER_MIN_LEN
+#define RTAP_MAX_RX_PKTLEN RTE_ETHER_MAX_JUMBO_FRAME_LEN
+
+static_assert(RTAP_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define RTAP_IFACE_ARG "iface"
#define RTAP_PERSIST_ARG "persist"
@@ -153,6 +162,132 @@ rtap_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+static int
+rtap_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ dev_info->if_index = if_nametoindex(pmd->ifname);
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTAP_MAX_RX_PKTLEN;
+ dev_info->min_rx_bufsize = RTAP_MIN_RX_BUFSIZE;
+ dev_info->max_rx_queues = RTAP_MAX_QUEUES;
+ dev_info->max_tx_queues = RTAP_MAX_QUEUES;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = RTAP_DEFAULT_BURST,
+ .ring_size = RTAP_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
+ dev_info->default_txportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = RTAP_DEFAULT_BURST,
+ .ring_size = RTAP_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
+ return 0;
+}
+
+/* Use sysfs to ask the kernel how many packets were dropped before reaching the interface */
+static int
+rtap_get_rx_dropped(const char *ifname, uint64_t *rx_dropped)
+{
+ char path[256];
+
+ snprintf(path, sizeof(path), "/sys/class/net/%s/statistics/rx_dropped",
+ ifname);
+
+ FILE *f = fopen(path, "r");
+ if (f == NULL) {
+ PMD_LOG_ERRNO(NOTICE, "open %s failed", path);
+ return -errno;
+ }
+
+ if (fscanf(f, "%"SCNu64, rx_dropped) != 1) {
+ PMD_LOG(NOTICE, "parse of rx_dropped failed");
+ fclose(f);
+ return -EIO;
+ }
+
+ fclose(f);
+ return 0;
+}
+
+static int
+rtap_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats,
+ struct eth_queue_stats *qstats)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ uint16_t i;
+
+ for (i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+ if (rxq == NULL)
+ continue;
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+
+ if (qstats != NULL && i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ qstats->q_ipackets[i] = rxq->rx_packets;
+ qstats->q_ibytes[i] = rxq->rx_bytes;
+ qstats->q_errors[i] = rxq->rx_errors;
+ }
+ }
+
+ for (i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[i];
+ if (txq == NULL)
+ continue;
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (qstats != NULL && i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ qstats->q_opackets[i] = txq->tx_packets;
+ qstats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ uint64_t rx_dropped = 0;
+ if (rtap_get_rx_dropped(pmd->ifname, &rx_dropped) == 0 &&
+ rx_dropped > pmd->rx_drop_base)
+ stats->imissed = rx_dropped - pmd->rx_drop_base;
+
+ return 0;
+}
+
+static int
+rtap_stats_reset(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ uint16_t i;
+
+ for (i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+ if (rxq == NULL)
+ continue;
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[i];
+ if (txq == NULL)
+ continue;
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+
+ rtap_get_rx_dropped(pmd->ifname, &pmd->rx_drop_base);
+ return 0;
+}
+
static int
rtap_dev_close(struct rte_eth_dev *dev)
{
@@ -231,7 +366,10 @@ static const struct eth_dev_ops rtap_ops = {
.dev_start = rtap_dev_start,
.dev_stop = rtap_dev_stop,
.dev_configure = rtap_dev_configure,
+ .dev_infos_get = rtap_dev_info,
.dev_close = rtap_dev_close,
+ .stats_get = rtap_stats_get,
+ .stats_reset = rtap_stats_reset,
.rx_queue_setup = rtap_rx_queue_setup,
.rx_queue_release = rtap_rx_queue_release,
.tx_queue_setup = rtap_tx_queue_setup,
@@ -245,6 +383,7 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
struct rtap_pmd *pmd = data->dev_private;
pmd->keep_fd = -1;
+ pmd->rx_drop_base = 0;
dev->dev_ops = &rtap_ops;
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 05/10] net/rtap: add link and device management operations
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (3 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 04/10] net/rtap: add statistics and device info Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 06/10] net/rtap: add checksum and TSO offload support Stephen Hemminger
` (5 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add device management operations for the TAP interface:
- Link up/down control via SIOCSIFFLAGS
- Link status query via SIOCGIFFLAGS
- Promiscuous mode enable/disable
- All-multicast mode enable/disable
- MTU configuration via SIOCSIFMTU
- MAC address configuration via SIOCSIFHWADDR
These operations manipulate the kernel TAP interface flags and
parameters through ioctl calls.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 4 +
drivers/net/rtap/rtap.h | 1 +
drivers/net/rtap/rtap_ethdev.c | 172 ++++++++++++++++++++++++++++++
3 files changed, 177 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index 9bef9e341d..ce0804d795 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -4,6 +4,10 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Scattered Rx = P
Basic stats = Y
Stats per queue = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index a93d7dae17..99f413f001 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -71,6 +71,7 @@ struct rtap_pmd {
/* rtap_ethdev.c */
int rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id);
void rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_link_update(struct rte_eth_dev *dev, int wait_to_complete);
/* rtap_rxtx.c */
uint16_t rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts);
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index a8481d9864..591fe91eac 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -108,9 +108,170 @@ rtap_tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+static int
+rtap_change_flags(struct rte_eth_dev *dev, uint32_t flags, uint32_t mask)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { 0 };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret < 0) {
+ PMD_LOG_ERRNO(ERR, "Unable to get flags for %s", ifr.ifr_name);
+ goto error;
+ }
+
+ /* NB: ifr.ifr_flags is type short */
+ ifr.ifr_flags &= mask;
+ ifr.ifr_flags |= flags;
+
+ ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
+ if (ret < 0)
+ PMD_LOG_ERRNO(ERR, "Unable to set flags for %s", ifr.ifr_name);
+error:
+ close(sock);
+ return (ret < 0) ? -errno : 0;
+}
+
+static int
+rtap_get_flags(struct rte_eth_dev *dev, short *flags)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0) {
+ PMD_LOG_ERRNO(ERR, "socket failed");
+ return -1;
+ }
+
+ struct ifreq ifr = { 0 };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCGIFFLAGS, &ifr);
+ if (ret == 0)
+ *flags = ifr.ifr_flags;
+ else
+ PMD_LOG_ERRNO(ERR, "ioctl(SIOCGIFFLAGS)");
+ close(sock);
+ return ret;
+}
+
+static int
+rtap_set_link_up(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, IFF_UP, (uint16_t)~0);
+}
+
+static int
+rtap_set_link_down(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, 0, ~IFF_UP);
+}
+
+static int
+rtap_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, IFF_PROMISC, ~0);
+}
+
+static int
+rtap_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, 0, ~IFF_PROMISC);
+}
+
+static int
+rtap_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, IFF_ALLMULTI, ~0);
+}
+
+static int
+rtap_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ return rtap_change_flags(dev, 0, ~IFF_ALLMULTI);
+}
+
+int
+rtap_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rte_eth_link link = {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ .link_status = RTE_ETH_LINK_DOWN,
+ };
+ short flags = 0;
+
+ if (rtap_get_flags(dev, &flags) < 0)
+ return -1;
+
+ if (flags & IFF_UP)
+ link.link_status = RTE_ETH_LINK_UP;
+
+ rte_eth_linkstatus_set(dev, &link);
+ return 0;
+}
+
+static int
+rtap_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { .ifr_mtu = mtu };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+
+ int ret = ioctl(sock, SIOCSIFMTU, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFMTU) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+ close(sock);
+
+ return ret;
+}
+
+static int
+rtap_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ int sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -errno;
+
+ struct ifreq ifr = { 0 };
+ strlcpy(ifr.ifr_name, pmd->ifname, IFNAMSIZ);
+ ifr.ifr_hwaddr.sa_family = ARPHRD_ETHER;
+ memcpy(ifr.ifr_hwaddr.sa_data, addr, sizeof(*addr));
+
+ int ret = ioctl(sock, SIOCSIFHWADDR, &ifr);
+ if (ret < 0) {
+ PMD_LOG(ERR, "ioctl(SIOCSIFHWADDR) failed: %s", strerror(errno));
+ ret = -errno;
+ }
+ close(sock);
+
+ return ret;
+}
+
static int
rtap_dev_start(struct rte_eth_dev *dev)
{
+ int ret = rtap_set_link_up(dev);
+
+ if (ret != 0)
+ return ret;
+
dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
@@ -127,6 +288,8 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_set_link_down(dev);
+
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
@@ -368,6 +531,15 @@ static const struct eth_dev_ops rtap_ops = {
.dev_configure = rtap_dev_configure,
.dev_infos_get = rtap_dev_info,
.dev_close = rtap_dev_close,
+ .link_update = rtap_link_update,
+ .dev_set_link_up = rtap_set_link_up,
+ .dev_set_link_down = rtap_set_link_down,
+ .mac_addr_set = rtap_macaddr_set,
+ .mtu_set = rtap_mtu_set,
+ .promiscuous_enable = rtap_promiscuous_enable,
+ .promiscuous_disable = rtap_promiscuous_disable,
+ .allmulticast_enable = rtap_allmulticast_enable,
+ .allmulticast_disable = rtap_allmulticast_disable,
.stats_get = rtap_stats_get,
.stats_reset = rtap_stats_reset,
.rx_queue_setup = rtap_rx_queue_setup,
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 06/10] net/rtap: add checksum and TSO offload support
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (4 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 05/10] net/rtap: add link and device management operations Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 07/10] net/rtap: add link state change interrupt Stephen Hemminger
` (4 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add transmit and receive offload support using the virtio-net header
that is exchanged with the kernel TAP device.
Tx offloads:
- UDP and TCP checksum offload: set VIRTIO_NET_HDR_F_NEEDS_CSUM
with appropriate csum_start and csum_offset fields
- TCP segmentation offload (TSO): set gso_type and gso_size
Rx offloads:
- Parse virtio-net header flags to determine checksum status
- Handle NEEDS_CSUM by either marking L4_CKSUM_NONE for known
protocols or computing the checksum in software for tunnels
- Handle DATA_VALID flag to mark L4_CKSUM_GOOD
- Parse GSO headers to populate LRO metadata
Configure kernel-side offloads via TUNSETOFFLOAD in dev_configure
based on the requested rx offload flags.
Report offload capabilities in dev_info.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 2 ++
drivers/net/rtap/rtap_ethdev.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index ce0804d795..b8eaa805fe 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -11,6 +11,8 @@ Allmulticast mode = Y
Scattered Rx = P
Basic stats = Y
Stats per queue = Y
+TSO = Y
+L4 checksum offload = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 591fe91eac..277a280772 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -31,6 +31,16 @@
#define RTAP_DEFAULT_IFNAME "rtap%d"
+#define RTAP_TX_OFFLOAD (RTE_ETH_TX_OFFLOAD_MULTI_SEGS | \
+ RTE_ETH_TX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_TSO)
+
+#define RTAP_RX_OFFLOAD (RTE_ETH_RX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_LRO | \
+ RTE_ETH_RX_OFFLOAD_SCATTER)
+
#define RTAP_DEFAULT_BURST 64
#define RTAP_NUM_BUFFERS 1024
#define RTAP_MAX_QUEUES 128
@@ -317,7 +327,21 @@ rtap_dev_configure(struct rte_eth_dev *dev)
if (dev->data->nb_rx_queues != dev->data->nb_tx_queues)
return -EINVAL;
- if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, 0) != 0) {
+ /*
+ * Set offload flags visible on the kernel network interface.
+ * This controls whether the kernel will use checksum offload, etc.
+ * Note: kernel transmit is DPDK receive.
+ */
+ const struct rte_eth_rxmode *rx_mode = &dev->data->dev_conf.rxmode;
+ unsigned int offload = 0;
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
+ offload |= TUN_F_CSUM;
+
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_TCP_LRO)
+ offload |= TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_TSO_ECN;
+ }
+
+ if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, offload) != 0) {
PMD_LOG(ERR, "ioctl(TUNSETOFFLOAD) failed: %s", strerror(errno));
return -1;
}
@@ -336,6 +360,10 @@ rtap_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->min_rx_bufsize = RTAP_MIN_RX_BUFSIZE;
dev_info->max_rx_queues = RTAP_MAX_QUEUES;
dev_info->max_tx_queues = RTAP_MAX_QUEUES;
+ dev_info->rx_queue_offload_capa = RTAP_RX_OFFLOAD;
+ dev_info->rx_offload_capa = dev_info->rx_queue_offload_capa;
+ dev_info->tx_queue_offload_capa = RTAP_TX_OFFLOAD;
+ dev_info->tx_offload_capa = dev_info->tx_queue_offload_capa;
dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
.burst_size = RTAP_DEFAULT_BURST,
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 07/10] net/rtap: add link state change interrupt
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (5 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 06/10] net/rtap: add checksum and TSO offload support Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 08/10] net/rtap: add multi-process support Stephen Hemminger
` (3 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add link state change (LSC) notification support using a netlink
socket subscribed to RTMGRP_LINK events.
When LSC is enabled in the device configuration, the driver:
- Creates a NETLINK_ROUTE socket filtering for RTM_NEWLINK and
RTM_DELLINK messages
- Registers the socket fd with the EAL interrupt handler
- On interrupt, drains all pending netlink messages and updates
the link status for matching interface index changes
The interrupt is enabled during dev_start and disabled during
dev_stop, with proper cleanup of the netlink socket and EAL
callback registration.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 5 +
drivers/net/rtap/rtap_ethdev.c | 36 +++++++-
drivers/net/rtap/rtap_intr.c | 147 ++++++++++++++++++++++++++++++
5 files changed, 187 insertions(+), 3 deletions(-)
create mode 100644 drivers/net/rtap/rtap_intr.c
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index b8eaa805fe..36a14e9696 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -5,6 +5,7 @@
;
[Features]
Link status = Y
+Link status event = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index 8e2b15f382..86d400323c 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -19,6 +19,7 @@ endif
sources = files(
'rtap_ethdev.c',
+ 'rtap_intr.c',
'rtap_rxtx.c',
)
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 99f413f001..f73b5e317d 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -13,6 +13,7 @@
#include <ethdev_driver.h>
#include <rte_ether.h>
+#include <rte_interrupts.h>
#include <rte_log.h>
@@ -62,6 +63,7 @@ struct rtap_tx_queue {
struct rtap_pmd {
int keep_fd; /* keep alive file descriptor */
+ struct rte_intr_handle *intr_handle; /* LSC interrupt handle */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
@@ -86,4 +88,7 @@ int rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
const struct rte_eth_txconf *tx_conf);
void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+/* rtap_intr.c */
+int rtap_lsc_set(struct rte_eth_dev *dev, int set);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 277a280772..8c22021655 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -24,6 +24,7 @@
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
+#include <rte_interrupts.h>
#include <rte_kvargs.h>
#include <rte_log.h>
@@ -141,11 +142,16 @@ rtap_change_flags(struct rte_eth_dev *dev, uint32_t flags, uint32_t mask)
ifr.ifr_flags |= flags;
ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
- if (ret < 0)
+ if (ret < 0) {
PMD_LOG_ERRNO(ERR, "Unable to set flags for %s", ifr.ifr_name);
+ goto error;
+ }
+ close(sock);
+ return 0;
error:
+ ret = -errno;
close(sock);
- return (ret < 0) ? -errno : 0;
+ return ret;
}
static int
@@ -277,11 +283,18 @@ rtap_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
static int
rtap_dev_start(struct rte_eth_dev *dev)
{
- int ret = rtap_set_link_up(dev);
+ int ret;
+ ret = rtap_lsc_set(dev, 1);
if (ret != 0)
return ret;
+ ret = rtap_set_link_up(dev);
+ if (ret != 0) {
+ rtap_lsc_set(dev, 0);
+ return ret;
+ }
+
dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
@@ -298,6 +311,7 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_lsc_set(dev, 0);
rtap_set_link_down(dev);
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
@@ -508,6 +522,9 @@ rtap_dev_close(struct rte_eth_dev *dev)
close(pmd->keep_fd);
pmd->keep_fd = -1;
}
+
+ rte_intr_instance_free(pmd->intr_handle);
+ pmd->intr_handle = NULL;
}
free(dev->process_private);
@@ -585,6 +602,17 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
pmd->keep_fd = -1;
pmd->rx_drop_base = 0;
+ /* Allocate interrupt instance for link state change events */
+ pmd->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+ if (pmd->intr_handle == NULL) {
+ PMD_LOG(ERR, "Failed to allocate intr handle");
+ goto error;
+ }
+ rte_intr_type_set(pmd->intr_handle, RTE_INTR_HANDLE_EXT);
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ dev->intr_handle = pmd->intr_handle;
+ data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
+
dev->dev_ops = &rtap_ops;
/* Get the initial fd used to keep the tap device around */
@@ -623,6 +651,8 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
error:
if (pmd->keep_fd != -1)
close(pmd->keep_fd);
+ rte_intr_instance_free(pmd->intr_handle);
+ pmd->intr_handle = NULL;
return -1;
}
diff --git a/drivers/net/rtap/rtap_intr.c b/drivers/net/rtap/rtap_intr.c
new file mode 100644
index 0000000000..8a27b811e1
--- /dev/null
+++ b/drivers/net/rtap/rtap_intr.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/rtnetlink.h>
+
+#include <rte_cycles.h>
+#include <rte_interrupts.h>
+
+#include "rtap.h"
+
+/*
+ * Create a netlink socket subscribed to link state change events.
+ * Returns socket fd or -1 on failure.
+ */
+static int
+rtap_netlink_init(unsigned int groups)
+{
+ int fd;
+ struct sockaddr_nl sa = {
+ .nl_family = AF_NETLINK,
+ .nl_groups = groups,
+ };
+
+ fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC | SOCK_NONBLOCK,
+ NETLINK_ROUTE);
+ if (fd < 0) {
+ PMD_LOG_ERRNO(ERR, "netlink socket");
+ return -1;
+ }
+
+ if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+ PMD_LOG_ERRNO(ERR, "netlink bind");
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+/*
+ * Drain all pending netlink messages from socket.
+ * For each RTM_NEWLINK/RTM_DELLINK that matches our interface,
+ * update link status.
+ */
+static void
+rtap_netlink_recv(int fd, struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ unsigned int if_index = if_nametoindex(pmd->ifname);
+ char buf[4096];
+ ssize_t len;
+
+ while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
+ for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
+ NLMSG_OK(nh, (unsigned int)len);
+ nh = NLMSG_NEXT(nh, len)) {
+ struct ifinfomsg *ifi;
+
+ if (nh->nlmsg_type != RTM_NEWLINK &&
+ nh->nlmsg_type != RTM_DELLINK)
+ continue;
+
+ ifi = NLMSG_DATA(nh);
+ if ((unsigned int)ifi->ifi_index != if_index)
+ continue;
+
+ /* Link state changed for our interface */
+ rtap_link_update(dev, 0);
+ }
+ }
+}
+
+/* Interrupt handler called by EAL when netlink socket is readable */
+static void
+rtap_lsc_handler(void *cb_arg)
+{
+ struct rte_eth_dev *dev = cb_arg;
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ int fd = rte_intr_fd_get(pmd->intr_handle);
+
+ if (fd >= 0)
+ rtap_netlink_recv(fd, dev);
+}
+
+/*
+ * Enable or disable link state change interrupt.
+ * When enabled, creates a netlink socket subscribed to RTMGRP_LINK
+ * and registers it with the EAL interrupt handler.
+ */
+int
+rtap_lsc_set(struct rte_eth_dev *dev, int set)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ int ret;
+
+ /* If LSC not configured, just disable if active */
+ if (!dev->data->dev_conf.intr_conf.lsc) {
+ if (rte_intr_fd_get(pmd->intr_handle) != -1)
+ goto disable;
+ return 0;
+ }
+
+ if (set) {
+ int fd = rtap_netlink_init(RTMGRP_LINK);
+ if (fd < 0)
+ return -1;
+
+ rte_intr_fd_set(pmd->intr_handle, fd);
+ ret = rte_intr_callback_register(pmd->intr_handle,
+ rtap_lsc_handler, dev);
+ if (ret < 0) {
+ PMD_LOG(ERR, "Failed to register LSC callback: %s",
+ rte_strerror(-ret));
+ close(fd);
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ return ret;
+ }
+ return 0;
+ }
+
+disable:
+	for (unsigned int retry = 10; ; ) {
+		ret = rte_intr_callback_unregister(pmd->intr_handle,
+						   rtap_lsc_handler, dev);
+		if (ret >= 0)
+			break;
+		if (ret != -EAGAIN || retry-- == 0) {
+			PMD_LOG(ERR, "LSC callback unregister failed: %d", ret);
+			break;
+		}
+		rte_delay_ms(100);
+	}
+
+ if (rte_intr_fd_get(pmd->intr_handle) >= 0) {
+ close(rte_intr_fd_get(pmd->intr_handle));
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ }
+
+ return 0;
+}
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v5 08/10] net/rtap: add multi-process support
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (6 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 07/10] net/rtap: add link state change interrupt Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 09/10] net/rtap: add Rx interrupt support Stephen Hemminger
` (2 subsequent siblings)
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Anatoly Burakov
Add support for DPDK multi-process operation. Secondary processes
need access to the per-queue TAP file descriptors owned by the
primary process.
Implement fd sharing using the DPDK multi-process IPC mechanism:
- Primary registers an MP action handler that responds to fd
requests by sending queue fds via rte_mp_reply()
- Secondary process attaches to the existing device and requests
fds from primary via rte_mp_request_sync()
The MP action is registered on the first device probe and
unregistered when the last device is closed.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/rtap_ethdev.c | 130 ++++++++++++++++++++++++++++++
2 files changed, 131 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index 36a14e9696..fe0c88a8fc 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -14,6 +14,7 @@ Basic stats = Y
Stats per queue = Y
TSO = Y
L4 checksum offload = Y
+Multiprocess aware = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 8c22021655..9b8ad1f452 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -42,6 +42,8 @@
RTE_ETH_RX_OFFLOAD_TCP_LRO | \
RTE_ETH_RX_OFFLOAD_SCATTER)
+#define RTAP_MP_KEY "rtap_mp_send_fds"
+
#define RTAP_DEFAULT_BURST 64
#define RTAP_NUM_BUFFERS 1024
#define RTAP_MAX_QUEUES 128
@@ -53,6 +55,8 @@ static_assert(RTAP_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd li
#define RTAP_IFACE_ARG "iface"
#define RTAP_PERSIST_ARG "persist"
+static RTE_ATOMIC(unsigned int) rtap_dev_count;
+
static const char * const valid_arguments[] = {
RTAP_IFACE_ARG,
RTAP_PERSIST_ARG,
@@ -530,6 +534,8 @@ rtap_dev_close(struct rte_eth_dev *dev)
free(dev->process_private);
dev->process_private = NULL;
+ if (rte_atomic_fetch_sub_explicit(&rtap_dev_count, 1, rte_memory_order_release) == 1)
+ rte_mp_action_unregister(RTAP_MP_KEY);
return 0;
}
@@ -669,6 +675,89 @@ rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_ar
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+rtap_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, RTAP_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name);
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+	struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+	struct rte_mp_reply replies = { };
+	int ret = rte_mp_request_sync(&request, &replies, &timeout);
+	if (ret < 0 || replies.nb_received != 1) {
+		PMD_LOG(ERR, "Failed to request fds from primary: %s",
+			rte_strerror(rte_errno));
+		free(replies.msgs);
+		return -1;
+	}
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ free(reply);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++) {
+ fds[i] = reply->fds[i];
+ PMD_LOG(DEBUG, "Received queue %u fd %d from primary", i, fds[i]);
+ }
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+rtap_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (!dev) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the fds for each queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ PMD_LOG(DEBUG, "Send queue %u fd %d to secondary", i, fds[i]);
+ reply.fds[reply.num_fds++] = fds[i];
+ }
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name);
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
rtap_probe(struct rte_vdev_device *vdev)
{
@@ -683,6 +772,38 @@ rtap_probe(struct rte_vdev_device *vdev)
PMD_LOG(INFO, "Initializing %s", name);
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (!eth_dev) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &rtap_ops;
+ eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ goto error;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ goto error;
+ }
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
+ eth_dev->process_private = fds;
+
+ if (rtap_request_fds(name, eth_dev))
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
+
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
if (kvlist == NULL)
@@ -721,6 +842,15 @@ rtap_probe(struct rte_vdev_device *vdev)
if (rtap_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+	if (rte_atomic_fetch_add_explicit(&rtap_dev_count, 1, rte_memory_order_acquire) == 0) {
+		if (rte_mp_action_register(RTAP_MP_KEY, rtap_mp_send_fds) < 0) {
+			PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+				rte_strerror(rte_errno));
+			rte_atomic_fetch_sub_explicit(&rtap_dev_count, 1,
+						      rte_memory_order_release);
+			goto error;
+		}
+	}
+
rte_eth_dev_probing_finish(eth_dev);
rte_kvargs_free(kvlist);
return 0;
--
2.51.0
* [PATCH v5 09/10] net/rtap: add Rx interrupt support
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (7 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 08/10] net/rtap: add multi-process support Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 10/10] test: add unit tests for rtap PMD Stephen Hemminger
2026-02-10 9:18 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Morten Brørup
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for the DPDK Rx interrupt mechanism, enabling
power-aware applications (e.g. l3fwd-power) to sleep until
packets arrive rather than busy-polling.
Each Rx queue creates an eventfd during queue setup and registers
it with its io_uring instance via io_uring_register_eventfd().
When the kernel posts a CQE (completing a read, i.e. a packet
arrived), it signals the eventfd. These per-queue eventfds are
wired into a VDEV interrupt handle during dev_start when the
application has set intr_conf.rxq.
The enable op drains the eventfd counter to re-arm notification;
disable is a no-op since the application simply stops polling.
The eventfd is created unconditionally at queue setup so it is
available if the application enables Rx interrupts later.
The Rx interrupt handle is kept separate from the existing LSC
netlink interrupt handle to avoid coupling the two mechanisms.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/rtap.h | 6 ++
drivers/net/rtap/rtap_ethdev.c | 16 ++++
drivers/net/rtap/rtap_intr.c | 120 ++++++++++++++++++++++++++++++
drivers/net/rtap/rtap_rxtx.c | 31 +++++++-
5 files changed, 173 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index fe0c88a8fc..48fe3f1b33 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -6,6 +6,7 @@
[Features]
Link status = Y
Link status event = Y
+Rx interrupt = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index f73b5e317d..f37cac87ad 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -42,6 +42,7 @@ extern int rtap_logtype;
struct rtap_rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted read's */
+ int intr_fd; /* eventfd for Rx interrupt */
uint16_t port_id;
uint16_t queue_id;
@@ -64,6 +65,7 @@ struct rtap_tx_queue {
struct rtap_pmd {
int keep_fd; /* keep alive file descriptor */
struct rte_intr_handle *intr_handle; /* LSC interrupt handle */
+ struct rte_intr_handle *rx_intr_handle; /* Rx queue interrupt handle */
char ifname[IFNAMSIZ]; /* name assigned by kernel */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
@@ -90,5 +92,9 @@ void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
/* rtap_intr.c */
int rtap_lsc_set(struct rte_eth_dev *dev, int set);
+int rtap_rx_intr_vec_install(struct rte_eth_dev *dev);
+void rtap_rx_intr_vec_uninstall(struct rte_eth_dev *dev);
+int rtap_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id);
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 9b8ad1f452..8a9fb5be85 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -293,8 +293,18 @@ rtap_dev_start(struct rte_eth_dev *dev)
if (ret != 0)
return ret;
+ /* Install Rx interrupt vector if requested by application */
+ if (dev->data->dev_conf.intr_conf.rxq) {
+ ret = rtap_rx_intr_vec_install(dev);
+ if (ret != 0) {
+ rtap_lsc_set(dev, 0);
+ return ret;
+ }
+ }
+
ret = rtap_set_link_up(dev);
if (ret != 0) {
+ rtap_rx_intr_vec_uninstall(dev);
rtap_lsc_set(dev, 0);
return ret;
}
@@ -315,6 +325,7 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_rx_intr_vec_uninstall(dev);
rtap_lsc_set(dev, 0);
rtap_set_link_down(dev);
@@ -527,6 +538,9 @@ rtap_dev_close(struct rte_eth_dev *dev)
pmd->keep_fd = -1;
}
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+
rte_intr_instance_free(pmd->intr_handle);
pmd->intr_handle = NULL;
}
@@ -597,6 +611,8 @@ static const struct eth_dev_ops rtap_ops = {
.rx_queue_release = rtap_rx_queue_release,
.tx_queue_setup = rtap_tx_queue_setup,
.tx_queue_release = rtap_tx_queue_release,
+ .rx_queue_intr_enable = rtap_rx_queue_intr_enable,
+ .rx_queue_intr_disable = rtap_rx_queue_intr_disable,
};
static int
diff --git a/drivers/net/rtap/rtap_intr.c b/drivers/net/rtap/rtap_intr.c
index 8a27b811e1..231666efae 100644
--- a/drivers/net/rtap/rtap_intr.c
+++ b/drivers/net/rtap/rtap_intr.c
@@ -145,3 +145,123 @@ rtap_lsc_set(struct rte_eth_dev *dev, int set)
return 0;
}
+
+/*
+ * Install per-queue Rx interrupt vector.
+ *
+ * Each Rx queue has an eventfd registered with its io_uring instance.
+ * When a CQE is posted (packet received), the kernel signals the eventfd.
+ * This function wires those eventfds into an rte_intr_handle so that
+ * DPDK's interrupt framework (rte_epoll_wait) can poll them.
+ *
+ * Only called when dev_conf.intr_conf.rxq is set.
+ */
+int
+rtap_rx_intr_vec_install(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ uint16_t nb_rx = dev->data->nb_rx_queues;
+
+ if (pmd->rx_intr_handle != NULL) {
+ PMD_LOG(DEBUG, "Rx interrupt vector already installed");
+ return 0;
+ }
+
+ pmd->rx_intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
+ if (pmd->rx_intr_handle == NULL) {
+ PMD_LOG(ERR, "Failed to allocate Rx intr handle");
+ return -ENOMEM;
+ }
+
+ if (rte_intr_type_set(pmd->rx_intr_handle, RTE_INTR_HANDLE_VDEV) < 0)
+ goto error;
+
+ if (rte_intr_nb_efd_set(pmd->rx_intr_handle, nb_rx) < 0)
+ goto error;
+
+ if (rte_intr_max_intr_set(pmd->rx_intr_handle, nb_rx + 1) < 0)
+ goto error;
+
+ for (uint16_t i = 0; i < nb_rx; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+
+ if (rxq == NULL || rxq->intr_fd < 0) {
+ PMD_LOG(ERR, "Rx queue %u not ready for interrupts", i);
+ goto error;
+ }
+
+ if (rte_intr_efds_index_set(pmd->rx_intr_handle, i,
+ rxq->intr_fd) < 0) {
+ PMD_LOG(ERR, "Failed to set efd for queue %u", i);
+ goto error;
+ }
+ }
+
+ dev->intr_handle = pmd->rx_intr_handle;
+ PMD_LOG(DEBUG, "Rx interrupt vector installed for %u queues", nb_rx);
+ return 0;
+
+error:
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+ return -1;
+}
+
+/*
+ * Remove per-queue Rx interrupt vector.
+ * Restores dev->intr_handle to the LSC handle.
+ */
+void
+rtap_rx_intr_vec_uninstall(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ if (pmd->rx_intr_handle == NULL)
+ return;
+
+ /* Restore LSC handle as device interrupt handle */
+ dev->intr_handle = pmd->intr_handle;
+
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+
+ PMD_LOG(DEBUG, "Rx interrupt vector uninstalled");
+}
+
+/*
+ * Enable Rx interrupt for a queue.
+ *
+ * Drain any pending eventfd notification so the next CQE
+ * triggers a fresh wakeup in rte_epoll_wait().
+ */
+int
+rtap_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[queue_id];
+ uint64_t val;
+
+ if (rxq == NULL || rxq->intr_fd < 0)
+ return -EINVAL;
+
+ /* Drain the eventfd counter to re-arm notification */
+ if (read(rxq->intr_fd, &val, sizeof(val)) < 0 && errno != EAGAIN) {
+ PMD_LOG(ERR, "eventfd drain failed queue %u: %s",
+ queue_id, strerror(errno));
+ return -errno;
+ }
+
+ return 0;
+}
+
+/*
+ * Disable Rx interrupt for a queue.
+ *
+ * Nothing to do - the eventfd stays registered with io_uring
+ * but the application simply stops polling it.
+ */
+int
+rtap_rx_queue_intr_disable(struct rte_eth_dev *dev __rte_unused,
+ uint16_t queue_id __rte_unused)
+{
+ return 0;
+}
diff --git a/drivers/net/rtap/rtap_rxtx.c b/drivers/net/rtap/rtap_rxtx.c
index c972ab4ca0..87d181eded 100644
--- a/drivers/net/rtap/rtap_rxtx.c
+++ b/drivers/net/rtap/rtap_rxtx.c
@@ -8,6 +8,7 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
+#include <sys/eventfd.h>
#include <liburing.h>
#include <sys/uio.h>
#include <linux/virtio_net.h>
@@ -369,6 +370,7 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
rxq->mb_pool = mb_pool;
rxq->port_id = dev->data->port_id;
rxq->queue_id = queue_id;
+ rxq->intr_fd = -1;
dev->data->rx_queues[queue_id] = rxq;
if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
@@ -376,10 +378,26 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
goto error_rxq_free;
}
+ /*
+ * Create an eventfd for Rx interrupt notification.
+ * io_uring will signal this fd whenever a CQE is posted,
+ * enabling power-aware applications to sleep until packets arrive.
+ */
+ rxq->intr_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+ if (rxq->intr_fd < 0) {
+ PMD_LOG(ERR, "eventfd failed: %s", strerror(errno));
+ goto error_iouring_exit;
+ }
+
+ if (io_uring_register_eventfd(&rxq->io_ring, rxq->intr_fd) < 0) {
+ PMD_LOG(ERR, "io_uring_register_eventfd failed: %s", strerror(errno));
+ goto error_eventfd_close;
+ }
+
mbufs = calloc(nb_rx_desc, sizeof(struct rte_mbuf *));
if (mbufs == NULL) {
PMD_LOG(ERR, "Rx mbuf pointer alloc failed");
- goto error_iouring_exit;
+ goto error_eventfd_close;
}
/* open shared tap fd maybe already setup */
@@ -429,6 +447,11 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
}
rtap_queue_close(dev, queue_id);
free(mbufs);
+error_eventfd_close:
+ if (rxq->intr_fd >= 0) {
+ close(rxq->intr_fd);
+ rxq->intr_fd = -1;
+ }
error_iouring_exit:
io_uring_queue_exit(&rxq->io_ring);
error_rxq_free:
@@ -503,6 +526,12 @@ rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
if (rxq == NULL)
return;
+ if (rxq->intr_fd >= 0) {
+ io_uring_unregister_eventfd(&rxq->io_ring);
+ close(rxq->intr_fd);
+ rxq->intr_fd = -1;
+ }
+
rtap_cancel_all(&rxq->io_ring);
io_uring_queue_exit(&rxq->io_ring);
--
2.51.0
* [PATCH v5 10/10] test: add unit tests for rtap PMD
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (8 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 09/10] net/rtap: add Rx interrupt support Stephen Hemminger
@ 2026-02-09 18:39 ` Stephen Hemminger
2026-02-10 9:18 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Morten Brørup
10 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-09 18:39 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add a comprehensive test suite for the rtap poll mode driver covering:
- Device creation and configuration
- Device info query
- Link status and link up/down control
- Statistics collection and reset
- Promiscuous and all-multicast mode toggling
- MAC address and MTU configuration
- Multi-queue setup, teardown, and queue count reduction
- Mismatched rx/tx queue count rejection
- Queue reconfiguration (stop, reconfigure, restart)
- Rx path: inject packet via AF_PACKET, receive via DPDK
- Tx path: transmit via DPDK, capture via AF_PACKET
- Multi-segment Tx with chained mbufs
- Multi-segment Rx with small-buffer mempool
- Offload capability reporting and configuration
- Tx UDP checksum offload end-to-end verification
- File descriptor leak detection after device close
- Command-line created device testing
Tests require CAP_NET_ADMIN or root privileges.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
app/test/meson.build | 1 +
app/test/test_pmd_rtap.c | 2044 ++++++++++++++++++++++++++++++++++++++
2 files changed, 2045 insertions(+)
create mode 100644 app/test/test_pmd_rtap.c
diff --git a/app/test/meson.build b/app/test/meson.build
index 48874037eb..5855077066 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ source_file_deps = {
'test_pie.c': ['sched'],
'test_pmd_perf.c': ['ethdev', 'net'] + packet_burst_generator_deps,
'test_pmd_ring.c': ['net_ring', 'ethdev', 'bus_vdev'],
+ 'test_pmd_rtap.c': ['net_rtap', 'ethdev', 'bus_vdev'],
'test_pmd_ring_perf.c': ['ethdev', 'net_ring', 'bus_vdev'],
'test_pmu.c': ['pmu'],
'test_power.c': ['power', 'power_acpi', 'power_kvm_vm', 'power_intel_pstate',
diff --git a/app/test/test_pmd_rtap.c b/app/test/test_pmd_rtap.c
new file mode 100644
index 0000000000..5544c21c85
--- /dev/null
+++ b/app/test/test_pmd_rtap.c
@@ -0,0 +1,2044 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Stephen Hemminger
+ */
+#include <errno.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <sys/ioctl.h>
+#include <sys/select.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <rte_common.h>
+#include <rte_stdatomic.h>
+#include <rte_ethdev.h>
+#include <rte_bus_vdev.h>
+#include <rte_ether.h>
+#include <rte_errno.h>
+#include <rte_mbuf.h>
+#include <rte_mempool.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_epoll.h>
+
+#include "test.h"
+
+#define SOCKET0 0
+#define RING_SIZE 256
+#define NB_MBUF 4096
+#define MAX_MULTI_QUEUES 4
+
+#define RTAP_DRIVER_NAME "net_rtap"
+#define TEST_TAP_NAME "rtap_test0"
+
+/* TX/RX test parameters */
+#define TEST_PKT_PAYLOAD_LEN 64
+#define TEST_MAGIC_BYTE 0xAB
+#define RX_BURST_MAX 32
+#define TX_RX_TIMEOUT_US 100000 /* 100ms */
+#define TX_RX_POLL_US 1000 /* 1ms between polls */
+
+static struct rte_mempool *mp;
+static int rtap_port0 = -1;
+static int rtap_port1 = -1;
+
+/* ========== Helper Functions ========== */
+
+static int
+check_rtap_available(void)
+{
+ int fd = open("/dev/net/tun", O_RDWR);
+ if (fd < 0) {
+ printf("Cannot access /dev/net/tun: %s\n", strerror(errno));
+ printf("RTAP PMD tests require CAP_NET_ADMIN or root privileges\n");
+ return -1;
+ }
+ close(fd);
+ return 0;
+}
+
+/* Configure port with specified number of queue pairs */
+static int
+port_configure(int port, uint16_t nb_queues, const struct rte_eth_conf *conf)
+{
+ struct rte_eth_conf null_conf;
+
+ if (conf == NULL) {
+ memset(&null_conf, 0, sizeof(null_conf));
+ conf = &null_conf;
+ }
+
+ if (rte_eth_dev_configure(port, nb_queues, nb_queues, conf) < 0) {
+ printf("Configure failed for port %d with %u queues\n",
+ port, nb_queues);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Setup queue pairs for a port */
+static int
+port_setup_queues(int port, uint16_t nb_queues, uint16_t ring_size,
+ struct rte_mempool *mempool)
+{
+ for (uint16_t q = 0; q < nb_queues; q++) {
+ if (rte_eth_tx_queue_setup(port, q, ring_size, SOCKET0, NULL) < 0) {
+ printf("TX queue %u setup failed port %d\n", q, port);
+ return -1;
+ }
+
+ if (rte_eth_rx_queue_setup(port, q, ring_size, SOCKET0,
+ NULL, mempool) < 0) {
+ printf("RX queue %u setup failed port %d\n", q, port);
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+/* Stop, configure, setup queues, and start a port */
+static int
+port_reconfigure(int port, uint16_t nb_queues, const struct rte_eth_conf *conf,
+ uint16_t ring_size, struct rte_mempool *mempool)
+{
+ int ret;
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0) {
+ printf("Error stopping port %d: %s\n", port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (port_configure(port, nb_queues, conf) < 0)
+ return -1;
+
+ if (port_setup_queues(port, nb_queues, ring_size, mempool) < 0)
+ return -1;
+
+ if (rte_eth_dev_start(port) < 0) {
+ printf("Error starting port %d\n", port);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Restore port to clean single-queue started state */
+static int
+restore_single_queue(int port)
+{
+ return port_reconfigure(port, 1, NULL, RING_SIZE, mp);
+}
+
+/* Verify link status matches expected */
+static int
+verify_link_status(int port, uint8_t expected_status)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Error getting link status: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ if (link.link_status != expected_status) {
+ printf("Error: link should be %s but is %s\n",
+ expected_status ? "UP" : "DOWN",
+ link.link_status ? "UP" : "DOWN");
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Get device info with error checking */
+static int
+get_dev_info(int port, struct rte_eth_dev_info *dev_info)
+{
+ int ret = rte_eth_dev_info_get(port, dev_info);
+ if (ret != 0) {
+ printf("Error getting device info for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+ return 0;
+}
+
+/* Reset and verify stats are zero */
+static int
+reset_and_verify_stats_zero(int port)
+{
+ struct rte_eth_stats stats;
+ int ret;
+
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf("Error: stats_reset failed for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ ret = rte_eth_stats_get(port, &stats);
+ if (ret != 0) {
+ printf("Error: stats_get failed for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (stats.ipackets != 0 || stats.opackets != 0 ||
+ stats.ibytes != 0 || stats.obytes != 0 ||
+ stats.ierrors != 0 || stats.oerrors != 0) {
+ printf("Error: port %d stats are not zero after reset\n", port);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Drain all pending RX packets from a port */
+static void
+drain_rx_queue(int port, uint16_t queue_id)
+{
+ struct rte_mbuf *drain[RX_BURST_MAX];
+ uint16_t n;
+
+ do {
+ n = rte_eth_rx_burst(port, queue_id, drain, RX_BURST_MAX);
+ rte_pktmbuf_free_bulk(drain, n);
+ } while (n > 0);
+}
+
+/* Set Ethernet address to broadcast */
+static inline void
+eth_addr_bcast(struct rte_ether_addr *addr)
+{
+ memset(addr, 0xff, RTE_ETHER_ADDR_LEN);
+}
+
+/* Bring TAP interface up using ioctl */
+static int
+tap_set_up(const char *ifname)
+{
+ struct ifreq ifr;
+ int sock, ret = -1;
+
+ sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -1;
+
+ memset(&ifr, 0, sizeof(ifr));
+ strlcpy(ifr.ifr_name, ifname, IFNAMSIZ);
+
+ if (ioctl(sock, SIOCGIFFLAGS, &ifr) < 0)
+ goto out;
+
+ ifr.ifr_flags |= IFF_UP;
+
+ if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0)
+ goto out;
+
+ ret = 0;
+out:
+ close(sock);
+ return ret;
+}
+
+/* Open an AF_PACKET socket bound to the TAP interface */
+static int
+open_tap_socket(const char *ifname)
+{
+ int sock, ifindex;
+ struct sockaddr_ll sll;
+
+ sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (sock < 0) {
+ printf("socket(AF_PACKET) failed: %s\n", strerror(errno));
+ return -1;
+ }
+
+ ifindex = if_nametoindex(ifname);
+ if (ifindex == 0) {
+ printf("if_nametoindex(%s) failed: %s\n", ifname, strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&sll, 0, sizeof(sll));
+ sll.sll_family = AF_PACKET;
+ sll.sll_ifindex = ifindex;
+ sll.sll_protocol = htons(ETH_P_ALL);
+
+ if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+ printf("bind() failed: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ return sock;
+}
+
+/* Setup TAP socket with non-blocking mode and bring interface up */
+static int
+setup_tap_socket_nb(const char *ifname)
+{
+ int sock, flags;
+
+ if (tap_set_up(ifname) < 0) {
+ printf("Failed to bring TAP interface up\n");
+ return -1;
+ }
+
+ sock = open_tap_socket(ifname);
+ if (sock < 0)
+ return -1;
+
+	flags = fcntl(sock, F_GETFL, 0);
+	if (flags < 0 || fcntl(sock, F_SETFL, flags | O_NONBLOCK) < 0) {
+		printf("fcntl(O_NONBLOCK) failed: %s\n", strerror(errno));
+		close(sock);
+		return -1;
+	}
+
+ return sock;
+}
+
+/* Build a basic test packet with broadcast dest and magic byte payload */
+static void
+build_test_packet(uint8_t *pkt, size_t pkt_len,
+ const struct rte_ether_addr *src_mac,
+ const struct rte_ether_addr *dst_mac)
+{
+ struct rte_ether_hdr *eth = (struct rte_ether_hdr *)pkt;
+
+	if (dst_mac)
+		memcpy(&eth->dst_addr, dst_mac, RTE_ETHER_ADDR_LEN);
+	else
+		eth_addr_bcast(&eth->dst_addr);
+
+	if (src_mac)
+		memcpy(&eth->src_addr, src_mac, RTE_ETHER_ADDR_LEN);
+	else
+		rte_eth_random_addr(eth->src_addr.addr_bytes);
+
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+ memset(pkt + RTE_ETHER_HDR_LEN, TEST_MAGIC_BYTE,
+ pkt_len - RTE_ETHER_HDR_LEN);
+}
+
+/* Poll AF_PACKET socket for a packet matching the given pattern */
+static ssize_t
+poll_tap_socket(int sock, uint8_t *buf, size_t buf_size,
+ uint8_t match_byte, size_t match_offset)
+{
+ struct timeval tv;
+ fd_set fds;
+ int elapsed = 0;
+ ssize_t rx_len;
+
+ while (elapsed < TX_RX_TIMEOUT_US) {
+ FD_ZERO(&fds);
+ FD_SET(sock, &fds);
+ tv.tv_sec = 0;
+ tv.tv_usec = TX_RX_POLL_US;
+
+ if (select(sock + 1, &fds, NULL, NULL, &tv) > 0) {
+ rx_len = recv(sock, buf, buf_size, 0);
+ if (rx_len > 0 && (size_t)rx_len > match_offset &&
+ buf[match_offset] == match_byte)
+ return rx_len;
+ }
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return 0; /* Timeout */
+}
+
+/* Receive packets from DPDK port, filtering for our test packets */
+static uint16_t
+receive_test_packets(int port, uint16_t queue_id, struct rte_mbuf **rx_mbufs,
+ uint16_t max_pkts, size_t expected_len, uint8_t magic_byte)
+{
+ uint16_t nb_rx = 0;
+ int elapsed = 0;
+
+ while (elapsed < TX_RX_TIMEOUT_US && nb_rx < max_pkts) {
+ struct rte_mbuf *burst[RX_BURST_MAX];
+ uint16_t n = rte_eth_rx_burst(port, queue_id, burst, RX_BURST_MAX);
+
+ for (uint16_t i = 0; i < n; i++) {
+ uint8_t *d = rte_pktmbuf_mtod(burst[i], uint8_t *);
+
+ if (burst[i]->pkt_len == expected_len &&
+ d[RTE_ETHER_HDR_LEN] == magic_byte) {
+ rx_mbufs[nb_rx++] = burst[i];
+ if (nb_rx >= max_pkts)
+ break;
+ } else {
+ rte_pktmbuf_free(burst[i]);
+ }
+ }
+
+ if (nb_rx > 0)
+ break;
+
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return nb_rx;
+}
+
+/* Wait for event with timeout using polling */
+static int
+wait_for_event(RTE_ATOMIC(int) *event_flag, int initial_count, int timeout_us)
+{
+ int elapsed = 0;
+
+ while (elapsed < timeout_us) {
+ if (rte_atomic_load_explicit(event_flag, rte_memory_order_seq_cst) > initial_count)
+ return 0;
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return -1; /* Timeout */
+}
+
+/* Count open file descriptors */
+static int
+count_open_fds(void)
+{
+ DIR *d;
+ struct dirent *de;
+ int count = 0;
+
+ d = opendir("/proc/self/fd");
+ if (d == NULL)
+ return -1;
+
+ while ((de = readdir(d)) != NULL) {
+ if (de->d_name[0] != '.')
+ count++;
+ }
+ closedir(d);
+ return count - 1; /* Subtract dirfd itself */
+}
+
+/* ========== Test Functions ========== */
+
+static int
+test_ethdev_configure_port(int port)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ if (port_reconfigure(port, 1, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Link get failed for port %u: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_get_stats(int port)
+{
+ printf("Testing rtap PMD stats_get port %d\n", port);
+ return reset_and_verify_stats_zero(port);
+}
+
+static int
+test_stats_reset(int port)
+{
+ printf("Testing rtap PMD stats_reset port %d\n", port);
+ return reset_and_verify_stats_zero(port);
+}
+
+static int
+test_dev_info(int port)
+{
+ struct rte_eth_dev_info dev_info;
+
+ printf("Testing rtap PMD dev_info_get port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ if (dev_info.max_rx_queues == 0 || dev_info.max_tx_queues == 0) {
+ printf("Error: invalid max queue values\n");
+ return -1;
+ }
+
+ if (dev_info.max_mac_addrs != 1) {
+ printf("Error: expected max_mac_addrs = 1, got %u\n",
+ dev_info.max_mac_addrs);
+ return -1;
+ }
+
+ printf(" driver_name: %s\n", dev_info.driver_name);
+ printf(" if_index: %u\n", dev_info.if_index);
+ printf(" max_rx_queues: %u\n", dev_info.max_rx_queues);
+ printf(" max_tx_queues: %u\n", dev_info.max_tx_queues);
+
+ return 0;
+}
+
+static int
+test_link_status(int port)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ printf("Testing rtap PMD link status port %d\n", port);
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Error getting link status for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" link_status: %s\n", link.link_status ? "UP" : "DOWN");
+ printf(" link_speed: %u\n", link.link_speed);
+ printf(" link_duplex: %s\n",
+ link.link_duplex ? "full-duplex" : "half-duplex");
+
+ return 0;
+}
+
+static int
+test_set_link_up_down(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD link up/down port %d\n", port);
+
+ ret = rte_eth_dev_set_link_down(port);
+ if (ret < 0) {
+ printf("Error setting link down for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_DOWN) < 0)
+ return -1;
+
+ ret = rte_eth_dev_set_link_up(port);
+ if (ret < 0) {
+ printf("Error setting link up for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0)
+ return -1;
+
+ return 0;
+}
+
+static int
+test_promiscuous_mode(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD promiscuous mode port %d\n", port);
+
+ ret = rte_eth_promiscuous_enable(port);
+ if (ret < 0) {
+ printf("Error enabling promiscuous mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_promiscuous_get(port) != 1) {
+ printf("Error: promiscuous mode should be enabled\n");
+ return -1;
+ }
+
+ ret = rte_eth_promiscuous_disable(port);
+ if (ret < 0) {
+ printf("Error disabling promiscuous mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_promiscuous_get(port) != 0) {
+ printf("Error: promiscuous mode should be disabled\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_allmulticast_mode(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD allmulticast mode port %d\n", port);
+
+ ret = rte_eth_allmulticast_enable(port);
+ if (ret < 0) {
+ printf("Error enabling allmulticast mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_allmulticast_get(port) != 1) {
+ printf("Error: allmulticast mode should be enabled\n");
+ return -1;
+ }
+
+ ret = rte_eth_allmulticast_disable(port);
+ if (ret < 0) {
+ printf("Error disabling allmulticast mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_allmulticast_get(port) != 0) {
+ printf("Error: allmulticast mode should be disabled\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_mac_address(int port)
+{
+ struct rte_ether_addr mac_addr;
+ struct rte_ether_addr new_mac = {
+ .addr_bytes = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}
+ };
+ int ret;
+
+ printf("Testing rtap PMD MAC address port %d\n", port);
+
+ ret = rte_eth_macaddr_get(port, &mac_addr);
+ if (ret < 0) {
+ printf("Error getting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Current MAC: " RTE_ETHER_ADDR_PRT_FMT "\n",
+ RTE_ETHER_ADDR_BYTES(&mac_addr));
+
+ ret = rte_eth_dev_default_mac_addr_set(port, &new_mac);
+ if (ret < 0) {
+ printf("Error setting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ ret = rte_eth_macaddr_get(port, &mac_addr);
+ if (ret < 0) {
+ printf("Error getting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ if (!rte_is_same_ether_addr(&mac_addr, &new_mac)) {
+ printf("Error: MAC address not set correctly\n");
+ return -1;
+ }
+
+ printf(" New MAC: " RTE_ETHER_ADDR_PRT_FMT "\n",
+ RTE_ETHER_ADDR_BYTES(&mac_addr));
+
+ return 0;
+}
+
+static int
+test_mtu_set(int port)
+{
+ uint16_t orig_mtu;
+ uint16_t new_mtu = 9000;
+ int ret;
+
+ printf("Testing rtap PMD MTU set port %d\n", port);
+
+ ret = rte_eth_dev_get_mtu(port, &orig_mtu);
+ if (ret < 0) {
+ printf("Error getting MTU: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Original MTU: %u\n", orig_mtu);
+
+ ret = rte_eth_dev_set_mtu(port, new_mtu);
+ if (ret < 0) {
+ printf("Warning: setting MTU to %u failed: %s\n",
+ new_mtu, rte_strerror(-ret));
+ return 0;
+ }
+
+ uint16_t current_mtu;
+ ret = rte_eth_dev_get_mtu(port, &current_mtu);
+ if (ret < 0) {
+ printf("Error getting MTU: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" New MTU: %u\n", current_mtu);
+ rte_eth_dev_set_mtu(port, orig_mtu);
+
+ return 0;
+}
+
+static int
+test_queue_reconfigure(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ int ret;
+
+ printf("Testing rtap PMD queue reconfigure port %d\n", port);
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0) {
+ printf("Error getting device info: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Configured with %u rx and %u tx queues\n",
+ dev_info.nb_rx_queues, dev_info.nb_tx_queues);
+
+ if (restore_single_queue(port) < 0)
+ return -1;
+
+ return 0;
+}
+
+static int
+test_multiqueue(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ uint16_t queue_counts[] = { 1, 2, MAX_MULTI_QUEUES };
+
+ printf("Testing rtap PMD multi-queue port %d\n", port);
+
+ for (unsigned int t = 0; t < RTE_DIM(queue_counts); t++) {
+ uint16_t nb_queues = queue_counts[t];
+
+ printf(" Configuring %u queue pair(s)\n", nb_queues);
+
+ if (port_reconfigure(port, nb_queues, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ if (dev_info.nb_rx_queues != nb_queues ||
+ dev_info.nb_tx_queues != nb_queues) {
+ printf("Error: queue count mismatch\n");
+ return -1;
+ }
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ /* Verify per-queue xstats are zero */
+ int num_xstats = rte_eth_xstats_get(port, NULL, 0);
+ if (num_xstats > 0) {
+ struct rte_eth_xstat *xstats = malloc(sizeof(*xstats) * num_xstats);
+ struct rte_eth_xstat_name *xstat_names =
+ malloc(sizeof(*xstat_names) * num_xstats);
+
+ if (xstats == NULL || xstat_names == NULL) {
+ free(xstats);
+ free(xstat_names);
+ printf("Error: xstats alloc failed\n");
+ return -1;
+ }
+
+ rte_eth_xstats_get_names(port, xstat_names, num_xstats);
+ rte_eth_xstats_get(port, xstats, num_xstats);
+
+ for (int x = 0; x < num_xstats; x++) {
+ if (strstr(xstat_names[x].name, "_q") != NULL &&
+ xstats[x].value != 0) {
+ printf("Error: xstat %s = %" PRIu64 " not zero\n",
+ xstat_names[x].name, xstats[x].value);
+ free(xstats);
+ free(xstat_names);
+ return -1;
+ }
+ }
+ free(xstats);
+ free(xstat_names);
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0)
+ return -1;
+
+ printf(" %u queue pair(s): OK\n", nb_queues);
+ }
+
+ if (restore_single_queue(port) < 0) {
+ printf("Error restoring single queue\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_multiqueue_reduce(int port)
+{
+ printf("Testing rtap PMD queue reduction port %d\n", port);
+
+ if (port_reconfigure(port, MAX_MULTI_QUEUES, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ printf(" Started with %d queues, reducing to 2\n", MAX_MULTI_QUEUES);
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ printf(" Reduced to 2 queues: OK\n");
+ printf(" Reducing to 1 queue\n");
+
+ if (restore_single_queue(port) < 0)
+ return -1;
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ printf(" Reduced to 1 queue: OK\n");
+ return 0;
+}
+
+static int
+test_multiqueue_mismatch(int port)
+{
+ int ret;
+ struct { uint16_t rx; uint16_t tx; } mismatch[] = {
+ { 1, 2 }, { 2, 1 }, { 4, 2 }, { 1, 4 },
+ };
+
+ printf("Testing rtap PMD mismatched queue rejection port %d\n", port);
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0) {
+ printf("Error stopping port: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ for (unsigned int i = 0; i < RTE_DIM(mismatch); i++) {
+ struct rte_eth_conf null_conf;
+ memset(&null_conf, 0, sizeof(null_conf));
+
+ ret = rte_eth_dev_configure(port, mismatch[i].rx,
+ mismatch[i].tx, &null_conf);
+ if (ret == 0) {
+ printf("Error: configure(%u rx, %u tx) should fail\n",
+ mismatch[i].rx, mismatch[i].tx);
+ rte_eth_dev_stop(port);
+ return -1;
+ }
+ printf(" Rejected %u rx / %u tx: OK\n",
+ mismatch[i].rx, mismatch[i].tx);
+ }
+
+ if (restore_single_queue(port) < 0) {
+ printf("Error restoring single queue\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_rx_inject(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+ uint8_t pkt[RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN];
+ int sock = -1;
+ uint16_t nb_rx;
+ int ret = -1;
+
+ printf("Testing rtap PMD RX (inject via AF_PACKET)\n");
+
+ if (restore_single_queue(port) < 0) {
+ printf("Failed to restore single queue config\n");
+ return -1;
+ }
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+ drain_rx_queue(port, 0);
+ rte_eth_stats_reset(port);
+
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto out;
+ }
+
+ nb_rx = receive_test_packets(port, 0, rx_mbufs, 1, sizeof(pkt),
+ TEST_MAGIC_BYTE);
+
+ if (nb_rx == 0) {
+ printf("No packet received after %d us\n", TX_RX_TIMEOUT_US);
+ goto out;
+ }
+
+ uint8_t *rx_data = rte_pktmbuf_mtod(rx_mbufs[0], uint8_t *);
+ if (rx_data[RTE_ETHER_HDR_LEN] != TEST_MAGIC_BYTE) {
+ printf("Payload mismatch\n");
+ goto free_rx;
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.ipackets == 0) {
+ printf("RX stats not updated\n");
+ goto free_rx;
+ }
+
+ printf(" RX inject test PASSED (received %u packets)\n", nb_rx);
+ ret = 0;
+
+free_rx:
+ for (uint16_t i = 0; i < nb_rx; i++)
+ rte_pktmbuf_free(rx_mbufs[i]);
+out:
+ close(sock);
+ return ret;
+}
+
+static int
+test_tx_capture(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *tx_mbuf;
+ struct rte_ether_hdr *eth;
+ uint8_t rx_buf[256];
+ int sock = -1;
+ uint16_t nb_tx;
+ ssize_t rx_len;
+ int ret = -1;
+
+ printf("Testing rtap PMD TX (capture via AF_PACKET)\n");
+
+ if (restore_single_queue(port) < 0) {
+ printf("Failed to restore single queue config\n");
+ return -1;
+ }
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ tx_mbuf = rte_pktmbuf_alloc(mp);
+ if (tx_mbuf == NULL) {
+ printf("Failed to allocate mbuf\n");
+ goto out;
+ }
+
+ eth = rte_pktmbuf_mtod(tx_mbuf, struct rte_ether_hdr *);
+ eth_addr_bcast(&eth->dst_addr);
+ memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ uint8_t *payload = (uint8_t *)(eth + 1);
+ memset(payload, TEST_MAGIC_BYTE, TEST_PKT_PAYLOAD_LEN);
+
+ tx_mbuf->data_len = RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN;
+ tx_mbuf->pkt_len = tx_mbuf->data_len;
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &tx_mbuf, 1);
+ if (nb_tx != 1) {
+ printf("TX failed\n");
+ rte_pktmbuf_free(tx_mbuf);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ TEST_MAGIC_BYTE, RTE_ETHER_HDR_LEN);
+
+ if (rx_len <= 0) {
+ printf("No packet captured after %d us\n", TX_RX_TIMEOUT_US);
+ goto out;
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" TX capture test PASSED (captured %zd bytes)\n", rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define MSEG_NUM_SEGS 3
+#define MSEG_SEG_LEN 40
+
+static int
+test_tx_multiseg(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *head, *seg, *prev;
+ struct rte_ether_hdr *eth;
+ uint8_t rx_buf[512];
+ int sock = -1;
+ uint16_t nb_tx;
+ ssize_t rx_len;
+ int ret = -1;
+ uint16_t total_payload = MSEG_NUM_SEGS * MSEG_SEG_LEN;
+
+ printf("Testing rtap PMD multi-segment TX\n");
+
+ if (restore_single_queue(port) < 0 ||
+ rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to setup test\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ head = rte_pktmbuf_alloc(mp);
+ if (head == NULL) {
+ printf("Failed to allocate head mbuf\n");
+ goto out;
+ }
+
+ eth = rte_pktmbuf_mtod(head, struct rte_ether_hdr *);
+ eth_addr_bcast(&eth->dst_addr);
+ memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ uint8_t *p = (uint8_t *)(eth + 1);
+ memset(p, 0xA0, MSEG_SEG_LEN);
+ head->data_len = RTE_ETHER_HDR_LEN + MSEG_SEG_LEN;
+ head->pkt_len = RTE_ETHER_HDR_LEN + total_payload;
+ head->nb_segs = MSEG_NUM_SEGS;
+
+ prev = head;
+ for (int i = 1; i < MSEG_NUM_SEGS; i++) {
+ seg = rte_pktmbuf_alloc(mp);
+ if (seg == NULL) {
+ printf("Failed to allocate segment %d\n", i);
+ rte_pktmbuf_free(head);
+ goto out;
+ }
+ p = rte_pktmbuf_mtod(seg, uint8_t *);
+ memset(p, 0xA0 + i, MSEG_SEG_LEN);
+ seg->data_len = MSEG_SEG_LEN;
+ prev->next = seg;
+ prev = seg;
+ }
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &head, 1);
+ if (nb_tx != 1) {
+ printf("TX failed for multi-seg packet\n");
+ rte_pktmbuf_free(head);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ 0xA0, RTE_ETHER_HDR_LEN);
+
+ if (rx_len <= 0) {
+ printf("No packet captured\n");
+ goto out;
+ }
+
+ for (int i = 0; i < MSEG_NUM_SEGS; i++) {
+ int off = RTE_ETHER_HDR_LEN + i * MSEG_SEG_LEN;
+ uint8_t expected = 0xA0 + i;
+ if (rx_buf[off] != expected) {
+ printf("Segment %d mismatch\n", i);
+ goto out;
+ }
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" Multi-seg TX test PASSED (%d segs, captured %zd bytes)\n",
+ MSEG_NUM_SEGS, rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define MSEG_RX_BUF_SIZE 256
+#define MSEG_RX_POOL_SIZE 4096
+#define MSEG_RX_PKT_PAYLOAD 200
+
+static int
+test_rx_multiseg(int port)
+{
+ struct rte_mempool *small_mp = NULL;
+ struct rte_ether_addr mac;
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+ uint8_t pkt[RTE_ETHER_HDR_LEN + MSEG_RX_PKT_PAYLOAD];
+ int sock = -1;
+ uint16_t nb_rx;
+ int ret = -1;
+
+ printf("Testing rtap PMD multi-segment RX\n");
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ small_mp = rte_pktmbuf_pool_create("small_mbuf_pool",
+ MSEG_RX_POOL_SIZE, 32, 0, MSEG_RX_BUF_SIZE,
+ rte_socket_id());
+ if (small_mp == NULL) {
+ printf("Failed to create small mempool\n");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 1, NULL, RING_SIZE, small_mp) < 0)
+ goto free_pool;
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto restore;
+
+ drain_rx_queue(port, 0);
+
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+ memset(pkt + RTE_ETHER_HDR_LEN, 0xDD, MSEG_RX_PKT_PAYLOAD);
+
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto close_sock;
+ }
+
+ nb_rx = receive_test_packets(port, 0, rx_mbufs, 1, sizeof(pkt), 0xDD);
+
+ if (nb_rx == 0) {
+ printf("No packet received\n");
+ goto close_sock;
+ }
+
+ struct rte_mbuf *m = rx_mbufs[0];
+ printf(" Received: pkt_len=%u nb_segs=%u\n", m->pkt_len, m->nb_segs);
+
+ if (m->nb_segs < 2) {
+ printf(" Expected multi-segment mbuf, got %u segments\n",
+ m->nb_segs);
+ }
+
+ if (m->pkt_len < sizeof(pkt)) {
+ printf(" Packet too short: %u < %zu\n", m->pkt_len, sizeof(pkt));
+ goto free_rx;
+ }
+
+ /* Verify payload across segments */
+ uint32_t offset = 0;
+ struct rte_mbuf *seg = m;
+ uint32_t seg_off = 0;
+ int payload_ok = 1;
+
+ while (seg != NULL && offset < m->pkt_len) {
+ if (seg_off >= seg->data_len) {
+ seg = seg->next;
+ seg_off = 0;
+ continue;
+ }
+ if (offset >= RTE_ETHER_HDR_LEN &&
+ offset < RTE_ETHER_HDR_LEN + MSEG_RX_PKT_PAYLOAD) {
+ uint8_t *d = rte_pktmbuf_mtod_offset(seg, uint8_t *,
+ seg_off);
+ if (*d != 0xDD) {
+ printf(" Payload mismatch at offset %u\n", offset);
+ payload_ok = 0;
+ break;
+ }
+ }
+ offset++;
+ seg_off++;
+ }
+
+ if (!payload_ok)
+ goto free_rx;
+
+ printf(" Multi-seg RX test PASSED (%u segments)\n", m->nb_segs);
+ ret = 0;
+
+free_rx:
+ for (uint16_t i = 0; i < nb_rx; i++)
+ rte_pktmbuf_free(rx_mbufs[i]);
+
+close_sock:
+ close(sock);
+
+restore:
+ restore_single_queue(port);
+
+free_pool:
+ rte_mempool_free(small_mp);
+ return ret;
+}
+
+static int
+test_offload_config(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ struct rte_eth_conf conf;
+ int ret;
+
+ printf("Testing rtap PMD offload configuration port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ uint64_t expected_tx = RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
+ RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_TSO;
+
+ if ((dev_info.tx_offload_capa & expected_tx) != expected_tx) {
+ printf("Missing TX offload capabilities\n");
+ return -1;
+ }
+
+ printf(" TX offload capa: 0x%" PRIx64 " OK\n",
+ dev_info.tx_offload_capa);
+
+ memset(&conf, 0, sizeof(conf));
+ conf.txmode.offloads = RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM;
+
+ ret = port_reconfigure(port, 1, &conf, RING_SIZE, mp);
+ if (ret < 0) {
+ printf("Configure with TX offloads failed\n");
+ goto restore;
+ }
+
+ printf(" TX offload configuration: OK\n");
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+#define CSUM_PKT_PAYLOAD 32
+
+static int
+test_tx_csum_offload(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *tx_mbuf;
+ uint8_t rx_buf[256];
+ int sock = -1;
+ uint16_t nb_tx, pkt_len;
+ ssize_t rx_len = 0;
+ int ret = -1;
+
+ printf("Testing rtap PMD TX checksum offload\n");
+
+ if (restore_single_queue(port) < 0 ||
+ rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to setup test\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ tx_mbuf = rte_pktmbuf_alloc(mp);
+ if (tx_mbuf == NULL) {
+ printf("Failed to allocate mbuf\n");
+ goto out;
+ }
+
+ /* Build Eth + IPv4 + UDP + payload */
+ struct rte_ether_hdr *eth = rte_pktmbuf_mtod(tx_mbuf,
+ struct rte_ether_hdr *);
+ eth_addr_bcast(&eth->dst_addr);
+ memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
+ memset(ip, 0, sizeof(*ip));
+ ip->version_ihl = 0x45;
+ ip->total_length = htons(sizeof(*ip) + sizeof(struct rte_udp_hdr) +
+ CSUM_PKT_PAYLOAD);
+ ip->time_to_live = 64;
+ ip->next_proto_id = IPPROTO_UDP;
+ ip->src_addr = htonl(0x0a000001);
+ ip->dst_addr = htonl(0x0a000002);
+ ip->hdr_checksum = 0;
+ ip->hdr_checksum = rte_ipv4_cksum(ip);
+
+ struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);
+ udp->src_port = htons(1234);
+ udp->dst_port = htons(5678);
+ udp->dgram_len = htons(sizeof(*udp) + CSUM_PKT_PAYLOAD);
+ udp->dgram_cksum = 0;
+
+ uint8_t *payload = (uint8_t *)(udp + 1);
+ memset(payload, 0xCC, CSUM_PKT_PAYLOAD);
+
+ pkt_len = sizeof(*eth) + sizeof(*ip) + sizeof(*udp) + CSUM_PKT_PAYLOAD;
+ tx_mbuf->data_len = pkt_len;
+ tx_mbuf->pkt_len = pkt_len;
+
+ tx_mbuf->ol_flags = RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_UDP_CKSUM;
+ tx_mbuf->l2_len = sizeof(*eth);
+ tx_mbuf->l3_len = sizeof(*ip);
+ udp->dgram_cksum = rte_ipv4_phdr_cksum(ip, tx_mbuf->ol_flags);
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &tx_mbuf, 1);
+ if (nb_tx != 1) {
+ printf("TX failed\n");
+ rte_pktmbuf_free(tx_mbuf);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ 0x45, sizeof(*eth));
+
+ if (rx_len <= 0) {
+ printf("No packet captured\n");
+ goto out;
+ }
+
+ unsigned int cksum_off = sizeof(*eth) + sizeof(*ip) +
+ offsetof(struct rte_udp_hdr, dgram_cksum);
+ uint16_t captured_cksum;
+
+ memcpy(&captured_cksum, &rx_buf[cksum_off], sizeof(captured_cksum));
+
+ if (captured_cksum == 0) {
+ printf(" Warning: UDP checksum is zero\n");
+ } else {
+ printf(" UDP cksum=0x%04x\n", ntohs(captured_cksum));
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" TX csum offload PASSED (captured %zd bytes)\n", rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define FLOOD_RING_SIZE 64
+#define FLOOD_NUM_PKTS 1000
+#define FLOOD_PKT_SIZE 128
+
+static int
+test_imissed_counter(int port)
+{
+ struct rte_eth_stats stats_before, stats_after, stats_after_reset;
+ struct rte_ether_addr mac;
+ uint8_t pkt[FLOOD_PKT_SIZE];
+ int sock = -1;
+ int ret = -1;
+
+ printf("Testing rtap PMD imissed counter port %d\n", port);
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 1, NULL, FLOOD_RING_SIZE, mp) < 0)
+ goto restore;
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto restore;
+
+ drain_rx_queue(port, 0);
+
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf("Failed to reset stats: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ ret = rte_eth_stats_get(port, &stats_before);
+ if (ret != 0) {
+ printf("Failed to get baseline stats: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ printf(" Flooding with %d packets (ring size %d)\n",
+ FLOOD_NUM_PKTS, FLOOD_RING_SIZE);
+
+ build_test_packet(pkt, FLOOD_PKT_SIZE, &mac, &mac);
+
+ for (int i = 0; i < FLOOD_NUM_PKTS; i++) {
+ if (send(sock, pkt, FLOOD_PKT_SIZE, 0) < 0) {
+ printf("send() failed after %d packets: %s\n",
+ i, strerror(errno));
+ goto close_sock;
+ }
+ }
+
+ usleep(100000); /* 100ms */
+
+ /* Drain whatever we can receive */
+ {
+ struct rte_mbuf *burst[RX_BURST_MAX];
+ uint16_t total_rx = 0;
+ int attempts = 0;
+
+ while (attempts++ < 100) {
+ uint16_t n = rte_eth_rx_burst(port, 0, burst, RX_BURST_MAX);
+ if (n > 0) {
+ rte_pktmbuf_free_bulk(burst, n);
+ total_rx += n;
+ } else {
+ usleep(1000);
+ }
+ }
+ printf(" Received %u packets out of %d sent\n",
+ total_rx, FLOOD_NUM_PKTS);
+ }
+
+ ret = rte_eth_stats_get(port, &stats_after);
+ if (ret != 0) {
+ printf("Failed to get stats after flood: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ printf(" Stats: ipackets=%"PRIu64" imissed=%"PRIu64"\n",
+ stats_after.ipackets, stats_after.imissed);
+
+ if (stats_after.ipackets == 0) {
+ printf(" ERROR: No packets received\n");
+ goto close_sock;
+ }
+
+ if (stats_after.imissed == 0) {
+ printf(" WARNING: No packets marked as missed\n");
+ } else {
+ printf(" SUCCESS: imissed counter working (%"PRIu64" drops)\n",
+ stats_after.imissed);
+ }
+
+ /* Test stats_reset clears imissed counter */
+ printf(" Testing stats_reset for imissed counter\n");
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf(" ERROR: stats_reset failed: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ ret = rte_eth_stats_get(port, &stats_after_reset);
+ if (ret != 0) {
+ printf(" ERROR: stats_get after reset failed: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ if (stats_after_reset.imissed != 0 || stats_after_reset.ipackets != 0) {
+ printf(" ERROR: stats not reset properly\n");
+ goto close_sock;
+ }
+
+ printf(" Stats reset: OK (all counters zeroed)\n");
+ printf(" imissed counter test PASSED\n");
+ ret = 0;
+
+close_sock:
+ close(sock);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+#define LSC_TIMEOUT_US 500000 /* 500ms */
+#define LSC_POLL_US 1000 /* 1ms between polls */
+
+#define RXQ_INTR_TIMEOUT_MS 500 /* 500ms */
+
+static RTE_ATOMIC(int) lsc_event_count;
+static RTE_ATOMIC(int) lsc_last_status;
+
+static int
+test_lsc_callback(uint16_t port_id, enum rte_eth_event_type type,
+ void *param __rte_unused, void *ret_param __rte_unused)
+{
+ struct rte_eth_link link;
+
+ if (type != RTE_ETH_EVENT_INTR_LSC)
+ return 0;
+
+ if (rte_eth_link_get_nowait(port_id, &link) < 0) {
+ printf(" Link get nowait failed\n");
+ return 0;
+ }
+
+ rte_atomic_store_explicit(&lsc_last_status, link.link_status, rte_memory_order_relaxed);
+ rte_atomic_fetch_add_explicit(&lsc_event_count, 1, rte_memory_order_seq_cst);
+
+ printf(" LSC event #%d: port %u link %s\n",
+ rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_relaxed),
+ port_id,
+ link.link_status ? "UP" : "DOWN");
+
+ return 0;
+}
+
+static int
+test_lsc_interrupt(int port)
+{
+ struct rte_eth_conf lsc_conf;
+ int initial_count;
+ int ret = -1;
+
+ printf("Testing rtap PMD link state interrupt port %d\n", port);
+
+ memset(&lsc_conf, 0, sizeof(lsc_conf));
+ lsc_conf.intr_conf.lsc = 1;
+
+ if (port_reconfigure(port, 1, &lsc_conf, RING_SIZE, mp) < 0)
+ goto restore;
+
+ ret = rte_eth_dev_callback_register(port, RTE_ETH_EVENT_INTR_LSC,
+ test_lsc_callback, NULL);
+ if (ret < 0) {
+ printf("Failed to register LSC callback: %s\n",
+ rte_strerror(-ret));
+ goto restore;
+ }
+
+ rte_atomic_store_explicit(&lsc_event_count, 0, rte_memory_order_relaxed);
+ rte_atomic_store_explicit(&lsc_last_status, -1, rte_memory_order_relaxed);
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0) {
+ ret = -1;
+ goto stop;
+ }
+
+ printf(" Link is UP, setting link DOWN\n");
+ initial_count = rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_seq_cst);
+
+ ret = rte_eth_dev_set_link_down(port);
+ if (ret < 0) {
+ printf("Set link down failed: %s\n", rte_strerror(-ret));
+ goto stop;
+ }
+
+ if (wait_for_event(&lsc_event_count, initial_count, LSC_TIMEOUT_US) < 0) {
+ printf(" No LSC event received for link DOWN after %d us\n",
+ LSC_TIMEOUT_US);
+ if (verify_link_status(port, RTE_ETH_LINK_DOWN) < 0) {
+ ret = -1;
+ goto stop;
+ }
+ printf(" Link status is DOWN (verified via polling)\n");
+ } else {
+ printf(" LSC event received for link DOWN\n");
+ if (rte_atomic_load_explicit(&lsc_last_status, rte_memory_order_seq_cst) != RTE_ETH_LINK_DOWN) {
+ printf(" ERROR: expected DOWN status in callback\n");
+ ret = -1;
+ goto stop;
+ }
+ }
+
+ printf(" Setting link UP\n");
+ initial_count = rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_seq_cst);
+
+ ret = rte_eth_dev_set_link_up(port);
+ if (ret < 0) {
+ printf("Set link up failed: %s\n", rte_strerror(-ret));
+ goto stop;
+ }
+
+ if (wait_for_event(&lsc_event_count, initial_count, LSC_TIMEOUT_US) < 0) {
+ printf(" No LSC event received for link UP after %d us\n",
+ LSC_TIMEOUT_US);
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0) {
+ ret = -1;
+ goto stop;
+ }
+ printf(" Link status is UP (verified via polling)\n");
+ } else {
+ printf(" LSC event received for link UP\n");
+ if (rte_atomic_load_explicit(&lsc_last_status, rte_memory_order_seq_cst) != RTE_ETH_LINK_UP) {
+ printf(" ERROR: expected UP status in callback\n");
+ ret = -1;
+ goto stop;
+ }
+ }
+
+ printf(" LSC interrupt test PASSED (total events: %d)\n",
+ rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_relaxed));
+ ret = 0;
+
+stop:
+ rte_eth_dev_stop(port);
+ rte_eth_dev_callback_unregister(port, RTE_ETH_EVENT_INTR_LSC,
+ test_lsc_callback, NULL);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+static int
+test_rxq_interrupt(int port)
+{
+ struct rte_eth_conf rxq_conf;
+ struct rte_ether_addr mac;
+ uint8_t pkt[RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN];
+ int sock = -1;
+ int epfd = -1;
+ int ret = -1;
+
+ printf("Testing rtap PMD RX queue interrupt port %d\n", port);
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ memset(&rxq_conf, 0, sizeof(rxq_conf));
+ rxq_conf.intr_conf.rxq = 1;
+
+ if (port_reconfigure(port, 1, &rxq_conf, RING_SIZE, mp) < 0)
+ goto restore;
+
+ /* Enable interrupt for queue 0 */
+ ret = rte_eth_dev_rx_intr_enable(port, 0);
+ if (ret < 0) {
+ printf(" rx_intr_enable failed: %s\n", rte_strerror(-ret));
+ goto restore;
+ }
+
+ /* Add queue 0's eventfd to the per-thread epoll set */
+ ret = rte_eth_dev_rx_intr_ctl_q(port, 0, RTE_EPOLL_PER_THREAD,
+ RTE_INTR_EVENT_ADD, NULL);
+ if (ret < 0) {
+ printf(" rx_intr_ctl_q(ADD) failed: %s\n", rte_strerror(-ret));
+ printf(" (epoll may not be available in this environment)\n");
+ epfd = -1;
+ } else {
+ epfd = RTE_EPOLL_PER_THREAD;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto disable_intr;
+
+ drain_rx_queue(port, 0);
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+
+ printf(" Injecting test packet\n");
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto close_sock;
+ }
+
+ /* Wait for the Rx interrupt via epoll */
+ if (epfd != -1) {
+ struct rte_epoll_event event;
+ int nfds;
+
+ nfds = rte_epoll_wait(epfd, &event, 1, RXQ_INTR_TIMEOUT_MS);
+ if (nfds < 0) {
+ printf(" rte_epoll_wait failed: %s\n",
+ rte_strerror(-nfds));
+ printf(" (Falling back to polling verification)\n");
+ } else if (nfds == 0) {
+ printf(" WARNING: epoll timeout - no Rx interrupt received\n");
+ printf(" (This may be expected in some test environments)\n");
+ } else {
+ printf(" Rx interrupt received via epoll: OK\n");
+ }
+ }
+
+ /* Verify the packet actually arrived */
+ {
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+ uint16_t nb_rx = 0;
+ int elapsed = 0;
+
+ while (elapsed < TX_RX_TIMEOUT_US) {
+ nb_rx = rte_eth_rx_burst(port, 0, rx_mbufs, RX_BURST_MAX);
+ if (nb_rx > 0) {
+ rte_pktmbuf_free_bulk(rx_mbufs, nb_rx);
+ printf(" Packet received successfully\n");
+ break;
+ }
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ if (nb_rx == 0) {
+ printf(" ERROR: No packet received\n");
+ ret = -1;
+ goto close_sock;
+ }
+ }
+
+ printf(" RX queue interrupt test PASSED\n");
+ ret = 0;
+
+close_sock:
+ close(sock);
+
+disable_intr:
+ rte_eth_dev_rx_intr_disable(port, 0);
+
+ if (epfd != -1)
+ rte_eth_dev_rx_intr_ctl_q(port, 0, RTE_EPOLL_PER_THREAD,
+ RTE_INTR_EVENT_DEL, NULL);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+static int
+test_fd_leak(void)
+{
+ int fd_before, fd_after;
+ int port = -1;
+ int ret;
+
+ printf("Testing rtap PMD file descriptor leak\n");
+
+ fd_before = count_open_fds();
+ if (fd_before < 0) {
+ printf("Cannot count open fds\n");
+ return -1;
+ }
+
+ printf(" Open fds before: %d\n", fd_before);
+
+ if (rte_vdev_init("net_rtap_fdtest", "iface=rtap_fdtest") < 0) {
+ printf("Failed to create net_rtap_fdtest\n");
+ return -1;
+ }
+
+ uint16_t p;
+ RTE_ETH_FOREACH_DEV(p) {
+ struct rte_eth_dev_info info;
+ if (rte_eth_dev_info_get(p, &info) != 0)
+ continue;
+ if (p == (uint16_t)rtap_port0 || p == (uint16_t)rtap_port1)
+ continue;
+ if (strstr(info.driver_name, "rtap") != NULL) {
+ port = p;
+ break;
+ }
+ }
+
+ if (port < 0) {
+ printf("Failed to find fd-test port\n");
+ rte_vdev_uninit("net_rtap_fdtest");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ goto cleanup;
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0)
+ printf("Warning: stop returned %d\n", ret);
+
+ rte_eth_dev_close(port);
+ rte_vdev_uninit("net_rtap_fdtest");
+
+ fd_after = count_open_fds();
+ printf(" Open fds after: %d\n", fd_after);
+
+ if (fd_after != fd_before) {
+ printf(" ERROR: fd leak detected: %d fds leaked\n",
+ fd_after - fd_before);
+ return -1;
+ }
+
+ printf(" fd leak test PASSED\n");
+ return 0;
+
+cleanup:
+ rte_eth_dev_stop(port);
+ rte_eth_dev_close(port);
+ rte_vdev_uninit("net_rtap_fdtest");
+ return -1;
+}
+
+static void
+test_rtap_cleanup(void)
+{
+ int ret;
+
+ if (rtap_port0 >= 0) {
+ ret = rte_eth_dev_stop(rtap_port0);
+ if (ret != 0)
+ printf("Error: failed to stop port %d: %s\n",
+ rtap_port0, rte_strerror(-ret));
+ rte_eth_dev_close(rtap_port0);
+ }
+
+ if (rtap_port1 >= 0) {
+ ret = rte_eth_dev_stop(rtap_port1);
+ if (ret != 0)
+ printf("Error: failed to stop port %d: %s\n",
+ rtap_port1, rte_strerror(-ret));
+ rte_eth_dev_close(rtap_port1);
+ }
+
+ rte_mempool_free(mp);
+ rte_vdev_uninit("net_rtap0");
+ rte_vdev_uninit("net_rtap1");
+}
+
+static int
+test_pmd_rtap_setup(void)
+{
+ uint16_t nb_ports;
+
+ if (check_rtap_available() < 0) {
+ printf("RTAP not available, skipping tests\n");
+ return TEST_SKIPPED;
+ }
+
+ nb_ports = rte_eth_dev_count_avail();
+ printf("nb_ports before rtap creation=%d\n", (int)nb_ports);
+
+ mp = rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF, 32,
+ 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
+ if (mp == NULL) {
+ printf("Failed to create mempool\n");
+ return TEST_FAILED;
+ }
+
+ if (rte_vdev_init("net_rtap0", "iface=rtap_test0") < 0) {
+ printf("Failed to create net_rtap0\n");
+ rte_mempool_free(mp);
+ return TEST_FAILED;
+ }
+
+ if (rte_vdev_init("net_rtap1", "iface=rtap_test1") < 0) {
+ printf("Failed to create net_rtap1\n");
+ rte_vdev_uninit("net_rtap0");
+ rte_mempool_free(mp);
+ return TEST_FAILED;
+ }
+
+ uint16_t port;
+ RTE_ETH_FOREACH_DEV(port) {
+ struct rte_eth_dev_info dev_info;
+ int ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0)
+ continue;
+
+ if (strstr(dev_info.driver_name, "rtap") != NULL ||
+ strstr(dev_info.driver_name, "RTAP") != NULL) {
+ if (rtap_port0 < 0)
+ rtap_port0 = port;
+ else if (rtap_port1 < 0)
+ rtap_port1 = port;
+ }
+ }
+
+ if (rtap_port0 < 0) {
+ printf("Failed to find rtap ports\n");
+ test_rtap_cleanup();
+ return TEST_FAILED;
+ }
+
+ printf("rtap_port0=%d rtap_port1=%d\n", rtap_port0, rtap_port1);
+ return TEST_SUCCESS;
+}
+
+static int
+test_ethdev_configure_ports(void)
+{
+ TEST_ASSERT((test_ethdev_configure_port(rtap_port0) == 0),
+ "test ethdev configure port rtap_port0 failed");
+
+ if (rtap_port1 >= 0) {
+ TEST_ASSERT((test_ethdev_configure_port(rtap_port1) == 0),
+ "test ethdev configure port rtap_port1 failed");
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_command_line_rtap_port(void)
+{
+ int port, cmdl_port = -1;
+ int ret;
+
+ printf("Testing command line created rtap port\n");
+
+ RTE_ETH_FOREACH_DEV(port) {
+ struct rte_eth_dev_info dev_info;
+
+ ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0)
+ continue;
+
+ if (port == rtap_port0 || port == rtap_port1)
+ continue;
+
+ if (strstr(dev_info.driver_name, "rtap") != NULL ||
+ strstr(dev_info.driver_name, "RTAP") != NULL) {
+ printf("Found command line rtap port=%d\n", port);
+ cmdl_port = port;
+ break;
+ }
+ }
+
+ if (cmdl_port != -1) {
+ TEST_ASSERT((test_ethdev_configure_port(cmdl_port) == 0),
+ "test ethdev configure cmdl_port failed");
+ TEST_ASSERT((test_stats_reset(cmdl_port) == 0),
+ "test stats reset cmdl_port failed");
+ TEST_ASSERT((test_get_stats(cmdl_port) == 0),
+ "test get stats cmdl_port failed");
+ TEST_ASSERT((rte_eth_dev_stop(cmdl_port) == 0),
+ "test stop cmdl_port failed");
+ }
+
+ return TEST_SUCCESS;
+}
+
+/* Test case wrappers */
+#define TEST_CASE_WRAPPER(name, func) \
+ static int test_##name##_for_port(void) { \
+ TEST_ASSERT(func(rtap_port0) == 0, #name " failed"); \
+ return TEST_SUCCESS; \
+ }
+
+TEST_CASE_WRAPPER(get_stats, test_get_stats)
+TEST_CASE_WRAPPER(stats_reset, test_stats_reset)
+TEST_CASE_WRAPPER(dev_info, test_dev_info)
+TEST_CASE_WRAPPER(link_status, test_link_status)
+TEST_CASE_WRAPPER(link_up_down, test_set_link_up_down)
+TEST_CASE_WRAPPER(promiscuous, test_promiscuous_mode)
+TEST_CASE_WRAPPER(allmulticast, test_allmulticast_mode)
+TEST_CASE_WRAPPER(mac_address, test_mac_address)
+TEST_CASE_WRAPPER(mtu, test_mtu_set)
+TEST_CASE_WRAPPER(multiqueue, test_multiqueue)
+TEST_CASE_WRAPPER(multiqueue_reduce, test_multiqueue_reduce)
+TEST_CASE_WRAPPER(multiqueue_mismatch, test_multiqueue_mismatch)
+TEST_CASE_WRAPPER(queue_reconfigure, test_queue_reconfigure)
+TEST_CASE_WRAPPER(rx_inject, test_rx_inject)
+TEST_CASE_WRAPPER(tx_capture, test_tx_capture)
+TEST_CASE_WRAPPER(tx_multiseg, test_tx_multiseg)
+TEST_CASE_WRAPPER(rx_multiseg, test_rx_multiseg)
+TEST_CASE_WRAPPER(offload_config, test_offload_config)
+TEST_CASE_WRAPPER(tx_csum_offload, test_tx_csum_offload)
+TEST_CASE_WRAPPER(stats_imissed, test_imissed_counter)
+TEST_CASE_WRAPPER(lsc_interrupt, test_lsc_interrupt)
+TEST_CASE_WRAPPER(rxq_interrupt, test_rxq_interrupt)
+
+static int
+test_fd_leak_for_port(void)
+{
+ TEST_ASSERT(test_fd_leak() == 0, "test fd leak failed");
+ return TEST_SUCCESS;
+}
+
+static struct unit_test_suite test_pmd_rtap_suite = {
+ .setup = test_pmd_rtap_setup,
+ .teardown = test_rtap_cleanup,
+ .suite_name = "Test Pmd RTAP Unit Test Suite",
+ .unit_test_cases = {
+ TEST_CASE(test_ethdev_configure_ports),
+ TEST_CASE(test_dev_info_for_port),
+ TEST_CASE(test_link_status_for_port),
+ TEST_CASE(test_link_up_down_for_port),
+ TEST_CASE(test_get_stats_for_port),
+ TEST_CASE(test_stats_reset_for_port),
+ TEST_CASE(test_stats_imissed_for_port),
+ TEST_CASE(test_promiscuous_for_port),
+ TEST_CASE(test_allmulticast_for_port),
+ TEST_CASE(test_mac_address_for_port),
+ TEST_CASE(test_mtu_for_port),
+ TEST_CASE(test_multiqueue_for_port),
+ TEST_CASE(test_multiqueue_reduce_for_port),
+ TEST_CASE(test_multiqueue_mismatch_for_port),
+ TEST_CASE(test_queue_reconfigure_for_port),
+ TEST_CASE(test_rx_inject_for_port),
+ TEST_CASE(test_tx_capture_for_port),
+ TEST_CASE(test_tx_multiseg_for_port),
+ TEST_CASE(test_rx_multiseg_for_port),
+ TEST_CASE(test_offload_config_for_port),
+ TEST_CASE(test_tx_csum_offload_for_port),
+ TEST_CASE(test_lsc_interrupt_for_port),
+ TEST_CASE(test_rxq_interrupt_for_port),
+ TEST_CASE(test_fd_leak_for_port),
+ TEST_CASE(test_command_line_rtap_port),
+ TEST_CASES_END()
+ }
+};
+
+static int
+test_pmd_rtap(void)
+{
+ return unit_test_suite_runner(&test_pmd_rtap_suite);
+}
+
+REGISTER_FAST_TEST(rtap_pmd_autotest, NOHUGE_OK, ASAN_OK, test_pmd_rtap);
--
2.51.0
* RE: [PATCH v5 00/10] net/rtap: add io_uring based TAP driver
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
` (9 preceding siblings ...)
2026-02-09 18:39 ` [PATCH v5 10/10] test: add unit tests for rtap PMD Stephen Hemminger
@ 2026-02-10 9:18 ` Morten Brørup
10 siblings, 0 replies; 72+ messages in thread
From: Morten Brørup @ 2026-02-10 9:18 UTC (permalink / raw)
To: Stephen Hemminger, dev
> This series adds net_rtap, an experimental poll mode driver that uses
> Linux io_uring for asynchronous packet I/O with kernel TAP interfaces.
>
> Like net_tap, net_rtap creates a kernel network interface visible to
> standard tools (ip, ethtool) and the Linux TCP/IP stack. From DPDK
> it is an ordinary ethdev.
Robin Jarry recently added Network Namespace support to the TAP driver [1].
Does the rtap driver support Network Namespaces?
[1]: https://inbox.dpdk.org/dev/20251030175537.219641-7-rjarry@redhat.com/
> Requirements:
> - Kernel headers with IORING_ASYNC_CANCEL_ALL (upstream since 5.19)
The DPDK 25.11 minimum requirement [2] is Kernel 5.4.
Since this driver requires a higher version, it should be prominently mentioned in relevant documentation.
In December 2026, shortly after the release of DPDK 26.11, the oldest supported Kernel LTS version will be 6.1. For now, it is 5.10.
[2]: https://doc.dpdk.org/guides-25.11/linux_gsg/sys_reqs.html#system-software
> - liburing >= 2.0
>
> Known working distributions: Debian 12+, Ubuntu 24.04+,
> Fedora 37+, SLES 15 SP6+ / openSUSE Tumbleweed.
> RHEL 9 is not supported (io_uring is disabled by default).
* [PATCH v6 00/11] net/rtap: add io_uring based TAP driver
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
` (12 preceding siblings ...)
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 01/11] net/rtap: add driver skeleton and documentation Stephen Hemminger
` (11 more replies)
13 siblings, 12 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
This series adds net_rtap, an experimental poll mode driver that uses
Linux io_uring for asynchronous packet I/O with kernel TAP interfaces.
Like net_tap, net_rtap creates a kernel network interface visible to
standard tools (ip, ethtool) and the Linux TCP/IP stack. From DPDK
it is an ordinary ethdev.
Motivation
----------
This driver started as an experiment to determine whether Linux
io_uring could deliver better packet I/O performance than the
traditional read()/write() system calls used by net_tap. By posting
batches of I/O requests asynchronously, io_uring amortizes system
call overhead across multiple packets.
The project also served as a testbed for using AI tooling to help
build a comprehensive test suite, refactor code, and improve
documentation. The result is intended as an example for other PMD
authors: the driver has thorough unit tests covering data path,
offloads, multi-queue, fd lifecycle, and more, along with detailed
code comments explaining design choices.
Why not extend net_tap?
-----------------------
The existing net_tap driver was designed to provide feature parity
with mlx5 when used behind the failsafe PMD. That goal led to
significant complexity: rte_flow support emulated via eBPF programs,
software GSO implementation, and other features that duplicate in
user space what the kernel already does.
net_rtap takes the opposite approach -- use the kernel efficiently
and let it do what it does well. There is no rte_flow support;
receive queue selection is left to the kernel's native RSS/steering.
There is no software GSO; the driver passes segmentation requests
to the kernel via the virtio-net header and lets the kernel handle
it. The result is a much simpler driver that is easier to maintain
and reason about.
Given these fundamentally different design goals, a clean
implementation was more practical than refactoring net_tap.
Acknowledgement
---------------
Parts of the test suite, code review, and refactoring were done
with the assistance of Anthropic Claude (AI). All generated code
was reviewed and tested by the author.
Requirements:
- Kernel headers with IORING_ASYNC_CANCEL_ALL (upstream since 5.19)
- liburing >= 2.0
Should work on distributions: Debian 12+, Ubuntu 24.04+,
Fedora 37+, SLES 15 SP6+ / openSUSE Tumbleweed.
RHEL 9 is not supported (io_uring is disabled by default).
v6:
- fix lots of bugs found doing automated review
- convert to use ifindex and netlink to avoid rename issues
- implement xstats
Stephen Hemminger (11):
net/rtap: add driver skeleton and documentation
net/rtap: add TAP device creation and queue management
net/rtap: add Rx/Tx with scatter/gather support
net/rtap: add statistics and device info
net/rtap: add link and device management operations
net/rtap: add checksum and TSO offload support
net/rtap: add multi-process support
net/rtap: add link state change interrupt
net/rtap: add Rx interrupt support
net/rtap: add extended statistics support
test: add unit tests for rtap PMD
MAINTAINERS | 7 +
app/test/meson.build | 1 +
app/test/test_pmd_rtap.c | 2620 ++++++++++++++++++++++++
doc/guides/nics/features/rtap.ini | 26 +
doc/guides/nics/index.rst | 1 +
doc/guides/nics/rtap.rst | 101 +
doc/guides/rel_notes/release_26_03.rst | 7 +
drivers/net/meson.build | 1 +
drivers/net/rtap/meson.build | 30 +
drivers/net/rtap/rtap.h | 152 ++
drivers/net/rtap/rtap_ethdev.c | 864 ++++++++
drivers/net/rtap/rtap_intr.c | 207 ++
drivers/net/rtap/rtap_netlink.c | 445 ++++
drivers/net/rtap/rtap_rxtx.c | 803 ++++++++
drivers/net/rtap/rtap_xstats.c | 293 +++
15 files changed, 5558 insertions(+)
create mode 100644 app/test/test_pmd_rtap.c
create mode 100644 doc/guides/nics/features/rtap.ini
create mode 100644 doc/guides/nics/rtap.rst
create mode 100644 drivers/net/rtap/meson.build
create mode 100644 drivers/net/rtap/rtap.h
create mode 100644 drivers/net/rtap/rtap_ethdev.c
create mode 100644 drivers/net/rtap/rtap_intr.c
create mode 100644 drivers/net/rtap/rtap_netlink.c
create mode 100644 drivers/net/rtap/rtap_rxtx.c
create mode 100644 drivers/net/rtap/rtap_xstats.c
--
2.51.0
* [PATCH v6 01/11] net/rtap: add driver skeleton and documentation
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 02/11] net/rtap: add TAP device creation and queue management Stephen Hemminger
` (10 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Thomas Monjalon, Anatoly Burakov
Add the initial skeleton for the rtap poll mode driver, a virtual
ethernet device that uses Linux io_uring for packet I/O with kernel
TAP devices.
This patch includes:
- MAINTAINERS entry
- Driver documentation (doc/guides/nics/rtap.rst)
- Feature matrix (doc/guides/nics/features/rtap.ini)
- Release notes update
- Meson build integration with liburing dependency
- Header file with shared data structures and declarations
- Stub probe/remove handlers that register the vdev driver
- Empty dev_ops with only dev_close implemented
The driver registers as net_rtap and is Linux-only.
Requires the liburing library version 2.0 or later.
Earlier versions have known security and build issues.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
MAINTAINERS | 7 +
doc/guides/nics/features/rtap.ini | 13 ++
doc/guides/nics/index.rst | 1 +
doc/guides/nics/rtap.rst | 101 ++++++++++++++
doc/guides/rel_notes/release_26_03.rst | 7 +
drivers/net/meson.build | 1 +
drivers/net/rtap/meson.build | 26 ++++
drivers/net/rtap/rtap.h | 81 +++++++++++
drivers/net/rtap/rtap_ethdev.c | 177 +++++++++++++++++++++++++
9 files changed, 414 insertions(+)
create mode 100644 doc/guides/nics/features/rtap.ini
create mode 100644 doc/guides/nics/rtap.rst
create mode 100644 drivers/net/rtap/meson.build
create mode 100644 drivers/net/rtap/rtap.h
create mode 100644 drivers/net/rtap/rtap_ethdev.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 25fb109ef4..45721c9d03 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1135,6 +1135,13 @@ F: doc/guides/nics/ring.rst
F: app/test/test_pmd_ring.c
F: app/test/test_pmd_ring_perf.c
+Rtap PMD - EXPERIMENTAL
+M: Stephen Hemminger <stephen@networkplumber.org>
+F: drivers/net/rtap/
+F: app/test/test_pmd_rtap.c
+F: doc/guides/nics/rtap.rst
+F: doc/guides/nics/features/rtap.ini
+
Null Networking PMD
M: Tetsuya Mukawa <mtetsuyah@gmail.com>
F: drivers/net/null/
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
new file mode 100644
index 0000000000..ed7c638029
--- /dev/null
+++ b/doc/guides/nics/features/rtap.ini
@@ -0,0 +1,13 @@
+;
+; Supported features of the 'rtap' driver.
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+[Features]
+Linux = Y
+ARMv7 = Y
+ARMv8 = Y
+Power8 = Y
+x86-32 = Y
+x86-64 = Y
+Usage doc = Y
diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst
index cb818284fe..24746596b7 100644
--- a/doc/guides/nics/index.rst
+++ b/doc/guides/nics/index.rst
@@ -66,6 +66,7 @@ Network Interface Controller Drivers
r8169
ring
rnp
+ rtap
sfc_efx
softnic
tap
diff --git a/doc/guides/nics/rtap.rst b/doc/guides/nics/rtap.rst
new file mode 100644
index 0000000000..4bb964128b
--- /dev/null
+++ b/doc/guides/nics/rtap.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+RTAP Poll Mode Driver
+=======================
+
+The RTAP Poll Mode Driver (PMD) is similar to the TAP PMD. It is a
+virtual device that uses Linux io_uring for efficient packet I/O with
+the Linux kernel.
+It is useful when writing DPDK applications that need to support interaction
+with the Linux TCP/IP stack for control plane or tunneling.
+
+The RTAP PMD creates a kernel network device that can be
+managed by standard tools such as ``ip`` and ``ethtool`` commands.
+
+From a DPDK application, the RTAP device looks like a DPDK ethdev.
+It supports the standard DPDK APIs to query for information, statistics,
+and send/receive packets.
+
+Features
+--------
+
+- Uses io_uring for asynchronous packet I/O via read/write and readv/writev
+- TX offloads: multi-segment, UDP checksum, TCP checksum, TCP segmentation (TSO)
+- RX offloads: UDP checksum, TCP checksum, TCP LRO, scatter
+- Virtio net header support for offload negotiation with the kernel
+- Multi-queue support (up to 128 queues)
+- Multi-process support (secondary processes receive queue fds from primary)
+- Link state change notification via netlink
+- Rx interrupt support for power-aware applications (eventfd per queue)
+- Promiscuous and allmulticast mode
+- MAC address configuration
+- MTU update
+- Link up/down control
+- Basic and per-queue statistics
+
+Requirements
+------------
+
+- **liburing >= 2.0**. Earlier versions have known security and build issues.
+
+- The kernel must support ``IORING_ASYNC_CANCEL_ALL`` (upstream since 5.19).
+ The meson build checks for this symbol and will not build the driver
+ if the installed kernel headers do not provide it. Because enterprise
+ distributions backport features independently of version numbers,
+ the driver avoids hard-coding a kernel version check.
+
+Known working distributions:
+
+- Debian 12 (Bookworm) or later
+- Ubuntu 24.04 (Noble) or later
+- Fedora 37 or later
+- SUSE Linux Enterprise 15 SP6 or later / openSUSE Tumbleweed
+
+RHEL 9 ships io_uring only as a Technology Preview (disabled by default)
+and is not supported.
+
+For more info on io_uring, please see:
+
+- `io_uring on Wikipedia <https://en.wikipedia.org/wiki/Io_uring>`_
+- `liburing on GitHub <https://github.com/axboe/liburing>`_
+
+
+Arguments
+---------
+
+RTAP devices are created with the ``--vdev=net_rtap0`` command line option.
+Multiple devices can be created by repeating the option with different device names
+(``net_rtap1``, ``net_rtap2``, etc.).
+
+By default, the Linux interfaces are named ``rtap0``, ``rtap1``, etc.
+The interface name can be specified by adding the ``iface=foo0`` argument, for example::
+
+ --vdev=net_rtap0,iface=io0 --vdev=net_rtap1,iface=io1 ...
+
+The PMD inherits the MAC address assigned by the kernel which will be
+a locally assigned random Ethernet address.
+
+Normally, when the DPDK application exits, the RTAP device is removed.
+But this behavior can be overridden by the use of the persist flag, which
+causes the kernel network interface to survive application exit. Example::
+
+ --vdev=net_rtap0,iface=io0,persist ...
+
+
+Limitations
+-----------
+
+- The kernel must have io_uring support with ``IORING_ASYNC_CANCEL_ALL``
+ (upstream since 5.19, but may be backported by distributions).
+ io_uring support may also be disabled in some environments or by security policies
+ (for example, Docker disables io_uring in its default seccomp profile,
+ and RHEL 9 disables it via ``kernel.io_uring_disabled`` sysctl).
+
+- Since the RTAP device uses a single file descriptor per queue pair to talk to
+ the kernel, the same number of queues must be specified for receive and transmit.
+
+- The maximum number of queues is 128.
+
+- No flow support. Receive queue selection for incoming packets is determined
+ by the Linux kernel. See kernel documentation for more info:
+ https://www.kernel.org/doc/html/latest/networking/scaling.html
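One way to check the security-policy case from the shell: the ``kernel.io_uring_disabled`` sysctl (added in kernel 6.6 and backported by some distributions, including RHEL 9) reports whether io_uring has been turned off administratively. An illustrative check:

```shell
# 0 = enabled, 1 = restricted to privileged processes, 2 = disabled entirely.
# The sysctl does not exist on kernels without the knob.
sysctl -n kernel.io_uring_disabled 2>/dev/null || echo "knob not present"
```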
diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst
index afdf1af06c..40320b0101 100644
--- a/doc/guides/rel_notes/release_26_03.rst
+++ b/doc/guides/rel_notes/release_26_03.rst
@@ -87,6 +87,13 @@ New Features
* Added support for AES-XTS cipher algorithm.
* Added support for SHAKE-128 and SHAKE-256 authentication algorithms.
+* **Added rtap virtual ethernet driver.**
+
+ Added a new experimental virtual device driver that uses Linux io_uring
+ for packet injection into the kernel network stack.
It requires Linux kernel 5.19 or later for IORING_ASYNC_CANCEL_ALL
+ and liburing 2.0 or later.
+
Removed Items
-------------
diff --git a/drivers/net/meson.build b/drivers/net/meson.build
index c7dae4ad27..ef1ee68385 100644
--- a/drivers/net/meson.build
+++ b/drivers/net/meson.build
@@ -56,6 +56,7 @@ drivers = [
'r8169',
'ring',
'rnp',
+ 'rtap',
'sfc',
'softnic',
'tap',
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
new file mode 100644
index 0000000000..7bd7806ef3
--- /dev/null
+++ b/drivers/net/rtap/meson.build
@@ -0,0 +1,26 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Stephen Hemminger
+
+if not is_linux
+ build = false
+ reason = 'only supported on Linux'
+endif
+
+liburing = dependency('liburing', version: '>= 2.0', required: false)
+if not liburing.found()
+ build = false
+ reason = 'missing dependency, "liburing"'
+endif
+
+if build and not cc.has_header_symbol('linux/io_uring.h', 'IORING_ASYNC_CANCEL_ALL')
+ build = false
+ reason = 'kernel headers missing IORING_ASYNC_CANCEL_ALL (need kernel >= 5.19 headers)'
+endif
+
+sources = files(
+ 'rtap_ethdev.c',
+)
+
+ext_deps += liburing
+
+require_iova_in_mbuf = false
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
new file mode 100644
index 0000000000..9004953e04
--- /dev/null
+++ b/drivers/net/rtap/rtap.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#ifndef _RTAP_H_
+#define _RTAP_H_
+
+#include <errno.h>
+#include <stdint.h>
+#include <liburing.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_ether.h>
+
+extern int rtap_logtype;
+#define RTE_LOGTYPE_RTAP rtap_logtype
+#define PMD_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s(): ", __func__, __VA_ARGS__)
+
+#define PMD_LOG_ERRNO(level, fmt, ...) \
+ RTE_LOG_LINE(level, RTAP, "%s(): " fmt ": %s", __func__, ## __VA_ARGS__, strerror(errno))
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+#define PMD_RX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s() rx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_RX_LOG(...) do { } while (0)
+#endif
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+#define PMD_TX_LOG(level, ...) \
+ RTE_LOG_LINE_PREFIX(level, RTAP, "%s() tx: ", __func__, __VA_ARGS__)
+#else
+#define PMD_TX_LOG(...) do { } while (0)
+#endif
+
+struct rtap_rx_queue {
+ struct rte_mempool *mb_pool; /* rx buffer pool */
+ struct io_uring io_ring; /* queue of posted read's */
+ uint16_t port_id;
+ uint16_t queue_id;
+
+ uint64_t rx_packets;
+ uint64_t rx_bytes;
+ uint64_t rx_errors;
+} __rte_cache_aligned;
+
+struct rtap_tx_queue {
+ struct io_uring io_ring;
+ uint16_t port_id;
+ uint16_t queue_id;
+ uint16_t free_thresh;
+
+ uint64_t tx_packets;
+ uint64_t tx_bytes;
+ uint64_t tx_errors;
+} __rte_cache_aligned;
+
+struct rtap_pmd {
+ int keep_fd; /* keep alive file descriptor */
+ int if_index; /* interface index */
+ int nlsk_fd; /* netlink control socket */
+ struct rte_ether_addr eth_addr; /* address assigned by kernel */
+};
+
+/* rtap_netlink.c */
+int rtap_nl_open(unsigned int groups);
+struct rte_eth_dev;
+void rtap_nl_recv(int fd, struct rte_eth_dev *dev);
+int rtap_nl_get_flags(int nlsk_fd, int if_index, unsigned int *flags);
+int rtap_nl_change_flags(int nlsk_fd, int if_index,
+ unsigned int flags, unsigned int mask);
+int rtap_nl_set_mtu(int nlsk_fd, int if_index, uint16_t mtu);
+int rtap_nl_set_mac(int nlsk_fd, int if_index,
+ const struct rte_ether_addr *addr);
+int rtap_nl_get_mac(int nlsk_fd, int if_index, struct rte_ether_addr *addr);
+struct rtnl_link_stats64;
+int rtap_nl_get_stats(int if_index, struct rtnl_link_stats64 *stats);
+
+#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
new file mode 100644
index 0000000000..95e0b47988
--- /dev/null
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -0,0 +1,177 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if_tun.h>
+#include <linux/virtio_net.h>
+
+#include <rte_config.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_eal.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_kvargs.h>
+#include <rte_log.h>
+#include <bus_vdev_driver.h>
+#include <ethdev_driver.h>
+#include <ethdev_vdev.h>
+
+#include "rtap.h"
+
+#define RTAP_DEFAULT_IFNAME "rtap%d"
+
+#define RTAP_IFACE_ARG "iface"
+#define RTAP_PERSIST_ARG "persist"
+
+static const char * const valid_arguments[] = {
+ RTAP_IFACE_ARG,
+ RTAP_PERSIST_ARG,
+ NULL
+};
+
+static int
+rtap_dev_close(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ PMD_LOG(INFO, "Closing ifindex %d", pmd->if_index);
+
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ /* mac_addrs must not be freed alone because part of dev_private */
+ dev->data->mac_addrs = NULL;
+
+ if (pmd->keep_fd != -1) {
+ PMD_LOG(DEBUG, "Closing keep_fd %d", pmd->keep_fd);
+ close(pmd->keep_fd);
+ pmd->keep_fd = -1;
+ }
+
+ if (pmd->nlsk_fd != -1) {
+ close(pmd->nlsk_fd);
+ pmd->nlsk_fd = -1;
+ }
+ }
+
+ free(dev->process_private);
+ dev->process_private = NULL;
+
+ return 0;
+}
+
+static const struct eth_dev_ops rtap_ops = {
+ .dev_close = rtap_dev_close,
+};
+
+static int
+rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_args)
+{
+ char *name = extra_args;
+
+ /* must not be null string */
+ if (value == NULL || value[0] == '\0' || strnlen(value, IFNAMSIZ) == IFNAMSIZ)
+ return -EINVAL;
+
+ strlcpy(name, value, IFNAMSIZ);
+ return 0;
+}
+
+static int
+rtap_probe(struct rte_vdev_device *vdev)
+{
+ const char *name = rte_vdev_device_name(vdev);
+ const char *params = rte_vdev_device_args(vdev);
+ struct rte_kvargs *kvlist = NULL;
+ struct rte_eth_dev *eth_dev = NULL;
+ int *fds = NULL;
+ char tap_name[IFNAMSIZ] = RTAP_DEFAULT_IFNAME;
+ uint8_t persist = 0;
+ int ret;
+
+ PMD_LOG(INFO, "Initializing %s", name);
+
+ if (params != NULL) {
+ kvlist = rte_kvargs_parse(params, valid_arguments);
+ if (kvlist == NULL)
+ return -1;
+
+ if (rte_kvargs_count(kvlist, RTAP_IFACE_ARG) == 1) {
+ ret = rte_kvargs_process_opt(kvlist, RTAP_IFACE_ARG,
+ &rtap_parse_iface, tap_name);
+ if (ret < 0)
+ goto error;
+ }
+
+ if (rte_kvargs_count(kvlist, RTAP_PERSIST_ARG) == 1)
+ persist = 1;
+ }
+
+ /* Per-queue tap fd's (for primary process) */
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Unable to allocate fd array");
+ goto error;
+ }
+ for (unsigned int i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
+ eth_dev = rte_eth_vdev_allocate(vdev, sizeof(struct rtap_pmd));
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "%s Unable to allocate device struct", tap_name);
+ goto error;
+ }
+
+ eth_dev->dev_ops = &rtap_ops;
+ eth_dev->process_private = fds;
+ eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+
+ RTE_SET_USED(persist); /* used in later patches */
+
+ rte_eth_dev_probing_finish(eth_dev);
+ rte_kvargs_free(kvlist);
+ return 0;
+
+error:
+ if (eth_dev != NULL) {
+ eth_dev->process_private = NULL;
+ rte_eth_dev_release_port(eth_dev);
+ }
+ free(fds);
+ rte_kvargs_free(kvlist);
+ return -1;
+}
+
+static int
+rtap_remove(struct rte_vdev_device *dev)
+{
+ struct rte_eth_dev *eth_dev;
+
+ eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+ if (eth_dev == NULL)
+ return 0;
+
+ rtap_dev_close(eth_dev);
+ rte_eth_dev_release_port(eth_dev);
+ return 0;
+}
+
+static struct rte_vdev_driver pmd_rtap_drv = {
+ .probe = rtap_probe,
+ .remove = rtap_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_rtap, pmd_rtap_drv);
+RTE_PMD_REGISTER_ALIAS(net_rtap, eth_rtap);
+RTE_PMD_REGISTER_PARAM_STRING(net_rtap,
+ RTAP_IFACE_ARG "=<string> "
+ RTAP_PERSIST_ARG);
+RTE_LOG_REGISTER_DEFAULT(rtap_logtype, NOTICE);
--
2.51.0
* [PATCH v6 02/11] net/rtap: add TAP device creation and queue management
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 01/11] net/rtap: add driver skeleton and documentation Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 03/11] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
` (9 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Anatoly Burakov
Add TAP device creation using /dev/net/tun with IFF_MULTI_QUEUE,
IFF_NO_PI, IFF_VNET_HDR, and optional IFF_NAPI flags.
The driver maintains a keep-alive fd and opens additional per-queue
fds for I/O. Each queue pair (rx+tx) shares a single TAP fd.
Key operations:
- rtap_create(): Opens TAP, gets stable ifindex, opens netlink
socket, retrieves MAC via netlink, detaches keep-alive queue
- rtap_queue_open/close(): Per-queue fd management (converts
ifindex to name for TUNSETIFF ioctl)
- rtap_dev_configure(): Validates paired queues, clears offloads
- rtap_dev_start/stop(): Manages link status and queue states
The driver uses netlink (RTM_GETLINK/RTM_NEWLINK) and ifindex
for interface control rather than ioctl() and interface names.
This avoids issues with interface renames and namespace moves.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 4 +
drivers/net/rtap/rtap_ethdev.c | 241 ++++++++++++++++-
drivers/net/rtap/rtap_netlink.c | 445 ++++++++++++++++++++++++++++++++
4 files changed, 689 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/rtap/rtap_netlink.c
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index 7bd7806ef3..1a24ea0555 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -19,6 +19,7 @@ endif
sources = files(
'rtap_ethdev.c',
+ 'rtap_netlink.c',
)
ext_deps += liburing
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 9004953e04..a2d1149cac 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -64,6 +64,10 @@ struct rtap_pmd {
struct rte_ether_addr eth_addr; /* address assigned by kernel */
};
+/* rtap_ethdev.c */
+int rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id);
+void rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id);
+
/* rtap_netlink.c */
int rtap_nl_open(unsigned int groups);
struct rte_eth_dev;
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 95e0b47988..0eab0a48fa 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -9,7 +9,6 @@
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
-#include <sys/socket.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <linux/virtio_net.h>
@@ -39,13 +38,145 @@ static const char * const valid_arguments[] = {
NULL
};
+/* Creates a new tap device, name returned in ifr */
+static int
+rtap_tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
+{
+ static const char tun_dev[] = "/dev/net/tun";
+ int tap_fd;
+
+ tap_fd = open(tun_dev, O_RDWR | O_CLOEXEC | O_NONBLOCK);
+ if (tap_fd < 0) {
+ PMD_LOG_ERRNO(ERR, "Open %s failed", tun_dev);
+ return -1;
+ }
+
+ int features = 0;
+ if (ioctl(tap_fd, TUNGETFEATURES, &features) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNGETFEATURES): %s", tun_dev);
+ goto error;
+ }
+
+ int flags = IFF_TAP | IFF_MULTI_QUEUE | IFF_NO_PI | IFF_VNET_HDR;
+ if ((features & flags) != flags) {
+ PMD_LOG(ERR, "TUN features %#x missing support for %#x",
+ features, flags & ~features);
+ goto error;
+ }
+
+#ifdef IFF_NAPI
+	/* If the kernel supports NAPI, enable it */
+ if (features & IFF_NAPI)
+ flags |= IFF_NAPI;
+#endif
+ /*
+	 * Set the device name and packet format.
+	 * We do not want the protocol information (PI) header.
+ */
+ strlcpy(ifr->ifr_name, name, IFNAMSIZ);
+ ifr->ifr_flags = flags;
+ if (ioctl(tap_fd, TUNSETIFF, ifr) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNSETIFF) %s", ifr->ifr_name);
+ goto error;
+ }
+
+ /* (Optional) keep the device after application exit */
+ if (persist && ioctl(tap_fd, TUNSETPERSIST, 1) < 0) {
+ PMD_LOG_ERRNO(ERR, "ioctl(TUNSETPERSIST) %s", ifr->ifr_name);
+ goto error;
+ }
+
+ int hdr_size = sizeof(struct virtio_net_hdr);
+ if (ioctl(tap_fd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
+ PMD_LOG(ERR, "ioctl(TUNSETVNETHDRSZ) %s", strerror(errno));
+ goto error;
+ }
+
+ return tap_fd;
+error:
+ close(tap_fd);
+ return -1;
+}
+
+static int
+rtap_dev_start(struct rte_eth_dev *dev)
+{
+ dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
+ }
+
+ return 0;
+}
+
+static int
+rtap_dev_stop(struct rte_eth_dev *dev)
+{
+ int *fds = dev->process_private;
+
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
+ }
+
+ if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++) {
+ if (fds[i] == -1)
+ continue;
+
+ close(fds[i]);
+ fds[i] = -1;
+ }
+ }
+
+ return 0;
+}
+
+static int
+rtap_dev_configure(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ /* rx/tx must be paired */
+ if (dev->data->nb_rx_queues != dev->data->nb_tx_queues) {
+ PMD_LOG(ERR, "number of rx %u and tx %u queues must match",
+ dev->data->nb_rx_queues, dev->data->nb_tx_queues);
+ return -EINVAL;
+ }
+
+ if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, 0) != 0) {
+ int ret = -errno;
+
+ PMD_LOG(ERR, "ioctl(TUNSETOFFLOAD) failed: %s", strerror(errno));
+ return ret;
+ }
+
+ return 0;
+}
+
static int
rtap_dev_close(struct rte_eth_dev *dev)
{
struct rtap_pmd *pmd = dev->data->dev_private;
+ int *fds = dev->process_private;
PMD_LOG(INFO, "Closing ifindex %d", pmd->if_index);
+ /* Release all io_uring queues (calls rx/tx_queue_release for each) */
+ rte_eth_dev_internal_reset(dev);
+
+ /* Close any remaining queue fds (each process owns its own set) */
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++) {
+ if (fds[i] == -1)
+ continue;
+ PMD_LOG(DEBUG, "Closed queue %u fd %d", i, fds[i]);
+ close(fds[i]);
+ fds[i] = -1;
+ }
+
if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
/* mac_addrs must not be freed alone because part of dev_private */
dev->data->mac_addrs = NULL;
@@ -68,10 +199,115 @@ rtap_dev_close(struct rte_eth_dev *dev)
return 0;
}
+/* Open another fd to the TAP device for this queue */
+int
+rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ int *fds = dev->process_private;
+ char ifname[IFNAMSIZ];
+
+ if (fds[queue_id] != -1) {
+ PMD_LOG(DEBUG, "queue %u already has fd %d", queue_id, fds[queue_id]);
+ return 0; /* already setup */
+ }
+
+ /* Convert ifindex to name for TUNSETIFF */
+ if (if_indextoname(pmd->if_index, ifname) == NULL) {
+ PMD_LOG(ERR, "Could not find interface for ifindex %d", pmd->if_index);
+ return -1;
+ }
+
+ struct ifreq ifr = { 0 };
+ int tap_fd = rtap_tap_open(ifname, &ifr, 0);
+ if (tap_fd < 0) {
+ PMD_LOG(ERR, "tap_open failed");
+ return -1;
+ }
+
+ PMD_LOG(DEBUG, "Opened %d for queue %u", tap_fd, queue_id);
+ fds[queue_id] = tap_fd;
+ return 0;
+}
+
+void
+rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ int *fds = dev->process_private;
+ int tap_fd = fds[queue_id];
+
+ if (tap_fd == -1)
+ return; /* already closed */
+ PMD_LOG(DEBUG, "Closed queue %u fd %d", queue_id, tap_fd);
+ close(tap_fd);
+ fds[queue_id] = -1;
+}
+
static const struct eth_dev_ops rtap_ops = {
+ .dev_start = rtap_dev_start,
+ .dev_stop = rtap_dev_stop,
+ .dev_configure = rtap_dev_configure,
.dev_close = rtap_dev_close,
};
+static int
+rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
+{
+ struct rte_eth_dev_data *data = dev->data;
+ struct rtap_pmd *pmd = data->dev_private;
+
+ pmd->keep_fd = -1;
+ pmd->nlsk_fd = -1;
+
+ dev->dev_ops = &rtap_ops;
+
+ /* Get the initial fd used to keep the tap device around */
+ struct ifreq ifr = { 0 };
+ pmd->keep_fd = rtap_tap_open(tap_name, &ifr, persist);
+ if (pmd->keep_fd < 0)
+ goto error;
+
+ PMD_LOG(DEBUG, "Created %s keep_fd %d", ifr.ifr_name, pmd->keep_fd);
+
+ /* Use if_index which is stable even if interface is renamed */
+ pmd->if_index = if_nametoindex(ifr.ifr_name);
+ if (pmd->if_index == 0) {
+ PMD_LOG(ERR, "Could not find ifindex for '%s'", ifr.ifr_name);
+ goto error;
+ }
+
+ /* Open persistent netlink socket for control operations */
+ pmd->nlsk_fd = rtap_nl_open(0);
+ if (pmd->nlsk_fd < 0)
+ goto error;
+
+ /* Read the MAC address assigned by the kernel via netlink */
+ if (rtap_nl_get_mac(pmd->nlsk_fd, pmd->if_index, &pmd->eth_addr) < 0) {
+ PMD_LOG(ERR, "Unable to get MAC address for ifindex %d", pmd->if_index);
+ goto error;
+ }
+ data->mac_addrs = &pmd->eth_addr;
+
+ /* Detach this instance, not used for traffic */
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+ if (ioctl(pmd->keep_fd, TUNSETQUEUE, &ifr) < 0) {
+ PMD_LOG_ERRNO(ERR, "Unable to detach keep-alive queue for ifindex %d",
+ pmd->if_index);
+ goto error;
+ }
+
+ PMD_LOG(DEBUG, "ifindex %d setup", pmd->if_index);
+
+ return 0;
+
+error:
+ if (pmd->nlsk_fd != -1)
+ close(pmd->nlsk_fd);
+ if (pmd->keep_fd != -1)
+ close(pmd->keep_fd);
+ return -1;
+}
+
static int
rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_args)
{
@@ -134,7 +370,8 @@ rtap_probe(struct rte_vdev_device *vdev)
eth_dev->process_private = fds;
eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
- RTE_SET_USED(persist); /* used in later patches */
+ if (rtap_create(eth_dev, tap_name, persist) < 0)
+ goto error;
rte_eth_dev_probing_finish(eth_dev);
rte_kvargs_free(kvlist);
diff --git a/drivers/net/rtap/rtap_netlink.c b/drivers/net/rtap/rtap_netlink.c
new file mode 100644
index 0000000000..060b89c625
--- /dev/null
+++ b/drivers/net/rtap/rtap_netlink.c
@@ -0,0 +1,445 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+/*
+ * Netlink-based control operations for the rtap PMD.
+ *
+ * Uses RTM_GETLINK / RTM_NEWLINK to replace ioctl() for interface
+ * flag changes, MTU, MAC address, and statistics retrieval.
+ *
+ * Socket model:
+ * - Control socket (pmd->nlsk_fd): persistent per-device, opened
+ * at create time. Used for flag changes, MTU, MAC operations.
+ * - LSC socket: persistent while enabled, subscribed to RTMGRP_LINK.
+ * Managed by rtap_intr.c via rtap_nl_open().
+ * - Stats queries (rtap_nl_get_stats): use an ephemeral socket
+ * opened on demand so they cannot block behind control operations.
+ */
+
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/if_link.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <ethdev_driver.h>
+#include <rte_stdatomic.h>
+
+#include "rtap.h"
+
+/* Sequence number for netlink requests */
+static RTE_ATOMIC(uint32_t) rtap_nl_seq;
+
+/*
+ * Open a netlink route socket.
+ *
+ * If groups is non-zero, the socket subscribes to those multicast
+ * groups and is set non-blocking (for LSC notification).
+ * If groups is zero, the socket is blocking (for control/query).
+ *
+ * Returns socket fd or -1 on failure.
+ */
+int
+rtap_nl_open(unsigned int groups)
+{
+ int flags = SOCK_RAW | SOCK_CLOEXEC;
+ int fd;
+ struct sockaddr_nl sa = {
+ .nl_family = AF_NETLINK,
+ .nl_groups = groups,
+ };
+
+ if (groups != 0)
+ flags |= SOCK_NONBLOCK;
+
+ fd = socket(AF_NETLINK, flags, NETLINK_ROUTE);
+ if (fd < 0) {
+ PMD_LOG_ERRNO(ERR, "netlink socket");
+ return -1;
+ }
+
+ if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+ PMD_LOG_ERRNO(ERR, "netlink bind");
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+/*
+ * Send a netlink request and wait for acknowledgment.
+ * Returns 0 on success, negative errno on failure.
+ */
+static int
+rtap_nl_request(int fd, struct nlmsghdr *nlh)
+{
+ char buf[4096];
+ ssize_t len;
+
+ nlh->nlmsg_seq = rte_atomic_fetch_add_explicit(&rtap_nl_seq, 1,
+ rte_memory_order_relaxed);
+ nlh->nlmsg_flags |= NLM_F_ACK;
+
+ if (send(fd, nlh, nlh->nlmsg_len, 0) < 0)
+ return -errno;
+
+ len = recv(fd, buf, sizeof(buf), 0);
+ if (len < 0)
+ return -errno;
+
+ struct nlmsghdr *nh = (struct nlmsghdr *)buf;
+ if (!NLMSG_OK(nh, (unsigned int)len))
+ return -EBADMSG;
+
+ if (nh->nlmsg_type == NLMSG_ERROR) {
+ struct nlmsgerr *err = NLMSG_DATA(nh);
+
+ return err->error; /* 0 = success, negative = errno */
+ }
+
+ return -EBADMSG;
+}
+
+/*
+ * Send a netlink request and receive a data response.
+ * Returns length of response on success, negative errno on failure.
+ */
+static int
+rtap_nl_query(int fd, struct nlmsghdr *nlh, char *buf, size_t buflen)
+{
+ ssize_t len;
+
+ nlh->nlmsg_seq = rte_atomic_fetch_add_explicit(&rtap_nl_seq, 1,
+ rte_memory_order_relaxed);
+
+ if (send(fd, nlh, nlh->nlmsg_len, 0) < 0)
+ return -errno;
+
+ len = recv(fd, buf, buflen, 0);
+ if (len < 0)
+ return -errno;
+
+ struct nlmsghdr *nh = (struct nlmsghdr *)buf;
+ if (!NLMSG_OK(nh, (unsigned int)len))
+ return -EBADMSG;
+
+ if (nh->nlmsg_type == NLMSG_ERROR) {
+ struct nlmsgerr *err = NLMSG_DATA(nh);
+
+ return err->error;
+ }
+
+ /* Detect truncated response */
+ if (nh->nlmsg_len > (unsigned int)len)
+ return -EBADMSG;
+
+ return len;
+}
+
+/* Append a netlink attribute to a message. */
+static void
+rtap_nl_addattr(struct nlmsghdr *nlh, unsigned int maxlen,
+ int type, const void *data, unsigned int datalen)
+{
+ unsigned int len = RTA_LENGTH(datalen);
+ struct rtattr *rta;
+
+ RTE_VERIFY(NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(len) <= maxlen);
+
+ rta = (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
+ rta->rta_type = type;
+ rta->rta_len = len;
+ if (datalen > 0)
+ memcpy(RTA_DATA(rta), data, datalen);
+ nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(len);
+}
+
+/*
+ * Get interface flags via RTM_GETLINK.
+ * Returns 0 on success and sets *flags.
+ */
+int
+rtap_nl_get_flags(int nlsk_fd, int if_index, unsigned int *flags)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_GETLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ },
+ };
+ char resp[4096];
+ int ret;
+
+ ret = rtap_nl_query(nlsk_fd, &req.nlh, resp, sizeof(resp));
+ if (ret < 0)
+ return ret;
+
+ struct nlmsghdr *nh = (struct nlmsghdr *)resp;
+ if (nh->nlmsg_type != RTM_NEWLINK)
+ return -EBADMSG;
+
+ struct ifinfomsg *ifi = NLMSG_DATA(nh);
+ *flags = ifi->ifi_flags;
+ return 0;
+}
+
+/*
+ * Change interface flags via RTM_NEWLINK.
+ * Bits selected by 'mask' take the corresponding values from 'flags'.
+ */
+int
+rtap_nl_change_flags(int nlsk_fd, int if_index,
+ unsigned int flags, unsigned int mask)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_NEWLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ .ifi_flags = flags,
+ .ifi_change = mask,
+ },
+ };
+
+ return rtap_nl_request(nlsk_fd, &req.nlh);
+}
+
+/*
+ * Set MTU via RTM_NEWLINK.
+ */
+int
+rtap_nl_set_mtu(int nlsk_fd, int if_index, uint16_t mtu)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ char attrs[64];
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_NEWLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ },
+ };
+ unsigned int mtu32 = mtu;
+
+ rtap_nl_addattr(&req.nlh, sizeof(req), IFLA_MTU, &mtu32, sizeof(mtu32));
+ return rtap_nl_request(nlsk_fd, &req.nlh);
+}
+
+/*
+ * Set MAC address via RTM_NEWLINK.
+ */
+int
+rtap_nl_set_mac(int nlsk_fd, int if_index, const struct rte_ether_addr *addr)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ char attrs[64];
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_NEWLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ },
+ };
+
+ rtap_nl_addattr(&req.nlh, sizeof(req), IFLA_ADDRESS,
+ addr->addr_bytes, RTE_ETHER_ADDR_LEN);
+ return rtap_nl_request(nlsk_fd, &req.nlh);
+}
+
+/*
+ * Get MAC address via RTM_GETLINK.
+ */
+int
+rtap_nl_get_mac(int nlsk_fd, int if_index, struct rte_ether_addr *addr)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_GETLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ },
+ };
+ char resp[4096];
+ int ret;
+
+ ret = rtap_nl_query(nlsk_fd, &req.nlh, resp, sizeof(resp));
+ if (ret < 0)
+ return ret;
+
+ struct nlmsghdr *nh = (struct nlmsghdr *)resp;
+ if (nh->nlmsg_type != RTM_NEWLINK)
+ return -EBADMSG;
+
+ struct ifinfomsg *ifi = NLMSG_DATA(nh);
+ struct rtattr *rta = (struct rtattr *)((char *)ifi + NLMSG_ALIGN(sizeof(*ifi)));
+ int rtalen = nh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+
+ while (RTA_OK(rta, rtalen)) {
+ if (rta->rta_type == IFLA_ADDRESS) {
+ if (RTA_PAYLOAD(rta) == RTE_ETHER_ADDR_LEN) {
+ memcpy(addr->addr_bytes, RTA_DATA(rta), RTE_ETHER_ADDR_LEN);
+ return 0;
+ }
+ }
+ rta = RTA_NEXT(rta, rtalen);
+ }
+
+ return -ENOENT;
+}
+
+/*
+ * Get link statistics via RTM_GETLINK with IFLA_STATS64 attribute.
+ * Opens an ephemeral socket to avoid blocking behind control operations.
+ */
+int
+rtap_nl_get_stats(int if_index, struct rtnl_link_stats64 *stats)
+{
+ struct {
+ struct nlmsghdr nlh;
+ struct ifinfomsg ifi;
+ } req = {
+ .nlh = {
+ .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+ .nlmsg_type = RTM_GETLINK,
+ .nlmsg_flags = NLM_F_REQUEST,
+ },
+ .ifi = {
+ .ifi_family = AF_UNSPEC,
+ .ifi_index = if_index,
+ },
+ };
+ char resp[4096];
+ int fd, ret;
+
+ memset(stats, 0, sizeof(*stats));
+
+ /* Use ephemeral socket so stats queries don't block */
+ fd = rtap_nl_open(0);
+ if (fd < 0)
+ return fd;
+
+ ret = rtap_nl_query(fd, &req.nlh, resp, sizeof(resp));
+ close(fd);
+
+ if (ret < 0)
+ return ret;
+
+ struct nlmsghdr *nh = (struct nlmsghdr *)resp;
+ if (nh->nlmsg_type != RTM_NEWLINK)
+ return -EBADMSG;
+
+ struct ifinfomsg *ifi = NLMSG_DATA(nh);
+ struct rtattr *rta = (struct rtattr *)((char *)ifi + NLMSG_ALIGN(sizeof(*ifi)));
+ int rtalen = nh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+
+ /* Parse attributes looking for IFLA_STATS64 */
+ while (RTA_OK(rta, rtalen)) {
+ if (rta->rta_type == IFLA_STATS64) {
+ if (RTA_PAYLOAD(rta) >= sizeof(*stats)) {
+ memcpy(stats, RTA_DATA(rta), sizeof(*stats));
+ return 0;
+ }
+ }
+ rta = RTA_NEXT(rta, rtalen);
+ }
+
+ return -ENOENT;
+}
+
+/*
+ * Process incoming netlink messages for link state changes.
+ * Called by rtap_intr.c when the LSC socket has data.
+ */
+void
+rtap_nl_recv(int fd, struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ char buf[4096];
+ ssize_t len;
+
+ while ((len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT)) > 0) {
+ struct nlmsghdr *nh;
+
+ for (nh = (struct nlmsghdr *)buf;
+ NLMSG_OK(nh, (unsigned int)len);
+ nh = NLMSG_NEXT(nh, len)) {
+ if (nh->nlmsg_type == NLMSG_DONE)
+ break;
+ if (nh->nlmsg_type == NLMSG_ERROR)
+ continue;
+ if (nh->nlmsg_type != RTM_NEWLINK &&
+ nh->nlmsg_type != RTM_DELLINK)
+ continue;
+
+ struct ifinfomsg *ifi = NLMSG_DATA(nh);
+
+ /* Only process messages for our interface */
+ if (ifi->ifi_index != pmd->if_index)
+ continue;
+
+ if (nh->nlmsg_type == RTM_DELLINK) {
+ PMD_LOG(INFO, "ifindex %d deleted", pmd->if_index);
+ dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rte_eth_dev_callback_process(dev,
+ RTE_ETH_EVENT_INTR_LSC, NULL);
+ } else {
+ bool was_up = dev->data->dev_link.link_status == RTE_ETH_LINK_UP;
+ bool is_up = (ifi->ifi_flags & IFF_UP) &&
+ (ifi->ifi_flags & IFF_RUNNING);
+
+ if (was_up != is_up) {
+ PMD_LOG(DEBUG, "ifindex %d link %s",
+ pmd->if_index, is_up ? "up" : "down");
+ dev->data->dev_link.link_status =
+ is_up ? RTE_ETH_LINK_UP : RTE_ETH_LINK_DOWN;
+ rte_eth_dev_callback_process(dev,
+ RTE_ETH_EVENT_INTR_LSC, NULL);
+ }
+ }
+ }
+ }
+}
--
2.51.0
* [PATCH v6 03/11] net/rtap: add Rx/Tx with scatter/gather support
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Implement packet receive and transmit using io_uring asynchronous I/O,
with full support for both single-segment and multi-segment mbufs.
Rx path:
- rtap_rx_alloc() chains multiple mbufs when the MTU exceeds a
single mbuf's tailroom capacity
- Pre-post read/readv requests to the io_uring submission queue,
each backed by a pre-allocated (possibly chained) mbuf
- On rx_burst, harvest completed CQEs and replace each consumed
mbuf with a freshly allocated one
- rtap_rx_adjust() distributes received data across segments and
frees unused trailing segments
- Parse the prepended virtio-net header
Tx path:
- For single-segment mbufs, use io_uring write and batch submits
- For multi-segment mbufs, use writev via io_uring with immediate
submit (iovec is stack-allocated)
- When the mbuf headroom is not writable (shared or indirect),
chain a new header mbuf for the virtio-net header
- Prepend a zeroed virtio-net header
- Clean completed Tx CQEs to free transmitted mbufs
Add io_uring cancel-all logic using IORING_ASYNC_CANCEL_ALL for
clean queue teardown, draining all pending CQEs and freeing mbufs.
Full offload configuration is not enabled until later.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 13 +
drivers/net/rtap/rtap_ethdev.c | 7 +
drivers/net/rtap/rtap_rxtx.c | 771 ++++++++++++++++++++++++++++++
5 files changed, 793 insertions(+)
create mode 100644 drivers/net/rtap/rtap_rxtx.c
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index ed7c638029..c064e1e0b9 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -4,6 +4,7 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Scattered Rx = P
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index 1a24ea0555..ed0bcc1313 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -20,6 +20,7 @@ endif
sources = files(
'rtap_ethdev.c',
'rtap_netlink.c',
+ 'rtap_rxtx.c',
)
ext_deps += liburing
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index a2d1149cac..823b8c59f7 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -82,4 +82,17 @@ int rtap_nl_get_mac(int nlsk_fd, int if_index, struct rte_ether_addr *addr);
struct rtnl_link_stats64;
int rtap_nl_get_stats(int if_index, struct rtnl_link_stats64 *stats);
+/* rtap_rxtx.c */
+uint16_t rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts);
+uint16_t rtap_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts);
+int rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_rx_desc, unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf,
+ struct rte_mempool *mb_pool);
+void rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf);
+void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 0eab0a48fa..463e24a6e1 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -248,6 +248,10 @@ static const struct eth_dev_ops rtap_ops = {
.dev_stop = rtap_dev_stop,
.dev_configure = rtap_dev_configure,
.dev_close = rtap_dev_close,
+ .rx_queue_setup = rtap_rx_queue_setup,
+ .rx_queue_release = rtap_rx_queue_release,
+ .tx_queue_setup = rtap_tx_queue_setup,
+ .tx_queue_release = rtap_tx_queue_release,
};
static int
@@ -298,6 +302,9 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
PMD_LOG(DEBUG, "ifindex %d setup", pmd->if_index);
+ dev->rx_pkt_burst = rtap_rx_burst;
+ dev->tx_pkt_burst = rtap_tx_burst;
+
return 0;
error:
diff --git a/drivers/net/rtap/rtap_rxtx.c b/drivers/net/rtap/rtap_rxtx.c
new file mode 100644
index 0000000000..7826169751
--- /dev/null
+++ b/drivers/net/rtap/rtap_rxtx.c
@@ -0,0 +1,771 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <liburing.h>
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_config.h>
+#include <rte_common.h>
+#include <rte_malloc.h>
+#include <rte_cksum.h>
+#include <rte_debug.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+#include <rte_mbuf.h>
+#include <rte_net.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+#include <rte_log.h>
+#include <ethdev_driver.h>
+
+#include "rtap.h"
+
+/*
+ * Since the virtio net header is prepended to the mbuf,
+ * the DPDK configuration must ensure that mbuf pools are
+ * created with enough headroom for it.
+ */
+static_assert(RTE_PKTMBUF_HEADROOM >= sizeof(struct virtio_net_hdr),
+ "Pktmbuf headroom not big enough for virtio header");
+
+/* Get the per-process file descriptor used for transmit and receive */
+static inline int
+rtap_queue_fd(uint16_t port_id, uint16_t queue_id)
+{
+ struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+ int *fds = dev->process_private;
+ int fd = fds[queue_id];
+
+ RTE_ASSERT(fd != -1);
+ return fd;
+}
+
+/*
+ * Add to submit queue a read of mbuf data.
+ * For multi-segment mbufs, readv() is required.
+ * Return:
+ *   -ENOSPC : no submit queue element available.
+ *   > 0 : readv was used and io_uring_submit() was already called;
+ *         the value is the number of entries submitted.
+ *   0 : regular read queued, caller should call io_uring_submit
+ */
+static inline int
+rtap_rx_submit(struct rtap_rx_queue *rxq, int fd, struct rte_mbuf *mb)
+{
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&rxq->io_ring);
+ struct iovec iovs[IOV_MAX];
+ uint16_t nsegs = mb->nb_segs;
+
+ if (unlikely(sqe == NULL))
+ return -ENOSPC;
+
+ io_uring_sqe_set_data(sqe, mb);
+
+ RTE_ASSERT(rte_pktmbuf_headroom(mb) >= sizeof(struct virtio_net_hdr));
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(struct virtio_net_hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + rte_pktmbuf_tailroom(mb);
+
+ /* optimize for the case where packet fits in one mbuf */
+ if (nsegs == 1) {
+ io_uring_prep_read(sqe, fd, buf, nbytes, 0);
+ /* caller will submit as batch */
+ return 0;
+ }
+
+ RTE_ASSERT(nsegs > 0 && nsegs < IOV_MAX);
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (uint16_t i = 1; i < nsegs; i++) {
+ mb = mb->next;
+ iovs[i].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[i].iov_len = rte_pktmbuf_tailroom(mb);
+ }
+ io_uring_prep_readv(sqe, fd, iovs, nsegs, 0);
+
+ /*
+ * For readv, need to submit now since iovs[] must be
+ * valid until submitted.
+ * io_uring_submit(3) returns the number of submitted submission
+ * queue entries (on failure returns -errno).
+ */
+ return io_uring_submit(&rxq->io_ring);
+}
+
+/* Allocate one or more mbufs to be used for reading packets */
+static struct rte_mbuf *
+rtap_rx_alloc(struct rtap_rx_queue *rxq)
+{
+ const struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+ int buf_size = dev->data->mtu + RTE_ETHER_HDR_LEN;
+ struct rte_mbuf *m = NULL;
+ struct rte_mbuf **tail = &m;
+
+ do {
+ struct rte_mbuf *seg = rte_pktmbuf_alloc(rxq->mb_pool);
+ if (unlikely(seg == NULL)) {
+ rte_pktmbuf_free(m);
+ return NULL;
+ }
+ *tail = seg;
+ tail = &seg->next;
+ if (seg != m)
+ ++m->nb_segs;
+
+ buf_size -= rte_pktmbuf_tailroom(seg);
+ } while (buf_size > 0);
+
+ __rte_mbuf_sanity_check(m, 1);
+ return m;
+}
+
+/*
+ * When receiving into a multi-segment mbuf, distribute the
+ * received length across the segments.
+ */
+static inline int
+rtap_rx_adjust(struct rte_mbuf *mb, uint32_t len)
+{
+ struct rte_mbuf *seg;
+ uint16_t count = 0;
+
+ mb->pkt_len = len;
+
+ /* Walk through mbuf chain and update the length of each segment */
+ for (seg = mb; seg != NULL && len > 0; seg = seg->next) {
+ uint16_t seg_len = RTE_MIN(len, rte_pktmbuf_tailroom(seg));
+
+ seg->data_len = seg_len;
+ count++;
+ len -= seg_len;
+
+ /* If length is zero, this is end of packet */
+ if (len == 0) {
+ /* Drop unused tail segments */
+ if (seg->next != NULL) {
+ struct rte_mbuf *tail = seg->next;
+ seg->next = NULL;
+
+ /* Free segments one by one to avoid nb_segs issues */
+ while (tail != NULL) {
+ struct rte_mbuf *next = tail->next;
+ rte_pktmbuf_free_seg(tail);
+ tail = next;
+ }
+ }
+
+ mb->nb_segs = count;
+ return 0;
+ }
+ }
+
+ /* Packet was truncated - not enough mbuf space */
+ return -1;
+}
+
+/*
+ * Set the receive offload flags of received mbuf
+ * based on the bits in the virtio network header
+ */
+static int
+rtap_rx_offload(struct rte_mbuf *m, const struct virtio_net_hdr *hdr)
+{
+ uint32_t ptype;
+ bool l4_supported = false;
+ struct rte_net_hdr_lens hdr_lens;
+
+ /* nothing to do */
+ if (hdr->flags == 0 && hdr->gso_type == VIRTIO_NET_HDR_GSO_NONE)
+ return 0;
+
+ m->ol_flags |= RTE_MBUF_F_RX_IP_CKSUM_UNKNOWN;
+
+ ptype = rte_net_get_ptype(m, &hdr_lens, RTE_PTYPE_ALL_MASK);
+ m->packet_type = ptype;
+ if ((ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_TCP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_UDP ||
+ (ptype & RTE_PTYPE_L4_MASK) == RTE_PTYPE_L4_SCTP)
+ l4_supported = true;
+
+ if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ uint32_t hdrlen = hdr_lens.l2_len + hdr_lens.l3_len + hdr_lens.l4_len;
+ if (hdr->csum_start <= hdrlen && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ } else {
+ /* Unknown proto or tunnel, do sw cksum. */
+ uint16_t csum = 0;
+
+ if (rte_raw_cksum_mbuf(m, hdr->csum_start,
+ rte_pktmbuf_pkt_len(m) - hdr->csum_start,
+ &csum) < 0)
+ return -EINVAL;
+ if (likely(csum != 0xffff))
+ csum = ~csum;
+
+ uint32_t off = (uint32_t)hdr->csum_offset + hdr->csum_start;
+ if (rte_pktmbuf_data_len(m) >= off + sizeof(uint16_t))
+ *rte_pktmbuf_mtod_offset(m, uint16_t *, off) = csum;
+ }
+ } else if ((hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID) && l4_supported) {
+ m->ol_flags |= RTE_MBUF_F_RX_L4_CKSUM_GOOD;
+ }
+
+ /* GSO request, save required information in mbuf */
+ if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+ /* Check unsupported modes */
+ if ((hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN) || hdr->gso_size == 0)
+ return -EINVAL;
+
+ /* Update mss lengths in mbuf */
+ m->tso_segsz = hdr->gso_size;
+ switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ m->ol_flags |= RTE_MBUF_F_RX_LRO | RTE_MBUF_F_RX_L4_CKSUM_NONE;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+uint16_t
+rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rtap_rx_queue *rxq = queue;
+ struct io_uring_cqe *cqe;
+ unsigned int head, num_cqe = 0, num_sqe = 0;
+ uint16_t num_rx = 0;
+ uint32_t num_bytes = 0;
+ int fd = rtap_queue_fd(rxq->port_id, rxq->queue_id);
+
+ if (unlikely(nb_pkts == 0))
+ return 0;
+
+ io_uring_for_each_cqe(&rxq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+ struct rte_mbuf *nmb = NULL;
+ struct virtio_net_hdr *hdr = NULL;
+ ssize_t len = cqe->res;
+ int nsub;
+
+ PMD_RX_LOG(DEBUG, "complete m=%p len=%zd", mb, len);
+
+ num_cqe++;
+
+ if (unlikely(len < (ssize_t)(sizeof(*hdr) + RTE_ETHER_HDR_LEN))) {
+ if (len < 0)
+ PMD_RX_LOG(ERR, "io_uring_read: %s", strerror(-len));
+ else
+ PMD_RX_LOG(ERR, "io_uring_read len %zd", len);
+ rxq->rx_errors++;
+ nmb = mb;
+ goto resubmit;
+ }
+
+ /* virtio header is before packet data */
+ hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ len -= sizeof(*hdr);
+
+ /* Replacement mbuf for resubmitting */
+ nmb = rtap_rx_alloc(rxq);
+ if (unlikely(nmb == NULL)) {
+ struct rte_eth_dev *dev = &rte_eth_devices[rxq->port_id];
+
+ PMD_RX_LOG(ERR, "Rx mbuf alloc failed");
+ dev->data->rx_mbuf_alloc_failed++;
+
+ nmb = mb; /* Reuse original */
+ goto resubmit;
+ }
+
+ if (mb->nb_segs == 1) {
+ mb->data_len = len;
+ mb->pkt_len = len;
+ } else {
+ if (unlikely(rtap_rx_adjust(mb, len) < 0)) {
+ PMD_RX_LOG(ERR, "packet truncated: pkt_len=%u exceeds mbuf capacity",
+ mb->pkt_len);
+ ++rxq->rx_errors;
+ rte_pktmbuf_free(mb);
+ goto resubmit;
+ }
+ }
+
+ if (unlikely(rtap_rx_offload(mb, hdr) < 0)) {
+ PMD_RX_LOG(ERR, "invalid rx offload");
+ ++rxq->rx_errors;
+ rte_pktmbuf_free(mb);
+ goto resubmit;
+ }
+
+ mb->port = rxq->port_id;
+
+ __rte_mbuf_sanity_check(mb, 1);
+ num_bytes += mb->pkt_len;
+ bufs[num_rx++] = mb;
+
+resubmit:
+ /* Submit the replacement mbuf */
+ nsub = rtap_rx_submit(rxq, fd, nmb);
+ if (unlikely(nsub < 0)) {
+ /* Hope that later Rx can recover */
+ PMD_RX_LOG(ERR, "io_uring no Rx sqe: %s", strerror(-nsub));
+ rxq->rx_errors++;
+ rte_pktmbuf_free(nmb);
+ break;
+ }
+
+ if (nsub > 0)
+ num_sqe = 0;
+ else
+ ++num_sqe;
+
+ if (num_rx == nb_pkts)
+ break;
+ }
+ if (num_cqe > 0)
+ io_uring_cq_advance(&rxq->io_ring, num_cqe);
+
+ if (num_sqe > 0) {
+ int n = io_uring_submit(&rxq->io_ring);
+ if (unlikely(n < 0))
+ PMD_LOG(ERR, "Rx io_uring submit failed: %s", strerror(-n));
+ else if (unlikely(n != (int)num_sqe))
+ PMD_RX_LOG(NOTICE, "Rx io_uring %d of %u resubmitted", n, num_sqe);
+ }
+
+ rxq->rx_packets += num_rx;
+ rxq->rx_bytes += num_bytes;
+
+ return num_rx;
+}
+
+/*
+ * Cancel all pending io_uring operations and drain completions.
+ * Uses IORING_ASYNC_CANCEL_ALL to cancel all operations at once.
+ * Returns the number of mbufs freed.
+ */
+static unsigned int
+rtap_cancel_all(struct io_uring *ring)
+{
+ struct io_uring_cqe *cqe;
+ struct io_uring_sqe *sqe;
+ unsigned int head, num_freed = 0;
+ unsigned int ready;
+ int ret;
+
+ /* Cancel all pending operations using CANCEL_ALL flag */
+ sqe = io_uring_get_sqe(ring);
+ if (sqe != NULL) {
+ /* IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_ANY cancels all ops */
+ io_uring_prep_cancel(sqe, NULL,
+ IORING_ASYNC_CANCEL_ALL | IORING_ASYNC_CANCEL_ANY);
+ io_uring_sqe_set_data(sqe, NULL);
+ ret = io_uring_submit(ring);
+ if (ret < 0)
+ PMD_LOG(ERR, "cancel submit failed: %s", strerror(-ret));
+ }
+
+ /*
+ * One blocking wait to let the kernel deliver the cancel CQE
+ * and the CQEs for all cancelled operations.
+ */
+ io_uring_submit_and_wait(ring, 1);
+
+ /*
+ * Drain all CQEs non-blocking. Cancellation of many pending
+ * operations may produce CQEs in waves; keep polling until the
+ * CQ is empty.
+ */
+ for (unsigned int retries = 0; retries < 10; retries++) {
+ ready = io_uring_cq_ready(ring);
+ if (ready == 0)
+ break;
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ /* Skip the cancel operation's own CQE (user_data = NULL) */
+ if (mb != NULL) {
+ rte_pktmbuf_free(mb);
+ ++num_freed;
+ }
+ }
+
+ /* Advance past all processed CQEs */
+ io_uring_cq_advance(ring, ready);
+ }
+
+ return num_freed;
+}
+
+int
+rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_desc,
+ unsigned int socket_id,
+ const struct rte_eth_rxconf *rx_conf __rte_unused,
+ struct rte_mempool *mb_pool)
+{
+ struct rte_mbuf **mbufs = NULL;
+ unsigned int nsqe = 0;
+ int fd = -1;
+
+ PMD_LOG(DEBUG, "setup port %u queue %u rx_descriptors %u",
+ dev->data->port_id, queue_id, nb_rx_desc);
+
+ struct rtap_rx_queue *rxq = rte_zmalloc_socket(NULL, sizeof(*rxq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (rxq == NULL) {
+ PMD_LOG(ERR, "rxq alloc failed");
+ return -1;
+ }
+
+ rxq->mb_pool = mb_pool;
+ rxq->port_id = dev->data->port_id;
+ rxq->queue_id = queue_id;
+ dev->data->rx_queues[queue_id] = rxq;
+
+ if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ goto error_rxq_free;
+ }
+
+ mbufs = calloc(nb_rx_desc, sizeof(struct rte_mbuf *));
+ if (mbufs == NULL) {
+ PMD_LOG(ERR, "Rx mbuf pointer alloc failed");
+ goto error_iouring_exit;
+ }
+
+	/* Open the shared tap fd; it may already be set up */
+ if (rtap_queue_open(dev, queue_id) < 0)
+ goto error_bulk_free;
+
+ fd = rtap_queue_fd(rxq->port_id, rxq->queue_id);
+
+ for (uint16_t i = 0; i < nb_rx_desc; i++) {
+ mbufs[i] = rtap_rx_alloc(rxq);
+ if (mbufs[i] == NULL) {
+ PMD_LOG(ERR, "Rx mbuf alloc buf failed");
+ goto error_bulk_free;
+ }
+
+ int n = rtap_rx_submit(rxq, fd, mbufs[i]);
+ if (n < 0) {
+ PMD_LOG(ERR, "rtap_rx_submit failed: %s", strerror(-n));
+ goto error_bulk_free;
+ }
+
+		/* If using readv() then n > 0 and all SQEs have already been submitted. */
+ if (n > 0)
+ nsqe = 0;
+ else
+ ++nsqe;
+ }
+
+ if (nsqe > 0) {
+ int n = io_uring_submit(&rxq->io_ring);
+ if (n < 0) {
+ PMD_LOG(ERR, "Rx io_uring submit failed: %s", strerror(-n));
+ goto error_bulk_free;
+ }
+ if (n < (int)nsqe)
+ PMD_LOG(NOTICE, "Rx io_uring partial submit %d of %u", n, nb_rx_desc);
+ }
+
+ free(mbufs);
+ return 0;
+
+error_bulk_free:
+ /* some of the mbufs might be queued already */
+ rtap_cancel_all(&rxq->io_ring);
+ rtap_queue_close(dev, queue_id);
+ free(mbufs);
+error_iouring_exit:
+ io_uring_queue_exit(&rxq->io_ring);
+error_rxq_free:
+ rte_free(rxq);
+ return -1;
+}
+
+void
+rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[queue_id];
+
+ if (rxq == NULL)
+ return;
+
+ rtap_cancel_all(&rxq->io_ring);
+ io_uring_queue_exit(&rxq->io_ring);
+
+ rte_free(rxq);
+
+ /* Close the shared TAP fd if the tx queue is already gone */
+ if (queue_id >= dev->data->nb_tx_queues ||
+ dev->data->tx_queues[queue_id] == NULL)
+ rtap_queue_close(dev, queue_id);
+}
+
+int
+rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
+ uint16_t nb_tx_desc, unsigned int socket_id,
+ const struct rte_eth_txconf *tx_conf)
+{
+	/* Open the shared tap fd; it may already be set up */
+ if (rtap_queue_open(dev, queue_id) < 0)
+ return -1;
+
+ struct rtap_tx_queue *txq = rte_zmalloc_socket(NULL, sizeof(*txq),
+ RTE_CACHE_LINE_SIZE, socket_id);
+ if (txq == NULL) {
+ PMD_LOG(ERR, "txq alloc failed");
+ return -1;
+ }
+
+ txq->port_id = dev->data->port_id;
+ txq->queue_id = queue_id;
+ txq->free_thresh = tx_conf->tx_free_thresh;
+ dev->data->tx_queues[queue_id] = txq;
+
+ if (io_uring_queue_init(nb_tx_desc, &txq->io_ring, 0) != 0) {
+ PMD_LOG(ERR, "io_uring_queue_init failed: %s", strerror(errno));
+ rte_free(txq);
+ return -1;
+ }
+
+ return 0;
+}
+
+static void
+rtap_tx_cleanup(struct rtap_tx_queue *txq)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ unsigned int num_cqe = 0;
+
+ io_uring_for_each_cqe(&txq->io_ring, head, cqe) {
+ struct rte_mbuf *mb = (void *)(uintptr_t)cqe->user_data;
+
+ ++num_cqe;
+
+ /* Skip CQEs with NULL user_data (e.g., cancel operations) */
+ if (mb == NULL)
+ continue;
+
+ PMD_TX_LOG(DEBUG, " mbuf len %u result: %d", mb->pkt_len, cqe->res);
+ txq->tx_errors += (cqe->res < 0);
+ rte_pktmbuf_free(mb);
+ }
+ io_uring_cq_advance(&txq->io_ring, num_cqe);
+}
+
+void
+rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_tx_queue *txq = dev->data->tx_queues[queue_id];
+
+ if (txq == NULL)
+ return;
+
+ /* First drain any completed TX operations */
+ rtap_tx_cleanup(txq);
+
+ /* Cancel all remaining pending operations and free mbufs */
+ rtap_cancel_all(&txq->io_ring);
+ io_uring_queue_exit(&txq->io_ring);
+
+ rte_free(txq);
+
+ /* Close the shared TAP fd if the rx queue is already gone */
+ if (queue_id >= dev->data->nb_rx_queues ||
+ dev->data->rx_queues[queue_id] == NULL)
+ rtap_queue_close(dev, queue_id);
+}
+
+/* Convert mbuf offload flags to virtio net header */
+static void
+rtap_tx_offload(struct virtio_net_hdr *hdr, const struct rte_mbuf *m)
+{
+ uint64_t csum_l4 = m->ol_flags & RTE_MBUF_F_TX_L4_MASK;
+ uint16_t o_l23_len = (m->ol_flags & RTE_MBUF_F_TX_TUNNEL_MASK) ?
+ m->outer_l2_len + m->outer_l3_len : 0;
+
+ memset(hdr, 0, sizeof(*hdr));
+
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ csum_l4 |= RTE_MBUF_F_TX_TCP_CKSUM;
+
+ switch (csum_l4) {
+ case RTE_MBUF_F_TX_UDP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_udp_hdr, dgram_cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+
+ case RTE_MBUF_F_TX_TCP_CKSUM:
+ hdr->csum_start = o_l23_len + m->l2_len + m->l3_len;
+ hdr->csum_offset = offsetof(struct rte_tcp_hdr, cksum);
+ hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ break;
+ }
+
+ /* TCP Segmentation Offload */
+ if (m->ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ hdr->gso_type = (m->ol_flags & RTE_MBUF_F_TX_IPV6) ?
+ VIRTIO_NET_HDR_GSO_TCPV6 :
+ VIRTIO_NET_HDR_GSO_TCPV4;
+ hdr->gso_size = m->tso_segsz;
+ hdr->hdr_len = o_l23_len + m->l2_len + m->l3_len + m->l4_len;
+ }
+}
+
+/*
+ * Transmit burst posts mbufs into the io_uring TAP file descriptor
+ * by creating queue elements with write operation.
+ *
+ * The driver mimics the behavior of a real hardware NIC.
+ *
+ * If there is no space left in the io_uring then the driver returns the number of
+ * mbufs that were processed to that point. The application can then decide to retry
+ * later or drop the unsent packets under backpressure.
+ *
+ * The transmit process puts the virtio header before the data. In some cases, a new mbuf
+ * is required from the same pool as the original; if that allocation fails, the packet
+ * is not sent and is silently dropped. This avoids the situation where the pool is so
+ * small that transmit stalls when pool resources are nearly exhausted.
+ */
+uint16_t
+rtap_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+ struct rtap_tx_queue *txq = queue;
+ uint16_t i, num_tx = 0, num_pend = 0;
+ uint32_t num_tx_bytes = 0;
+
+ PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
+
+ unsigned int space_desired = RTE_MAX(txq->free_thresh, nb_pkts);
+ if (io_uring_sq_space_left(&txq->io_ring) < space_desired)
+ rtap_tx_cleanup(txq);
+
+ int fd = rtap_queue_fd(txq->port_id, txq->queue_id);
+
+ for (i = 0; i < nb_pkts; i++) {
+ struct rte_mbuf *mb = bufs[i];
+ struct virtio_net_hdr *hdr;
+
+		/* Use packet headroom space for the virtio header (if possible) */
+ if (rte_mbuf_refcnt_read(mb) == 1 && RTE_MBUF_DIRECT(mb) &&
+ rte_pktmbuf_headroom(mb) >= sizeof(*hdr)) {
+ hdr = rte_pktmbuf_mtod_offset(mb, struct virtio_net_hdr *, -sizeof(*hdr));
+ } else {
+ /* Need to chain a new mbuf to make room for virtio header */
+ struct rte_mbuf *mh = rte_pktmbuf_alloc(mb->pool);
+ if (unlikely(mh == NULL)) {
+ PMD_TX_LOG(DEBUG, "mbuf pool exhausted on transmit");
+ rte_pktmbuf_free(mb);
+ ++txq->tx_errors;
+ continue;
+ }
+
+ /* The packet headroom should be available in newly allocated mbuf */
+ RTE_ASSERT(rte_pktmbuf_headroom(mh) >= sizeof(*hdr));
+
+ hdr = rte_pktmbuf_mtod_offset(mh, struct virtio_net_hdr *, -sizeof(*hdr));
+ mh->next = mb;
+ mh->nb_segs = mb->nb_segs + 1;
+ mh->pkt_len = mb->pkt_len;
+ mh->ol_flags = mb->ol_flags & RTE_MBUF_F_TX_OFFLOAD_MASK;
+ mb = mh;
+ }
+
+ struct io_uring_sqe *sqe = io_uring_get_sqe(&txq->io_ring);
+ if (sqe == NULL) {
+ /* Drop header mbuf if it was used */
+ if (mb != bufs[i])
+ rte_pktmbuf_free_seg(mb);
+ break; /* submit ring is full */
+ }
+
+		/* Note: the transmit byte count does not include the virtio header */
+ ++num_tx;
+ num_tx_bytes += mb->pkt_len;
+
+ io_uring_sqe_set_data(sqe, mb);
+ rtap_tx_offload(hdr, mb);
+
+ PMD_TX_LOG(DEBUG, "write m=%p segs=%u", mb, mb->nb_segs);
+
+ /* Start of data written to kernel includes virtio net header */
+ void *buf = rte_pktmbuf_mtod_offset(mb, void *, -sizeof(*hdr));
+ unsigned int nbytes = sizeof(struct virtio_net_hdr) + mb->data_len;
+
+ if (mb->nb_segs == 1) {
+ /* Single segment mbuf can go as write and batched */
+ io_uring_prep_write(sqe, fd, buf, nbytes, 0);
+ ++num_pend;
+ } else {
+			/* Multi-segment mbuf needs scatter/gather */
+ struct iovec iovs[IOV_MAX];
+ unsigned int niov = mb->nb_segs;
+
+ if (unlikely(niov > IOV_MAX)) {
+ PMD_TX_LOG(ERR, "Tx nsegs %u > max %u",
+ niov, IOV_MAX);
+ ++txq->tx_errors;
+ rte_pktmbuf_free(mb);
+ continue;
+ }
+
+ iovs[0].iov_base = buf;
+ iovs[0].iov_len = nbytes;
+
+ for (unsigned int v = 1; v < niov; v++) {
+ mb = mb->next;
+ iovs[v].iov_base = rte_pktmbuf_mtod(mb, void *);
+ iovs[v].iov_len = mb->data_len;
+ }
+
+ io_uring_prep_writev(sqe, fd, iovs, niov, 0);
+
+ /*
+ * For writev, submit now since iovs[] is on the stack
+ * and must remain valid until submitted.
+ * This also submits any previously batched single-seg writes.
+ */
+ int err = io_uring_submit(&txq->io_ring);
+ if (unlikely(err < 0)) {
+ PMD_TX_LOG(ERR, "Tx io_uring submit failed: %s", strerror(-err));
+ ++txq->tx_errors;
+ }
+
+ num_pend = 0;
+ }
+ }
+
+ if (likely(num_pend > 0)) {
+ int err = io_uring_submit(&txq->io_ring);
+ if (unlikely(err < 0)) {
+ PMD_LOG(ERR, "Tx io_uring submit failed: %s", strerror(-err));
+ ++txq->tx_errors;
+ }
+ }
+
+ txq->tx_packets += num_tx;
+ txq->tx_bytes += num_tx_bytes;
+
+ return num_tx;
+}
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 04/11] net/rtap: add statistics and device info
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (2 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 03/11] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 05/11] net/rtap: add link and device management operations Stephen Hemminger
` (7 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Implement basic and per-queue statistics collection.
The usual statistics are tracked per queue: packets, bytes, and errors.
The imissed counter reports kernel rx_dropped (packets dropped
before reaching the interface) retrieved via netlink IFLA_STATS64.
The rx_drop_base field captures the baseline value at stats_reset.
Device info reports ifindex, max queue counts, and default
burst/ring sizes.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 2 +
drivers/net/rtap/rtap.h | 2 +
drivers/net/rtap/rtap_ethdev.c | 120 ++++++++++++++++++++++++++++++
3 files changed, 124 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index c064e1e0b9..9bef9e341d 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -5,6 +5,8 @@
;
[Features]
Scattered Rx = P
+Basic stats = Y
+Stats per queue = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 823b8c59f7..abfde20e60 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -62,6 +62,8 @@ struct rtap_pmd {
int if_index; /* interface index */
int nlsk_fd; /* netlink control socket */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
+
+ uint64_t rx_drop_base; /* value of rx_dropped when reset */
};
/* rtap_ethdev.c */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 463e24a6e1..888f8e7f39 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -2,6 +2,7 @@
* Copyright (c) 2026 Stephen Hemminger
*/
+#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
@@ -11,6 +12,7 @@
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
+#include <linux/if_link.h>
#include <linux/virtio_net.h>
#include <rte_config.h>
@@ -29,6 +31,14 @@
#define RTAP_DEFAULT_IFNAME "rtap%d"
+#define RTAP_DEFAULT_BURST 64
+#define RTAP_NUM_BUFFERS 1024
+#define RTAP_MAX_QUEUES 128
+#define RTAP_MIN_RX_BUFSIZE RTE_ETHER_MIN_LEN
+#define RTAP_MAX_RX_PKTLEN RTE_ETHER_MAX_JUMBO_FRAME_LEN
+
+static_assert(RTAP_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd limit");
+
#define RTAP_IFACE_ARG "iface"
#define RTAP_PERSIST_ARG "persist"
@@ -157,6 +167,112 @@ rtap_dev_configure(struct rte_eth_dev *dev)
return 0;
}
+static int
+rtap_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ dev_info->if_index = pmd->if_index;
+ dev_info->max_mac_addrs = 1;
+ dev_info->max_rx_pktlen = RTAP_MAX_RX_PKTLEN;
+ dev_info->min_rx_bufsize = RTAP_MIN_RX_BUFSIZE;
+ dev_info->max_rx_queues = RTAP_MAX_QUEUES;
+ dev_info->max_tx_queues = RTAP_MAX_QUEUES;
+
+ dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = RTAP_DEFAULT_BURST,
+ .ring_size = RTAP_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
+ dev_info->default_txportconf = (struct rte_eth_dev_portconf) {
+ .burst_size = RTAP_DEFAULT_BURST,
+ .ring_size = RTAP_NUM_BUFFERS,
+ .nb_queues = 1,
+ };
+ return 0;
+}
+
+static int
+rtap_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats,
+ struct eth_queue_stats *qstats)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ struct rtnl_link_stats64 kstats;
+ uint16_t i;
+
+ for (i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+ if (rxq == NULL)
+ continue;
+
+ stats->ipackets += rxq->rx_packets;
+ stats->ibytes += rxq->rx_bytes;
+ stats->ierrors += rxq->rx_errors;
+
+ if (qstats != NULL && i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ qstats->q_ipackets[i] = rxq->rx_packets;
+ qstats->q_ibytes[i] = rxq->rx_bytes;
+ qstats->q_errors[i] = rxq->rx_errors;
+ }
+ }
+
+ for (i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[i];
+ if (txq == NULL)
+ continue;
+
+ stats->opackets += txq->tx_packets;
+ stats->obytes += txq->tx_bytes;
+ stats->oerrors += txq->tx_errors;
+
+ if (qstats != NULL && i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
+ qstats->q_opackets[i] = txq->tx_packets;
+ qstats->q_obytes[i] = txq->tx_bytes;
+ }
+ }
+
+ /* Get kernel rx_dropped counter via netlink */
+ if (rtap_nl_get_stats(pmd->if_index, &kstats) == 0 &&
+ kstats.rx_dropped > pmd->rx_drop_base)
+ stats->imissed = kstats.rx_dropped - pmd->rx_drop_base;
+
+ return 0;
+}
+
+static int
+rtap_stats_reset(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ struct rtnl_link_stats64 kstats;
+ uint16_t i;
+
+ for (i = 0; i < dev->data->nb_rx_queues; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+ if (rxq == NULL)
+ continue;
+
+ rxq->rx_packets = 0;
+ rxq->rx_bytes = 0;
+ rxq->rx_errors = 0;
+ }
+
+ for (i = 0; i < dev->data->nb_tx_queues; i++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[i];
+ if (txq == NULL)
+ continue;
+
+ txq->tx_packets = 0;
+ txq->tx_bytes = 0;
+ txq->tx_errors = 0;
+ }
+
+ /* Capture current rx_dropped as baseline via netlink */
+ if (rtap_nl_get_stats(pmd->if_index, &kstats) == 0)
+ pmd->rx_drop_base = kstats.rx_dropped;
+
+ return 0;
+}
+
static int
rtap_dev_close(struct rte_eth_dev *dev)
{
@@ -247,7 +363,10 @@ static const struct eth_dev_ops rtap_ops = {
.dev_start = rtap_dev_start,
.dev_stop = rtap_dev_stop,
.dev_configure = rtap_dev_configure,
+ .dev_infos_get = rtap_dev_info,
.dev_close = rtap_dev_close,
+ .stats_get = rtap_stats_get,
+ .stats_reset = rtap_stats_reset,
.rx_queue_setup = rtap_rx_queue_setup,
.rx_queue_release = rtap_rx_queue_release,
.tx_queue_setup = rtap_tx_queue_setup,
@@ -262,6 +381,7 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
pmd->keep_fd = -1;
pmd->nlsk_fd = -1;
+ pmd->rx_drop_base = 0;
dev->dev_ops = &rtap_ops;
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 05/11] net/rtap: add link and device management operations
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (3 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 04/11] net/rtap: add statistics and device info Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 06/11] net/rtap: add checksum and TSO offload support Stephen Hemminger
` (6 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add device management operations via netlink:
- Link up/down control
- Link status query (checks IFF_UP and IFF_RUNNING)
- Promiscuous mode enable/disable
- All-multicast mode enable/disable
- MTU configuration
- MAC address configuration
All operations use the persistent netlink socket (pmd->nlsk_fd)
and ifindex for stable interface identification across renames
and namespace moves.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 4 ++
drivers/net/rtap/rtap.h | 1 +
drivers/net/rtap/rtap_ethdev.c | 102 ++++++++++++++++++++++++++++++
3 files changed, 107 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index 9bef9e341d..ce0804d795 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -4,6 +4,10 @@
; Refer to default.ini for the full list of available PMD features.
;
[Features]
+Link status = Y
+MTU update = Y
+Promiscuous mode = Y
+Allmulticast mode = Y
Scattered Rx = P
Basic stats = Y
Stats per queue = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index abfde20e60..aeda7f3f25 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -69,6 +69,7 @@ struct rtap_pmd {
/* rtap_ethdev.c */
int rtap_queue_open(struct rte_eth_dev *dev, uint16_t queue_id);
void rtap_queue_close(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_link_update(struct rte_eth_dev *dev, int wait_to_complete);
/* rtap_netlink.c */
int rtap_nl_open(unsigned int groups);
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 888f8e7f39..b6c6421e3d 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -108,9 +108,100 @@ rtap_tap_open(const char *name, struct ifreq *ifr, uint8_t persist)
return -1;
}
+static int
+rtap_set_link_up(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, IFF_UP, IFF_UP);
+}
+
+static int
+rtap_set_link_down(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, 0, IFF_UP);
+}
+
+static int
+rtap_promiscuous_enable(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, IFF_PROMISC, IFF_PROMISC);
+}
+
+static int
+rtap_promiscuous_disable(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, 0, IFF_PROMISC);
+}
+
+static int
+rtap_allmulticast_enable(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, IFF_ALLMULTI, IFF_ALLMULTI);
+}
+
+static int
+rtap_allmulticast_disable(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_change_flags(pmd->nlsk_fd, pmd->if_index, 0, IFF_ALLMULTI);
+}
+
+int
+rtap_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ struct rte_eth_link link = {
+ .link_speed = RTE_ETH_SPEED_NUM_UNKNOWN,
+ .link_duplex = RTE_ETH_LINK_FULL_DUPLEX,
+ .link_autoneg = RTE_ETH_LINK_FIXED,
+ .link_status = RTE_ETH_LINK_DOWN,
+ };
+ unsigned int flags = 0;
+
+ if (rtap_nl_get_flags(pmd->nlsk_fd, pmd->if_index, &flags) < 0)
+ return -1;
+
+ if ((flags & IFF_UP) && (flags & IFF_RUNNING))
+ link.link_status = RTE_ETH_LINK_UP;
+
+ rte_eth_linkstatus_set(dev, &link);
+ return 0;
+}
+
+static int
+rtap_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_set_mtu(pmd->nlsk_fd, pmd->if_index, mtu);
+}
+
+static int
+rtap_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ return rtap_nl_set_mac(pmd->nlsk_fd, pmd->if_index, addr);
+}
+
static int
rtap_dev_start(struct rte_eth_dev *dev)
{
+ int ret = rtap_set_link_up(dev);
+
+ if (ret != 0)
+ return ret;
+
dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
@@ -127,6 +218,8 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_set_link_down(dev);
+
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
dev->data->tx_queue_state[i] = RTE_ETH_QUEUE_STATE_STOPPED;
@@ -365,6 +458,15 @@ static const struct eth_dev_ops rtap_ops = {
.dev_configure = rtap_dev_configure,
.dev_infos_get = rtap_dev_info,
.dev_close = rtap_dev_close,
+ .link_update = rtap_link_update,
+ .dev_set_link_up = rtap_set_link_up,
+ .dev_set_link_down = rtap_set_link_down,
+ .mac_addr_set = rtap_macaddr_set,
+ .mtu_set = rtap_mtu_set,
+ .promiscuous_enable = rtap_promiscuous_enable,
+ .promiscuous_disable = rtap_promiscuous_disable,
+ .allmulticast_enable = rtap_allmulticast_enable,
+ .allmulticast_disable = rtap_allmulticast_disable,
.stats_get = rtap_stats_get,
.stats_reset = rtap_stats_reset,
.rx_queue_setup = rtap_rx_queue_setup,
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 06/11] net/rtap: add checksum and TSO offload support
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (4 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 05/11] net/rtap: add link and device management operations Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 07/11] net/rtap: add multi-process support Stephen Hemminger
` (5 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add transmit and receive offload support using the virtio-net header
that is exchanged with the kernel TAP device.
Tx offloads:
- UDP and TCP checksum offload: set VIRTIO_NET_HDR_F_NEEDS_CSUM
with appropriate csum_start and csum_offset fields
- TCP segmentation offload (TSO): set gso_type and gso_size
Rx offloads:
- Enable receive offloads based on configuration
Report offload capabilities in dev_info.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 2 ++
drivers/net/rtap/rtap_ethdev.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index ce0804d795..b8eaa805fe 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -11,6 +11,8 @@ Allmulticast mode = Y
Scattered Rx = P
Basic stats = Y
Stats per queue = Y
+TSO = Y
+L4 checksum offload = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index b6c6421e3d..c4125e04b3 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -31,6 +31,16 @@
#define RTAP_DEFAULT_IFNAME "rtap%d"
+#define RTAP_TX_OFFLOAD (RTE_ETH_TX_OFFLOAD_MULTI_SEGS | \
+ RTE_ETH_TX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_TX_OFFLOAD_TCP_TSO)
+
+#define RTAP_RX_OFFLOAD (RTE_ETH_RX_OFFLOAD_UDP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_CKSUM | \
+ RTE_ETH_RX_OFFLOAD_TCP_LRO | \
+ RTE_ETH_RX_OFFLOAD_SCATTER)
+
#define RTAP_DEFAULT_BURST 64
#define RTAP_NUM_BUFFERS 1024
#define RTAP_MAX_QUEUES 128
@@ -250,7 +260,21 @@ rtap_dev_configure(struct rte_eth_dev *dev)
return -EINVAL;
}
- if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, 0) != 0) {
+ /*
+ * Set offload flags visible on the kernel network interface.
+	 * This controls whether the kernel will use checksum offload, etc.
+ * Note: kernel transmit is DPDK receive.
+ */
+ const struct rte_eth_rxmode *rx_mode = &dev->data->dev_conf.rxmode;
+ unsigned int offload = 0;
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_CHECKSUM) {
+ offload |= TUN_F_CSUM;
+
+ if (rx_mode->offloads & RTE_ETH_RX_OFFLOAD_TCP_LRO)
+ offload |= TUN_F_TSO4 | TUN_F_TSO6 | TUN_F_TSO_ECN;
+ }
+
+ if (ioctl(pmd->keep_fd, TUNSETOFFLOAD, offload) != 0) {
int ret = -errno;
PMD_LOG(ERR, "ioctl(TUNSETOFFLOAD) failed: %s", strerror(errno));
@@ -271,6 +295,10 @@ rtap_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->min_rx_bufsize = RTAP_MIN_RX_BUFSIZE;
dev_info->max_rx_queues = RTAP_MAX_QUEUES;
dev_info->max_tx_queues = RTAP_MAX_QUEUES;
+ dev_info->rx_queue_offload_capa = RTAP_RX_OFFLOAD;
+ dev_info->rx_offload_capa = dev_info->rx_queue_offload_capa;
+ dev_info->tx_queue_offload_capa = RTAP_TX_OFFLOAD;
+ dev_info->tx_offload_capa = dev_info->tx_queue_offload_capa;
dev_info->default_rxportconf = (struct rte_eth_dev_portconf) {
.burst_size = RTAP_DEFAULT_BURST,
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 07/11] net/rtap: add multi-process support
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (5 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 06/11] net/rtap: add checksum and TSO offload support Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 08/11] net/rtap: add link state change interrupt Stephen Hemminger
` (4 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger, Anatoly Burakov
Add support for DPDK multi-process operation. Secondary processes
need access to the per-queue TAP file descriptors owned by the
primary process.
Implement fd sharing using the DPDK multi-process IPC mechanism:
- Primary registers an MP action handler that responds to fd
requests by sending queue fds via rte_mp_reply()
- Secondary process attaches to the existing device and requests
fds from primary via rte_mp_request_sync()
The MP action is registered on the first device probe and
unregistered when the last device is closed.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/rtap_ethdev.c | 140 ++++++++++++++++++++++++++++++
2 files changed, 141 insertions(+)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index b8eaa805fe..cfbd29ef08 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -13,6 +13,7 @@ Basic stats = Y
Stats per queue = Y
TSO = Y
L4 checksum offload = Y
+Multiprocess aware = Y
Linux = Y
ARMv7 = Y
ARMv8 = Y
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index c4125e04b3..705d98323e 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -9,6 +9,7 @@
#include <string.h>
#include <stdint.h>
#include <unistd.h>
+#include <time.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
@@ -19,10 +20,12 @@
#include <rte_common.h>
#include <rte_dev.h>
#include <rte_eal.h>
+#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_kvargs.h>
#include <rte_log.h>
+#include <rte_stdatomic.h>
#include <bus_vdev_driver.h>
#include <ethdev_driver.h>
#include <ethdev_vdev.h>
@@ -41,6 +44,8 @@
RTE_ETH_RX_OFFLOAD_TCP_LRO | \
RTE_ETH_RX_OFFLOAD_SCATTER)
+#define RTAP_MP_KEY "rtap_mp_send_fds"
+
#define RTAP_DEFAULT_BURST 64
#define RTAP_NUM_BUFFERS 1024
#define RTAP_MAX_QUEUES 128
@@ -52,6 +57,8 @@ static_assert(RTAP_MAX_QUEUES <= RTE_MP_MAX_FD_NUM, "Max queues exceeds MP fd li
#define RTAP_IFACE_ARG "iface"
#define RTAP_PERSIST_ARG "persist"
+static RTE_ATOMIC(unsigned int) rtap_dev_count;
+
static const char * const valid_arguments[] = {
RTAP_IFACE_ARG,
RTAP_PERSIST_ARG,
@@ -433,6 +440,8 @@ rtap_dev_close(struct rte_eth_dev *dev)
free(dev->process_private);
dev->process_private = NULL;
+ if (rte_atomic_fetch_sub_explicit(&rtap_dev_count, 1, rte_memory_order_release) == 1)
+ rte_mp_action_unregister(RTAP_MP_KEY);
return 0;
}
@@ -578,6 +587,96 @@ rtap_parse_iface(const char *key __rte_unused, const char *value, void *extra_ar
return 0;
}
+/* Secondary process requests rxq fds from primary. */
+static int
+rtap_request_fds(const char *name, struct rte_eth_dev *dev)
+{
+ struct rte_mp_msg request = { };
+
+ strlcpy(request.name, RTAP_MP_KEY, sizeof(request.name));
+ strlcpy((char *)request.param, name, RTE_MP_MAX_PARAM_LEN);
+ request.len_param = strlen(name) + 1;
+
+ /* Send the request and receive the reply */
+ PMD_LOG(DEBUG, "Sending multi-process IPC request for %s", name);
+
+ struct timespec timeout = {.tv_sec = 1, .tv_nsec = 0};
+ struct rte_mp_reply replies;
+ int ret = rte_mp_request_sync(&request, &replies, &timeout);
+ if (ret < 0) {
+ PMD_LOG(ERR, "Failed to request fds from primary: %s",
+ rte_strerror(rte_errno));
+ return -1;
+ }
+
+ struct rte_mp_msg *reply = replies.msgs;
+ PMD_LOG(DEBUG, "Received multi-process IPC reply for %s", name);
+
+ if (replies.nb_received != 1) {
+ PMD_LOG(ERR, "Got %u replies from primary", replies.nb_received);
+ free(reply);
+ return -EINVAL;
+ }
+
+ if (dev->data->nb_rx_queues != reply->num_fds) {
+ PMD_LOG(ERR, "Incorrect number of fds received: %d != %d",
+ reply->num_fds, dev->data->nb_rx_queues);
+ free(reply);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (int i = 0; i < reply->num_fds; i++) {
+ fds[i] = reply->fds[i];
+ PMD_LOG(DEBUG, "Received queue %u fd %d from primary", i, fds[i]);
+ }
+
+ free(reply);
+ return 0;
+}
+
+/* Primary process sends rxq fds to secondary. */
+static int
+rtap_mp_send_fds(const struct rte_mp_msg *request, const void *peer)
+{
+ const char *request_name = (const char *)request->param;
+
+ PMD_LOG(DEBUG, "Received multi-process IPC request for %s", request_name);
+
+ /* Find the requested port */
+ struct rte_eth_dev *dev = rte_eth_dev_get_by_name(request_name);
+ if (dev == NULL) {
+ PMD_LOG(ERR, "Failed to get port id for %s", request_name);
+ return -1;
+ }
+
+ /* Populate the reply with the fds for each queue */
+ struct rte_mp_msg reply = { };
+ if (dev->data->nb_rx_queues > RTE_MP_MAX_FD_NUM) {
+ PMD_LOG(ERR, "Number of rx queues (%d) exceeds max number of fds (%d)",
+ dev->data->nb_rx_queues, RTE_MP_MAX_FD_NUM);
+ return -EINVAL;
+ }
+
+ int *fds = dev->process_private;
+ for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
+ PMD_LOG(DEBUG, "Send queue %u fd %d to secondary", i, fds[i]);
+ reply.fds[reply.num_fds++] = fds[i];
+ }
+
+ /* Send the reply */
+ strlcpy(reply.name, request->name, sizeof(reply.name));
+ strlcpy((char *)reply.param, request_name, RTE_MP_MAX_PARAM_LEN);
+ reply.len_param = strlen(request_name) + 1;
+
+ PMD_LOG(DEBUG, "Sending multi-process IPC reply for %s", request_name);
+ if (rte_mp_reply(&reply, peer) < 0) {
+ PMD_LOG(ERR, "Failed to reply to multi-process IPC request");
+ return -1;
+ }
+ return 0;
+}
+
static int
rtap_probe(struct rte_vdev_device *vdev)
{
@@ -592,6 +691,38 @@ rtap_probe(struct rte_vdev_device *vdev)
PMD_LOG(INFO, "Initializing %s", name);
+ if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+ eth_dev = rte_eth_dev_attach_secondary(name);
+ if (eth_dev == NULL) {
+ PMD_LOG(ERR, "Failed to probe %s", name);
+ return -1;
+ }
+ eth_dev->dev_ops = &rtap_ops;
+ eth_dev->data->dev_flags |= RTE_ETH_DEV_AUTOFILL_QUEUE_XSTATS;
+ eth_dev->device = &vdev->device;
+
+ if (!rte_eal_primary_proc_alive(NULL)) {
+ PMD_LOG(ERR, "Primary process is missing");
+ goto error;
+ }
+
+ fds = calloc(RTE_MAX_QUEUES_PER_PORT, sizeof(int));
+ if (fds == NULL) {
+ PMD_LOG(ERR, "Failed to alloc memory for process private");
+ goto error;
+ }
+ for (uint16_t i = 0; i < RTE_MAX_QUEUES_PER_PORT; i++)
+ fds[i] = -1;
+
+ eth_dev->process_private = fds;
+
+ if (rtap_request_fds(name, eth_dev))
+ goto error;
+
+ rte_eth_dev_probing_finish(eth_dev);
+ return 0;
+ }
+
if (params != NULL) {
kvlist = rte_kvargs_parse(params, valid_arguments);
if (kvlist == NULL)
@@ -630,6 +761,15 @@ rtap_probe(struct rte_vdev_device *vdev)
if (rtap_create(eth_dev, tap_name, persist) < 0)
goto error;
+ /* register the MP server on the first device */
+ if (rte_atomic_fetch_add_explicit(&rtap_dev_count, 1, rte_memory_order_acquire) == 0 &&
+ rte_mp_action_register(RTAP_MP_KEY, rtap_mp_send_fds) < 0) {
+ rte_atomic_store_explicit(&rtap_dev_count, 0, rte_memory_order_relaxed);
+ PMD_LOG(ERR, "Failed to register multi-process callback: %s",
+ rte_strerror(rte_errno));
+ goto error;
+ }
+
rte_eth_dev_probing_finish(eth_dev);
rte_kvargs_free(kvlist);
return 0;
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 08/11] net/rtap: add link state change interrupt
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (6 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 07/11] net/rtap: add multi-process support Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 09/11] net/rtap: add Rx interrupt support Stephen Hemminger
` (3 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add link state change (LSC) notification support using a netlink
socket subscribed to RTMGRP_LINK events.
When enabled, the driver registers a netlink socket with the EAL
interrupt handler. On receiving RTM_NEWLINK/RTM_DELLINK messages
matching the interface ifindex, the driver updates link status and
triggers RTE_ETH_EVENT_INTR_LSC callbacks.
LSC is enabled in dev_start and disabled in dev_stop.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 4 ++
drivers/net/rtap/rtap_ethdev.c | 26 ++++++++-
drivers/net/rtap/rtap_intr.c | 87 +++++++++++++++++++++++++++++++
5 files changed, 118 insertions(+), 1 deletion(-)
create mode 100644 drivers/net/rtap/rtap_intr.c
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index cfbd29ef08..fe0c88a8fc 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -5,6 +5,7 @@
;
[Features]
Link status = Y
+Link status event = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index ed0bcc1313..58943e035a 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -19,6 +19,7 @@ endif
sources = files(
'rtap_ethdev.c',
+ 'rtap_intr.c',
'rtap_netlink.c',
'rtap_rxtx.c',
)
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index aeda7f3f25..2cc55c9667 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -61,6 +61,7 @@ struct rtap_pmd {
int keep_fd; /* keep alive file descriptor */
int if_index; /* interface index */
int nlsk_fd; /* netlink control socket */
+ struct rte_intr_handle *intr_handle; /* LSC interrupt handle */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
uint64_t rx_drop_base; /* value of rx_dropped when reset */
@@ -98,4 +99,7 @@ int rtap_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id,
const struct rte_eth_txconf *tx_conf);
void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
+/* rtap_intr.c */
+int rtap_lsc_set(struct rte_eth_dev *dev, int set);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 705d98323e..9c23ca2f16 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -214,11 +214,18 @@ rtap_macaddr_set(struct rte_eth_dev *dev, struct rte_ether_addr *addr)
static int
rtap_dev_start(struct rte_eth_dev *dev)
{
- int ret = rtap_set_link_up(dev);
+ int ret;
+ ret = rtap_lsc_set(dev, 1);
if (ret != 0)
return ret;
+ ret = rtap_set_link_up(dev);
+ if (ret != 0) {
+ rtap_lsc_set(dev, 0);
+ return ret;
+ }
+
dev->data->dev_link.link_status = RTE_ETH_LINK_UP;
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
dev->data->rx_queue_state[i] = RTE_ETH_QUEUE_STATE_STARTED;
@@ -235,6 +242,7 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_lsc_set(dev, 0);
rtap_set_link_down(dev);
for (uint16_t i = 0; i < dev->data->nb_rx_queues; i++) {
@@ -435,6 +443,9 @@ rtap_dev_close(struct rte_eth_dev *dev)
close(pmd->nlsk_fd);
pmd->nlsk_fd = -1;
}
+
+ rte_intr_instance_free(pmd->intr_handle);
+ pmd->intr_handle = NULL;
}
free(dev->process_private);
@@ -522,6 +533,17 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
pmd->nlsk_fd = -1;
pmd->rx_drop_base = 0;
+ /* Allocate interrupt instance for link state change events */
+ pmd->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+ if (pmd->intr_handle == NULL) {
+ PMD_LOG(ERR, "Failed to allocate intr handle");
+ goto error;
+ }
+ rte_intr_type_set(pmd->intr_handle, RTE_INTR_HANDLE_EXT);
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ dev->intr_handle = pmd->intr_handle;
+ data->dev_flags |= RTE_ETH_DEV_INTR_LSC;
+
dev->dev_ops = &rtap_ops;
/* Get the initial fd used to keep the tap device around */
@@ -571,6 +593,8 @@ rtap_create(struct rte_eth_dev *dev, const char *tap_name, uint8_t persist)
close(pmd->nlsk_fd);
if (pmd->keep_fd != -1)
close(pmd->keep_fd);
+ rte_intr_instance_free(pmd->intr_handle);
+ pmd->intr_handle = NULL;
return -1;
}
diff --git a/drivers/net/rtap/rtap_intr.c b/drivers/net/rtap/rtap_intr.c
new file mode 100644
index 0000000000..6cdfb412c3
--- /dev/null
+++ b/drivers/net/rtap/rtap_intr.c
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <errno.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <linux/rtnetlink.h>
+
+#include <rte_cycles.h>
+#include <rte_ethdev.h>
+#include <rte_errno.h>
+#include <ethdev_driver.h>
+#include <rte_interrupts.h>
+
+#include "rtap.h"
+
+/* Interrupt handler called by EAL when netlink socket is readable */
+static void
+rtap_lsc_handler(void *cb_arg)
+{
+ struct rte_eth_dev *dev = cb_arg;
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ int fd = rte_intr_fd_get(pmd->intr_handle);
+
+ if (fd >= 0)
+ rtap_nl_recv(fd, dev);
+}
+
+/*
+ * Enable or disable link state change interrupt.
+ * When enabled, creates a netlink socket subscribed to RTMGRP_LINK
+ * and registers it with the EAL interrupt handler.
+ */
+int
+rtap_lsc_set(struct rte_eth_dev *dev, int set)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ unsigned int retry = 10;
+ int ret;
+
+ /* If LSC not configured, just disable if active */
+ if (!dev->data->dev_conf.intr_conf.lsc) {
+ if (rte_intr_fd_get(pmd->intr_handle) != -1)
+ goto disable;
+ return 0;
+ }
+
+ if (set) {
+ int fd = rtap_nl_open(RTMGRP_LINK);
+ if (fd < 0)
+ return -1;
+
+ rte_intr_fd_set(pmd->intr_handle, fd);
+ ret = rte_intr_callback_register(pmd->intr_handle,
+ rtap_lsc_handler, dev);
+ if (ret < 0) {
+ PMD_LOG(ERR, "Failed to register LSC callback: %s",
+ rte_strerror(-ret));
+ close(fd);
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ return ret;
+ }
+ return 0;
+ }
+
+disable:
+ do {
+ ret = rte_intr_callback_unregister(pmd->intr_handle,
+ rtap_lsc_handler, dev);
+ if (ret >= 0) {
+ break;
+ } else if (ret == -EAGAIN && retry-- > 0) {
+ rte_delay_ms(100);
+ } else {
+ PMD_LOG(ERR, "LSC callback unregister failed: %d", ret);
+ break;
+ }
+ } while (true);
+
+ if (rte_intr_fd_get(pmd->intr_handle) >= 0) {
+ close(rte_intr_fd_get(pmd->intr_handle));
+ rte_intr_fd_set(pmd->intr_handle, -1);
+ }
+
+ return 0;
+}
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 09/11] net/rtap: add Rx interrupt support
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (7 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 08/11] net/rtap: add link state change interrupt Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 10/11] net/rtap: add extended statistics support Stephen Hemminger
` (2 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add support for the DPDK Rx interrupt mechanism, enabling
power-aware applications (e.g. l3fwd-power) to sleep until
packets arrive rather than busy-polling.
Each Rx queue creates an eventfd during queue setup and registers
it with its io_uring instance via io_uring_register_eventfd().
When the kernel posts a CQE (completing a read, i.e. a packet
arrived), it signals the eventfd. These per-queue eventfds are
wired into a VDEV interrupt handle during dev_start when the
application has set intr_conf.rxq.
The enable op drains the eventfd counter to re-arm notification;
disable is a no-op since the application simply stops polling.
The eventfd is created unconditionally at queue setup so it is
available if the application enables Rx interrupts later.
The Rx interrupt handle is kept separate from the existing LSC
netlink interrupt handle to avoid coupling the two mechanisms.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/rtap.h | 6 ++
drivers/net/rtap/rtap_ethdev.c | 26 +++++++
drivers/net/rtap/rtap_intr.c | 120 ++++++++++++++++++++++++++++++
drivers/net/rtap/rtap_rxtx.c | 31 +++++++-
5 files changed, 183 insertions(+), 1 deletion(-)
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index fe0c88a8fc..48fe3f1b33 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -6,6 +6,7 @@
[Features]
Link status = Y
Link status event = Y
+Rx interrupt = Y
MTU update = Y
Promiscuous mode = Y
Allmulticast mode = Y
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 2cc55c9667..2c17117a80 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -38,6 +38,7 @@ extern int rtap_logtype;
struct rtap_rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted reads */
+ int intr_fd; /* eventfd for Rx interrupt */
uint16_t port_id;
uint16_t queue_id;
@@ -62,6 +63,7 @@ struct rtap_pmd {
int if_index; /* interface index */
int nlsk_fd; /* netlink control socket */
struct rte_intr_handle *intr_handle; /* LSC interrupt handle */
+ struct rte_intr_handle *rx_intr_handle; /* Rx queue interrupt handle */
struct rte_ether_addr eth_addr; /* address assigned by kernel */
uint64_t rx_drop_base; /* value of rx_dropped when reset */
@@ -101,5 +103,9 @@ void rtap_tx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id);
/* rtap_intr.c */
int rtap_lsc_set(struct rte_eth_dev *dev, int set);
+int rtap_rx_intr_vec_install(struct rte_eth_dev *dev);
+void rtap_rx_intr_vec_uninstall(struct rte_eth_dev *dev);
+int rtap_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id);
+int rtap_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id);
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 9c23ca2f16..2cbb66b675 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -220,8 +220,18 @@ rtap_dev_start(struct rte_eth_dev *dev)
if (ret != 0)
return ret;
+ /* Install Rx interrupt vector if requested by application */
+ if (dev->data->dev_conf.intr_conf.rxq) {
+ ret = rtap_rx_intr_vec_install(dev);
+ if (ret != 0) {
+ rtap_lsc_set(dev, 0);
+ return ret;
+ }
+ }
+
ret = rtap_set_link_up(dev);
if (ret != 0) {
+ rtap_rx_intr_vec_uninstall(dev);
rtap_lsc_set(dev, 0);
return ret;
}
@@ -242,6 +252,7 @@ rtap_dev_stop(struct rte_eth_dev *dev)
dev->data->dev_link.link_status = RTE_ETH_LINK_DOWN;
+ rtap_rx_intr_vec_uninstall(dev);
rtap_lsc_set(dev, 0);
rtap_set_link_down(dev);
@@ -275,6 +286,16 @@ rtap_dev_configure(struct rte_eth_dev *dev)
return -EINVAL;
}
+ /*
+ * LSC and Rx queue interrupts both need dev->intr_handle,
+ * so they cannot be enabled simultaneously.
+ */
+ if (dev->data->dev_conf.intr_conf.lsc &&
+ dev->data->dev_conf.intr_conf.rxq) {
+ PMD_LOG(ERR, "LSC and Rx queue interrupts are mutually exclusive");
+ return -ENOTSUP;
+ }
+
/*
* Set offload flags visible on the kernel network interface.
* This controls whether kernel will use checksum offload etc.
@@ -444,6 +465,9 @@ rtap_dev_close(struct rte_eth_dev *dev)
pmd->nlsk_fd = -1;
}
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+
rte_intr_instance_free(pmd->intr_handle);
pmd->intr_handle = NULL;
}
@@ -521,6 +545,8 @@ static const struct eth_dev_ops rtap_ops = {
.rx_queue_release = rtap_rx_queue_release,
.tx_queue_setup = rtap_tx_queue_setup,
.tx_queue_release = rtap_tx_queue_release,
+ .rx_queue_intr_enable = rtap_rx_queue_intr_enable,
+ .rx_queue_intr_disable = rtap_rx_queue_intr_disable,
};
static int
diff --git a/drivers/net/rtap/rtap_intr.c b/drivers/net/rtap/rtap_intr.c
index 6cdfb412c3..3704c425b8 100644
--- a/drivers/net/rtap/rtap_intr.c
+++ b/drivers/net/rtap/rtap_intr.c
@@ -85,3 +85,123 @@ rtap_lsc_set(struct rte_eth_dev *dev, int set)
return 0;
}
+
+/*
+ * Install per-queue Rx interrupt vector.
+ *
+ * Each Rx queue has an eventfd registered with its io_uring instance.
+ * When a CQE is posted (packet received), the kernel signals the eventfd.
+ * This function wires those eventfds into an rte_intr_handle so that
+ * DPDK's interrupt framework (rte_epoll_wait) can poll them.
+ *
+ * Only called when dev_conf.intr_conf.rxq is set.
+ */
+int
+rtap_rx_intr_vec_install(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+ uint16_t nb_rx = dev->data->nb_rx_queues;
+
+ if (pmd->rx_intr_handle != NULL) {
+ PMD_LOG(DEBUG, "Rx interrupt vector already installed");
+ return 0;
+ }
+
+ pmd->rx_intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
+ if (pmd->rx_intr_handle == NULL) {
+ PMD_LOG(ERR, "Failed to allocate Rx intr handle");
+ return -ENOMEM;
+ }
+
+ if (rte_intr_type_set(pmd->rx_intr_handle, RTE_INTR_HANDLE_VDEV) < 0)
+ goto error;
+
+ if (rte_intr_nb_efd_set(pmd->rx_intr_handle, nb_rx) < 0)
+ goto error;
+
+ if (rte_intr_max_intr_set(pmd->rx_intr_handle, nb_rx + 1) < 0)
+ goto error;
+
+ for (uint16_t i = 0; i < nb_rx; i++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[i];
+
+ if (rxq == NULL || rxq->intr_fd < 0) {
+ PMD_LOG(ERR, "Rx queue %u not ready for interrupts", i);
+ goto error;
+ }
+
+ if (rte_intr_efds_index_set(pmd->rx_intr_handle, i,
+ rxq->intr_fd) < 0) {
+ PMD_LOG(ERR, "Failed to set efd for queue %u", i);
+ goto error;
+ }
+ }
+
+ dev->intr_handle = pmd->rx_intr_handle;
+ PMD_LOG(DEBUG, "Rx interrupt vector installed for %u queues", nb_rx);
+ return 0;
+
+error:
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+ return -1;
+}
+
+/*
+ * Remove per-queue Rx interrupt vector.
+ * Restores dev->intr_handle to the LSC handle.
+ */
+void
+rtap_rx_intr_vec_uninstall(struct rte_eth_dev *dev)
+{
+ struct rtap_pmd *pmd = dev->data->dev_private;
+
+ if (pmd->rx_intr_handle == NULL)
+ return;
+
+ /* Restore LSC handle as device interrupt handle */
+ dev->intr_handle = pmd->intr_handle;
+
+ rte_intr_instance_free(pmd->rx_intr_handle);
+ pmd->rx_intr_handle = NULL;
+
+ PMD_LOG(DEBUG, "Rx interrupt vector uninstalled");
+}
+
+/*
+ * Enable Rx interrupt for a queue.
+ *
+ * Drain any pending eventfd notification so the next CQE
+ * triggers a fresh wakeup in rte_epoll_wait().
+ */
+int
+rtap_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id)
+{
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[queue_id];
+ uint64_t val;
+
+ if (rxq == NULL || rxq->intr_fd < 0)
+ return -EINVAL;
+
+ /* Drain the eventfd counter to re-arm notification */
+ if (read(rxq->intr_fd, &val, sizeof(val)) < 0 && errno != EAGAIN) {
+ PMD_LOG(ERR, "eventfd drain failed queue %u: %s",
+ queue_id, strerror(errno));
+ return -errno;
+ }
+
+ return 0;
+}
+
+/*
+ * Disable Rx interrupt for a queue.
+ *
+ * Nothing to do - the eventfd stays registered with io_uring
+ * but the application simply stops polling it.
+ */
+int
+rtap_rx_queue_intr_disable(struct rte_eth_dev *dev __rte_unused,
+ uint16_t queue_id __rte_unused)
+{
+ return 0;
+}
diff --git a/drivers/net/rtap/rtap_rxtx.c b/drivers/net/rtap/rtap_rxtx.c
index 7826169751..cd9b4f0bac 100644
--- a/drivers/net/rtap/rtap_rxtx.c
+++ b/drivers/net/rtap/rtap_rxtx.c
@@ -10,6 +10,7 @@
#include <stddef.h>
#include <string.h>
#include <liburing.h>
+#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <linux/virtio_net.h>
@@ -437,6 +438,7 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
rxq->mb_pool = mb_pool;
rxq->port_id = dev->data->port_id;
rxq->queue_id = queue_id;
+ rxq->intr_fd = -1;
dev->data->rx_queues[queue_id] = rxq;
if (io_uring_queue_init(nb_rx_desc, &rxq->io_ring, 0) != 0) {
@@ -444,10 +446,26 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
goto error_rxq_free;
}
+ /*
+ * Create an eventfd for Rx interrupt notification.
+ * io_uring will signal this fd whenever a CQE is posted,
+ * enabling power-aware applications to sleep until packets arrive.
+ */
+ rxq->intr_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+ if (rxq->intr_fd < 0) {
+ PMD_LOG(ERR, "eventfd failed: %s", strerror(errno));
+ goto error_iouring_exit;
+ }
+
+ if (io_uring_register_eventfd(&rxq->io_ring, rxq->intr_fd) < 0) {
+ PMD_LOG(ERR, "io_uring_register_eventfd failed: %s", strerror(errno));
+ goto error_eventfd_close;
+ }
+
mbufs = calloc(nb_rx_desc, sizeof(struct rte_mbuf *));
if (mbufs == NULL) {
PMD_LOG(ERR, "Rx mbuf pointer alloc failed");
- goto error_iouring_exit;
+ goto error_eventfd_close;
}
/* open shared tap fd maybe already setup */
@@ -494,6 +512,11 @@ rtap_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_id, uint16_t nb_rx_d
rtap_cancel_all(&rxq->io_ring);
rtap_queue_close(dev, queue_id);
free(mbufs);
+error_eventfd_close:
+ if (rxq->intr_fd >= 0) {
+ close(rxq->intr_fd);
+ rxq->intr_fd = -1;
+ }
error_iouring_exit:
io_uring_queue_exit(&rxq->io_ring);
error_rxq_free:
@@ -509,6 +532,12 @@ rtap_rx_queue_release(struct rte_eth_dev *dev, uint16_t queue_id)
if (rxq == NULL)
return;
+ if (rxq->intr_fd >= 0) {
+ io_uring_unregister_eventfd(&rxq->io_ring);
+ close(rxq->intr_fd);
+ rxq->intr_fd = -1;
+ }
+
rtap_cancel_all(&rxq->io_ring);
io_uring_queue_exit(&rxq->io_ring);
--
2.51.0
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v6 10/11] net/rtap: add extended statistics support
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (8 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 09/11] net/rtap: add Rx interrupt support Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 11/11] test: add unit tests for rtap PMD Stephen Hemminger
2026-02-15 8:58 ` [PATCH v6 00/11] net/rtap: add io_uring based TAP driver Konstantin Ananyev
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add xstats support to rtap PMD following the virtio PMD design pattern.
Statistics provided per queue (14 Rx + 12 Tx):
- Packet size distribution (6 buckets: 64, 65-127, 128-255, 256-511,
512-1023, 1024-1518 bytes)
- Packet type classification (broadcast, multicast, unicast)
- Rx offload stats (LRO, checksum validation, mbuf alloc failures)
- Tx offload stats (TSO, checksum offload, multi-segment packets)
Statistics follow DPDK naming convention:
{direction}_q{queue_id}_{detail}_{unit}
Examples:
rx_q0_size_64_packets
rx_q0_broadcast_packets
rx_q0_lro_packets
tx_q0_tso_packets
tx_q0_checksum_offload_packets
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
doc/guides/nics/features/rtap.ini | 1 +
drivers/net/rtap/meson.build | 1 +
drivers/net/rtap/rtap.h | 41 +++++
drivers/net/rtap/rtap_ethdev.c | 3 +
drivers/net/rtap/rtap_rxtx.c | 3 +
drivers/net/rtap/rtap_xstats.c | 293 ++++++++++++++++++++++++++++++
6 files changed, 342 insertions(+)
create mode 100644 drivers/net/rtap/rtap_xstats.c
diff --git a/doc/guides/nics/features/rtap.ini b/doc/guides/nics/features/rtap.ini
index 48fe3f1b33..233227e7d4 100644
--- a/doc/guides/nics/features/rtap.ini
+++ b/doc/guides/nics/features/rtap.ini
@@ -15,6 +15,7 @@ Basic stats = Y
Stats per queue = Y
TSO = Y
L4 checksum offload = Y
+Extended stats = P
Multiprocess aware = Y
Linux = Y
ARMv7 = Y
diff --git a/drivers/net/rtap/meson.build b/drivers/net/rtap/meson.build
index 58943e035a..835d1e557d 100644
--- a/drivers/net/rtap/meson.build
+++ b/drivers/net/rtap/meson.build
@@ -22,6 +22,7 @@ sources = files(
'rtap_intr.c',
'rtap_netlink.c',
'rtap_rxtx.c',
+ 'rtap_xstats.c',
)
ext_deps += liburing
diff --git a/drivers/net/rtap/rtap.h b/drivers/net/rtap/rtap.h
index 2c17117a80..ac4a616e99 100644
--- a/drivers/net/rtap/rtap.h
+++ b/drivers/net/rtap/rtap.h
@@ -35,6 +35,22 @@ extern int rtap_logtype;
#define PMD_TX_LOG(...) do { } while (0)
#endif
+/* Packet size buckets for xstats (similar to virtio PMD) */
+#define RTAP_NUM_PKT_SIZE_BUCKETS 6
+
+/* Extended statistics for Rx queues */
+struct rtap_rx_xstats {
+ uint64_t size_bins[RTAP_NUM_PKT_SIZE_BUCKETS];
+ uint64_t broadcast_packets;
+ uint64_t multicast_packets;
+ uint64_t unicast_packets;
+ uint64_t lro_packets;
+ uint64_t checksum_good;
+ uint64_t checksum_none;
+ uint64_t checksum_bad;
+ uint64_t mbuf_alloc_failed;
+};
+
struct rtap_rx_queue {
struct rte_mempool *mb_pool; /* rx buffer pool */
struct io_uring io_ring; /* queue of posted reads */
@@ -45,8 +61,21 @@ struct rtap_rx_queue {
uint64_t rx_packets;
uint64_t rx_bytes;
uint64_t rx_errors;
+
+ struct rtap_rx_xstats xstats; /* extended statistics */
} __rte_cache_aligned;
+/* Extended statistics for Tx queues */
+struct rtap_tx_xstats {
+ uint64_t size_bins[RTAP_NUM_PKT_SIZE_BUCKETS];
+ uint64_t broadcast_packets;
+ uint64_t multicast_packets;
+ uint64_t unicast_packets;
+ uint64_t tso_packets;
+ uint64_t checksum_offload;
+ uint64_t multiseg_packets;
+};
+
struct rtap_tx_queue {
struct io_uring io_ring;
uint16_t port_id;
@@ -56,6 +85,8 @@ struct rtap_tx_queue {
uint64_t tx_packets;
uint64_t tx_bytes;
uint64_t tx_errors;
+
+ struct rtap_tx_xstats xstats; /* extended statistics */
} __rte_cache_aligned;
struct rtap_pmd {
@@ -108,4 +139,14 @@ void rtap_rx_intr_vec_uninstall(struct rte_eth_dev *dev);
int rtap_rx_queue_intr_enable(struct rte_eth_dev *dev, uint16_t queue_id);
int rtap_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id);
+/* rtap_xstats.c */
+int rtap_xstats_get_names(struct rte_eth_dev *dev,
+ struct rte_eth_xstat_name *xstats_names,
+ unsigned int limit);
+int rtap_xstats_get(struct rte_eth_dev *dev, struct rte_eth_xstat *xstats,
+ unsigned int n);
+int rtap_xstats_reset(struct rte_eth_dev *dev);
+void rtap_rx_xstats_update(struct rtap_rx_queue *rxq, struct rte_mbuf *mb);
+void rtap_tx_xstats_update(struct rtap_tx_queue *txq, struct rte_mbuf *mb);
+
#endif /* _RTAP_H_ */
diff --git a/drivers/net/rtap/rtap_ethdev.c b/drivers/net/rtap/rtap_ethdev.c
index 2cbb66b675..eb581608cc 100644
--- a/drivers/net/rtap/rtap_ethdev.c
+++ b/drivers/net/rtap/rtap_ethdev.c
@@ -541,6 +541,9 @@ static const struct eth_dev_ops rtap_ops = {
.allmulticast_disable = rtap_allmulticast_disable,
.stats_get = rtap_stats_get,
.stats_reset = rtap_stats_reset,
+ .xstats_get = rtap_xstats_get,
+ .xstats_get_names = rtap_xstats_get_names,
+ .xstats_reset = rtap_xstats_reset,
.rx_queue_setup = rtap_rx_queue_setup,
.rx_queue_release = rtap_rx_queue_release,
.tx_queue_setup = rtap_tx_queue_setup,
diff --git a/drivers/net/rtap/rtap_rxtx.c b/drivers/net/rtap/rtap_rxtx.c
index cd9b4f0bac..f0ea53b8cf 100644
--- a/drivers/net/rtap/rtap_rxtx.c
+++ b/drivers/net/rtap/rtap_rxtx.c
@@ -289,6 +289,7 @@ rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
PMD_RX_LOG(ERR, "Rx mbuf alloc failed");
dev->data->rx_mbuf_alloc_failed++;
+ rxq->xstats.mbuf_alloc_failed++;
nmb = mb; /* Reuse original */
goto resubmit;
@@ -317,6 +318,7 @@ rtap_rx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
mb->port = rxq->port_id;
__rte_mbuf_sanity_check(mb, 1);
+ rtap_rx_xstats_update(rxq, mb);
num_bytes += mb->pkt_len;
bufs[num_rx++] = mb;
@@ -735,6 +737,7 @@ rtap_tx_burst(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
io_uring_sqe_set_data(sqe, mb);
rtap_tx_offload(hdr, mb);
+ rtap_tx_xstats_update(txq, mb);
PMD_TX_LOG(DEBUG, "write m=%p segs=%u", mb, mb->nb_segs);
diff --git a/drivers/net/rtap/rtap_xstats.c b/drivers/net/rtap/rtap_xstats.c
new file mode 100644
index 0000000000..b5886844c5
--- /dev/null
+++ b/drivers/net/rtap/rtap_xstats.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2026 Stephen Hemminger
+ */
+
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <ethdev_driver.h>
+#include <rte_ethdev.h>
+#include <rte_ether.h>
+
+#include "rtap.h"
+
+/*
+ * Xstats name/offset descriptors, following the virtio PMD pattern.
+ *
+ * Both xstats_get_names and xstats_get iterate these same tables
+ * in the same per-queue order, guaranteeing name[i] matches value[i].
+ */
+struct rtap_xstats_name_off {
+ const char *name;
+ unsigned int offset;
+};
+
+#define RTAP_RXQ_XSTAT(field) { #field, offsetof(struct rtap_rx_xstats, field) }
+#define RTAP_TXQ_XSTAT(field) { #field, offsetof(struct rtap_tx_xstats, field) }
+
+static const struct rtap_xstats_name_off rtap_rxq_xstats[] = {
+ RTAP_RXQ_XSTAT(size_bins[0]),
+ RTAP_RXQ_XSTAT(size_bins[1]),
+ RTAP_RXQ_XSTAT(size_bins[2]),
+ RTAP_RXQ_XSTAT(size_bins[3]),
+ RTAP_RXQ_XSTAT(size_bins[4]),
+ RTAP_RXQ_XSTAT(size_bins[5]),
+ RTAP_RXQ_XSTAT(broadcast_packets),
+ RTAP_RXQ_XSTAT(multicast_packets),
+ RTAP_RXQ_XSTAT(unicast_packets),
+ RTAP_RXQ_XSTAT(lro_packets),
+ RTAP_RXQ_XSTAT(checksum_good),
+ RTAP_RXQ_XSTAT(checksum_none),
+ RTAP_RXQ_XSTAT(checksum_bad),
+ RTAP_RXQ_XSTAT(mbuf_alloc_failed),
+};
+
+static const struct rtap_xstats_name_off rtap_txq_xstats[] = {
+ RTAP_TXQ_XSTAT(size_bins[0]),
+ RTAP_TXQ_XSTAT(size_bins[1]),
+ RTAP_TXQ_XSTAT(size_bins[2]),
+ RTAP_TXQ_XSTAT(size_bins[3]),
+ RTAP_TXQ_XSTAT(size_bins[4]),
+ RTAP_TXQ_XSTAT(size_bins[5]),
+ RTAP_TXQ_XSTAT(broadcast_packets),
+ RTAP_TXQ_XSTAT(multicast_packets),
+ RTAP_TXQ_XSTAT(unicast_packets),
+ RTAP_TXQ_XSTAT(tso_packets),
+ RTAP_TXQ_XSTAT(checksum_offload),
+ RTAP_TXQ_XSTAT(multiseg_packets),
+};
+
+/* Display names for size buckets (indexed by array position) */
+static const char * const rtap_size_bucket_names[] = {
+ "size_64",
+ "size_65_to_127",
+ "size_128_to_255",
+ "size_256_to_511",
+ "size_512_to_1023",
+ "size_1024_to_1518",
+};
+
+/* Size bucket upper bounds for the update helpers */
+static const uint16_t rtap_size_bucket_limits[RTAP_NUM_PKT_SIZE_BUCKETS] = {
+ 64, 127, 255, 511, 1023, 1518,
+};
+
+#define RTAP_NUM_RXQ_XSTATS RTE_DIM(rtap_rxq_xstats)
+#define RTAP_NUM_TXQ_XSTATS RTE_DIM(rtap_txq_xstats)
+
+static unsigned int
+rtap_xstats_count(const struct rte_eth_dev *dev)
+{
+ return dev->data->nb_rx_queues * RTAP_NUM_RXQ_XSTATS +
+ dev->data->nb_tx_queues * RTAP_NUM_TXQ_XSTATS;
+}
+
+/*
+ * Build a display name for a per-queue xstat.
+ *
+ * For size_bins[N] entries, use the human-readable bucket name;
+ * for everything else, use the field name directly.
+ */
+static void
+rtap_xstat_name(char *buf, size_t bufsz,
+ const char *dir, unsigned int q,
+ const struct rtap_xstats_name_off *desc)
+{
+ /* Check if this is a size_bins entry */
+ for (unsigned int i = 0; i < RTAP_NUM_PKT_SIZE_BUCKETS; i++) {
+ char binref[32];
+
+ snprintf(binref, sizeof(binref), "size_bins[%u]", i);
+ if (strcmp(desc->name, binref) == 0) {
+ snprintf(buf, bufsz, "%s_q%u_%s_packets",
+ dir, q, rtap_size_bucket_names[i]);
+ return;
+ }
+ }
+
+ snprintf(buf, bufsz, "%s_q%u_%s", dir, q, desc->name);
+}
+
+int
+rtap_xstats_get_names(struct rte_eth_dev *dev,
+ struct rte_eth_xstat_name *xstats_names,
+ unsigned int limit)
+{
+ unsigned int nb_rx = dev->data->nb_rx_queues;
+ unsigned int nb_tx = dev->data->nb_tx_queues;
+ unsigned int count = rtap_xstats_count(dev);
+ unsigned int idx = 0;
+
+ if (xstats_names == NULL)
+ return count;
+
+ /* Rx queue stats: all stats for queue 0, then all for queue 1, ... */
+ for (unsigned int q = 0; q < nb_rx; q++) {
+ for (unsigned int i = 0; i < RTAP_NUM_RXQ_XSTATS; i++) {
+ if (idx >= limit)
+ goto out;
+
+ rtap_xstat_name(xstats_names[idx].name,
+ sizeof(xstats_names[idx].name),
+ "rx", q, &rtap_rxq_xstats[i]);
+ idx++;
+ }
+ }
+
+ /* Tx queue stats: all stats for queue 0, then all for queue 1, ... */
+ for (unsigned int q = 0; q < nb_tx; q++) {
+ for (unsigned int i = 0; i < RTAP_NUM_TXQ_XSTATS; i++) {
+ if (idx >= limit)
+ goto out;
+
+ rtap_xstat_name(xstats_names[idx].name,
+ sizeof(xstats_names[idx].name),
+ "tx", q, &rtap_txq_xstats[i]);
+ idx++;
+ }
+ }
+
+out:
+ return count;
+}
+
+int
+rtap_xstats_get(struct rte_eth_dev *dev, struct rte_eth_xstat *xstats,
+ unsigned int n)
+{
+ unsigned int nb_rx = dev->data->nb_rx_queues;
+ unsigned int nb_tx = dev->data->nb_tx_queues;
+ unsigned int count = rtap_xstats_count(dev);
+ unsigned int idx = 0;
+
+ if (n < count)
+ return count;
+
+ /* Collect Rx queue xstats — same per-queue order as names */
+ for (unsigned int q = 0; q < nb_rx; q++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[q];
+
+ for (unsigned int i = 0; i < RTAP_NUM_RXQ_XSTATS; i++) {
+ xstats[idx].id = idx;
+ if (rxq == NULL)
+ xstats[idx].value = 0;
+ else
+ xstats[idx].value =
+ *(uint64_t *)((char *)&rxq->xstats +
+ rtap_rxq_xstats[i].offset);
+ idx++;
+ }
+ }
+
+ /* Collect Tx queue xstats - same per-queue order as names */
+ for (unsigned int q = 0; q < nb_tx; q++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[q];
+
+ for (unsigned int i = 0; i < RTAP_NUM_TXQ_XSTATS; i++) {
+ xstats[idx].id = idx;
+ if (txq == NULL)
+ xstats[idx].value = 0;
+ else
+ xstats[idx].value =
+ *(uint64_t *)((char *)&txq->xstats +
+ rtap_txq_xstats[i].offset);
+ idx++;
+ }
+ }
+
+ return idx;
+}
+
+int
+rtap_xstats_reset(struct rte_eth_dev *dev)
+{
+ for (unsigned int q = 0; q < dev->data->nb_rx_queues; q++) {
+ struct rtap_rx_queue *rxq = dev->data->rx_queues[q];
+ if (rxq != NULL)
+ memset(&rxq->xstats, 0, sizeof(rxq->xstats));
+ }
+
+ for (unsigned int q = 0; q < dev->data->nb_tx_queues; q++) {
+ struct rtap_tx_queue *txq = dev->data->tx_queues[q];
+ if (txq != NULL)
+ memset(&txq->xstats, 0, sizeof(txq->xstats));
+ }
+
+ return 0;
+}
+
+ /* Helper to update Rx xstats - called from rx_burst */
+void
+rtap_rx_xstats_update(struct rtap_rx_queue *rxq, struct rte_mbuf *mb)
+{
+ struct rtap_rx_xstats *xs = &rxq->xstats;
+ uint16_t pkt_len = mb->pkt_len;
+ struct rte_ether_hdr *eth_hdr;
+
+ /* Update size bucket */
+ for (unsigned int i = 0; i < RTAP_NUM_PKT_SIZE_BUCKETS; i++) {
+ if (pkt_len <= rtap_size_bucket_limits[i]) {
+ xs->size_bins[i]++;
+ break;
+ }
+ }
+
+ /* Update packet type counters */
+ eth_hdr = rte_pktmbuf_mtod(mb, struct rte_ether_hdr *);
+ if (rte_is_broadcast_ether_addr(&eth_hdr->dst_addr))
+ xs->broadcast_packets++;
+ else if (rte_is_multicast_ether_addr(&eth_hdr->dst_addr))
+ xs->multicast_packets++;
+ else
+ xs->unicast_packets++;
+
+ /* Update offload-related counters */
+ if (mb->ol_flags & RTE_MBUF_F_RX_LRO)
+ xs->lro_packets++;
+
+ if (mb->ol_flags & RTE_MBUF_F_RX_L4_CKSUM_GOOD)
+ xs->checksum_good++;
+ else if (mb->ol_flags & RTE_MBUF_F_RX_L4_CKSUM_NONE)
+ xs->checksum_none++;
+ else if (mb->ol_flags & RTE_MBUF_F_RX_L4_CKSUM_BAD)
+ xs->checksum_bad++;
+}
+
+ /* Helper to update Tx xstats - called from tx_burst */
+void
+rtap_tx_xstats_update(struct rtap_tx_queue *txq, struct rte_mbuf *mb)
+{
+ struct rtap_tx_xstats *xs = &txq->xstats;
+ uint16_t pkt_len = mb->pkt_len;
+ struct rte_ether_hdr *eth_hdr;
+
+ /* Update size bucket */
+ for (unsigned int i = 0; i < RTAP_NUM_PKT_SIZE_BUCKETS; i++) {
+ if (pkt_len <= rtap_size_bucket_limits[i]) {
+ xs->size_bins[i]++;
+ break;
+ }
+ }
+
+ /* Update packet type counters */
+ eth_hdr = rte_pktmbuf_mtod(mb, struct rte_ether_hdr *);
+ if (rte_is_broadcast_ether_addr(&eth_hdr->dst_addr))
+ xs->broadcast_packets++;
+ else if (rte_is_multicast_ether_addr(&eth_hdr->dst_addr))
+ xs->multicast_packets++;
+ else
+ xs->unicast_packets++;
+
+ /* Update offload-related counters */
+ if (mb->ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ xs->tso_packets++;
+
+ if ((mb->ol_flags & RTE_MBUF_F_TX_L4_MASK) != 0)
+ xs->checksum_offload++;
+
+ if (mb->nb_segs > 1)
+ xs->multiseg_packets++;
+}
--
2.51.0
* [PATCH v6 11/11] test: add unit tests for rtap PMD
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (9 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 10/11] net/rtap: add extended statistics support Stephen Hemminger
@ 2026-02-14 23:44 ` Stephen Hemminger
2026-02-15 8:58 ` [PATCH v6 00/11] net/rtap: add io_uring based TAP driver Konstantin Ananyev
11 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-14 23:44 UTC (permalink / raw)
To: dev; +Cc: Stephen Hemminger
Add a comprehensive test suite for the rtap poll mode driver covering:
- Device creation and configuration
- Device info query
- Link status and link up/down control
- Statistics collection and reset
- Promiscuous and all-multicast mode toggling
- MAC address and MTU configuration
- Multi-queue setup, teardown, and queue count reduction
- Mismatched Rx/Tx queue count rejection
- Queue reconfiguration (stop, reconfigure, restart)
- Rx path: inject packet via AF_PACKET, receive via DPDK
- Tx path: transmit via DPDK, capture via AF_PACKET
- Multi-segment Tx with chained mbufs
- Multi-segment Rx with small-buffer mempool
- Offload capability reporting and configuration
- Tx UDP checksum offload end-to-end verification
- File descriptor leak detection after device close
- Command-line created device testing
Tests require CAP_NET_ADMIN or root privileges.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
app/test/meson.build | 1 +
app/test/test_pmd_rtap.c | 2620 ++++++++++++++++++++++++++++++++++++++
2 files changed, 2621 insertions(+)
create mode 100644 app/test/test_pmd_rtap.c
diff --git a/app/test/meson.build b/app/test/meson.build
index 48874037eb..5855077066 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ source_file_deps = {
'test_pie.c': ['sched'],
'test_pmd_perf.c': ['ethdev', 'net'] + packet_burst_generator_deps,
'test_pmd_ring.c': ['net_ring', 'ethdev', 'bus_vdev'],
+ 'test_pmd_rtap.c': ['net_rtap', 'ethdev', 'bus_vdev'],
'test_pmd_ring_perf.c': ['ethdev', 'net_ring', 'bus_vdev'],
'test_pmu.c': ['pmu'],
'test_power.c': ['power', 'power_acpi', 'power_kvm_vm', 'power_intel_pstate',
diff --git a/app/test/test_pmd_rtap.c b/app/test/test_pmd_rtap.c
new file mode 100644
index 0000000000..a78adcd647
--- /dev/null
+++ b/app/test/test_pmd_rtap.c
@@ -0,0 +1,2620 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Stephen Hemminger
+ */
+#include <errno.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <sys/ioctl.h>
+#include <sys/select.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <rte_common.h>
+#include <rte_stdatomic.h>
+#include <rte_ethdev.h>
+#include <rte_bus_vdev.h>
+#include <rte_ether.h>
+#include <rte_errno.h>
+#include <rte_mbuf.h>
+#include <rte_mempool.h>
+#include <rte_ip.h>
+#include <rte_udp.h>
+#include <rte_epoll.h>
+
+#include "test.h"
+
+#define SOCKET0 0
+#define RING_SIZE 256
+#define NB_MBUF 4096
+#define MAX_MULTI_QUEUES 4
+
+#define RTAP_DRIVER_NAME "net_rtap"
+#define TEST_TAP_NAME "rtap_test0"
+
+/* TX/RX test parameters */
+#define TEST_PKT_PAYLOAD_LEN 64
+#define TEST_MAGIC_BYTE 0xAB
+#define RX_BURST_MAX 32
+#define TX_RX_TIMEOUT_US 100000 /* 100ms */
+#define TX_RX_POLL_US 1000 /* 1ms between polls */
+
+static struct rte_mempool *mp;
+static int rtap_port0 = -1;
+static int rtap_port1 = -1;
+
+/* ========== Helper Functions ========== */
+
+static int
+check_rtap_available(void)
+{
+ int fd = open("/dev/net/tun", O_RDWR);
+ if (fd < 0) {
+ printf("Cannot access /dev/net/tun: %s\n", strerror(errno));
+ printf("RTAP PMD tests require CAP_NET_ADMIN or root privileges\n");
+ return -1;
+ }
+ close(fd);
+ return 0;
+}
+
+/* Configure port with specified number of queue pairs */
+static int
+port_configure(int port, uint16_t nb_queues, const struct rte_eth_conf *conf)
+{
+ struct rte_eth_conf null_conf;
+
+ if (conf == NULL) {
+ memset(&null_conf, 0, sizeof(null_conf));
+ conf = &null_conf;
+ }
+
+ if (rte_eth_dev_configure(port, nb_queues, nb_queues, conf) < 0) {
+ printf("Configure failed for port %d with %u queues\n",
+ port, nb_queues);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Setup queue pairs for a port */
+static int
+port_setup_queues(int port, uint16_t nb_queues, uint16_t ring_size,
+ struct rte_mempool *mempool)
+{
+ for (uint16_t q = 0; q < nb_queues; q++) {
+ if (rte_eth_tx_queue_setup(port, q, ring_size, SOCKET0, NULL) < 0) {
+ printf("TX queue %u setup failed port %d\n", q, port);
+ return -1;
+ }
+
+ if (rte_eth_rx_queue_setup(port, q, ring_size, SOCKET0,
+ NULL, mempool) < 0) {
+ printf("RX queue %u setup failed port %d\n", q, port);
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+/* Stop, configure, setup queues, and start a port */
+static int
+port_reconfigure(int port, uint16_t nb_queues, const struct rte_eth_conf *conf,
+ uint16_t ring_size, struct rte_mempool *mempool)
+{
+ int ret;
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0) {
+ printf("Error stopping port %d: %s\n", port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (port_configure(port, nb_queues, conf) < 0)
+ return -1;
+
+ if (port_setup_queues(port, nb_queues, ring_size, mempool) < 0)
+ return -1;
+
+ if (rte_eth_dev_start(port) < 0) {
+ printf("Error starting port %d\n", port);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Restore port to clean single-queue started state */
+static int
+restore_single_queue(int port)
+{
+ return port_reconfigure(port, 1, NULL, RING_SIZE, mp);
+}
+
+/* Verify link status matches expected */
+static int
+verify_link_status(int port, uint8_t expected_status)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Error getting link status: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ if (link.link_status != expected_status) {
+ printf("Error: link should be %s but is %s\n",
+ expected_status ? "UP" : "DOWN",
+ link.link_status ? "UP" : "DOWN");
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Get device info with error checking */
+static int
+get_dev_info(int port, struct rte_eth_dev_info *dev_info)
+{
+ int ret = rte_eth_dev_info_get(port, dev_info);
+ if (ret != 0) {
+ printf("Error getting device info for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+ return 0;
+}
+
+/* Reset and verify stats are zero */
+static int
+reset_and_verify_stats_zero(int port)
+{
+ struct rte_eth_stats stats;
+ int ret;
+
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf("Error: stats_reset failed for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ ret = rte_eth_stats_get(port, &stats);
+ if (ret != 0) {
+ printf("Error: stats_get failed for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (stats.ipackets != 0 || stats.opackets != 0 ||
+ stats.ibytes != 0 || stats.obytes != 0 ||
+ stats.ierrors != 0 || stats.oerrors != 0) {
+ printf("Error: port %d stats are not zero after reset\n", port);
+ return -1;
+ }
+
+ return 0;
+}
+
+/* Drain all pending RX packets from a port */
+static void
+drain_rx_queue(int port, uint16_t queue_id)
+{
+ struct rte_mbuf *drain[RX_BURST_MAX];
+ uint16_t n;
+
+ do {
+ n = rte_eth_rx_burst(port, queue_id, drain, RX_BURST_MAX);
+ rte_pktmbuf_free_bulk(drain, n);
+ } while (n > 0);
+}
+
+/* Set Ethernet address to broadcast */
+static inline void
+eth_addr_bcast(struct rte_ether_addr *addr)
+{
+ memset(addr, 0xff, RTE_ETHER_ADDR_LEN);
+}
+
+/* Bring TAP interface up using ioctl */
+static int
+tap_set_up(const char *ifname)
+{
+ struct ifreq ifr;
+ int sock, ret = -1;
+
+ sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0)
+ return -1;
+
+ memset(&ifr, 0, sizeof(ifr));
+ strlcpy(ifr.ifr_name, ifname, IFNAMSIZ);
+
+ if (ioctl(sock, SIOCGIFFLAGS, &ifr) < 0)
+ goto out;
+
+ ifr.ifr_flags |= IFF_UP;
+
+ if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0)
+ goto out;
+
+ ret = 0;
+out:
+ close(sock);
+ return ret;
+}
+
+/* Open an AF_PACKET socket bound to the TAP interface */
+static int
+open_tap_socket(const char *ifname)
+{
+ int sock, ifindex;
+ struct sockaddr_ll sll;
+
+ sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (sock < 0) {
+ printf("socket(AF_PACKET) failed: %s\n", strerror(errno));
+ return -1;
+ }
+
+ ifindex = if_nametoindex(ifname);
+ if (ifindex == 0) {
+ printf("if_nametoindex(%s) failed: %s\n", ifname, strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ memset(&sll, 0, sizeof(sll));
+ sll.sll_family = AF_PACKET;
+ sll.sll_ifindex = ifindex;
+ sll.sll_protocol = htons(ETH_P_ALL);
+
+ if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+ printf("bind() failed: %s\n", strerror(errno));
+ close(sock);
+ return -1;
+ }
+
+ return sock;
+}
+
+/* Setup TAP socket with non-blocking mode and bring interface up */
+static int
+setup_tap_socket_nb(const char *ifname)
+{
+ int sock, flags;
+
+ if (tap_set_up(ifname) < 0) {
+ printf("Failed to bring TAP interface up\n");
+ return -1;
+ }
+
+ sock = open_tap_socket(ifname);
+ if (sock < 0)
+ return -1;
+
+ flags = fcntl(sock, F_GETFL, 0);
+ fcntl(sock, F_SETFL, flags | O_NONBLOCK);
+
+ return sock;
+}
+
+/* Build a basic test packet with broadcast dest and magic byte payload */
+static void
+build_test_packet(uint8_t *pkt, size_t pkt_len,
+ const struct rte_ether_addr *src_mac,
+ const struct rte_ether_addr *dst_mac)
+{
+ struct rte_ether_hdr *eth = (struct rte_ether_hdr *)pkt;
+
+ if (dst_mac)
+ memcpy(&eth->dst_addr, dst_mac, RTE_ETHER_ADDR_LEN);
+ else
+ eth_addr_bcast(&eth->dst_addr);
+
+ if (src_mac)
+ memcpy(&eth->src_addr, src_mac, RTE_ETHER_ADDR_LEN);
+ else
+ rte_eth_random_addr(eth->src_addr.addr_bytes);
+
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+ memset(pkt + RTE_ETHER_HDR_LEN, TEST_MAGIC_BYTE,
+ pkt_len - RTE_ETHER_HDR_LEN);
+}
+
+/* Poll AF_PACKET socket for a packet matching the given pattern */
+static ssize_t
+poll_tap_socket(int sock, uint8_t *buf, size_t buf_size,
+ uint8_t match_byte, size_t match_offset)
+{
+ struct timeval tv;
+ fd_set fds;
+ int elapsed = 0;
+ ssize_t rx_len;
+
+ while (elapsed < TX_RX_TIMEOUT_US) {
+ FD_ZERO(&fds);
+ FD_SET(sock, &fds);
+ tv.tv_sec = 0;
+ tv.tv_usec = TX_RX_POLL_US;
+
+ if (select(sock + 1, &fds, NULL, NULL, &tv) > 0) {
+ rx_len = recv(sock, buf, buf_size, 0);
+ if (rx_len > 0 && (size_t)rx_len > match_offset &&
+ buf[match_offset] == match_byte)
+ return rx_len;
+ }
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return 0; /* Timeout */
+}
+
+/* Receive packets from DPDK port, filtering for our test packets */
+static uint16_t
+receive_test_packets(int port, uint16_t queue_id, struct rte_mbuf **rx_mbufs,
+ uint16_t max_pkts, size_t expected_len, uint8_t magic_byte)
+{
+ uint16_t nb_rx = 0;
+ int elapsed = 0;
+
+ while (elapsed < TX_RX_TIMEOUT_US && nb_rx < max_pkts) {
+ struct rte_mbuf *burst[RX_BURST_MAX];
+ uint16_t n = rte_eth_rx_burst(port, queue_id, burst, RX_BURST_MAX);
+
+ for (uint16_t i = 0; i < n; i++) {
+ uint8_t *d = rte_pktmbuf_mtod(burst[i], uint8_t *);
+
+ if (burst[i]->pkt_len == expected_len &&
+ d[RTE_ETHER_HDR_LEN] == magic_byte) {
+ rx_mbufs[nb_rx++] = burst[i];
+ if (nb_rx >= max_pkts)
+ break;
+ } else {
+ rte_pktmbuf_free(burst[i]);
+ }
+ }
+
+ if (nb_rx > 0)
+ break;
+
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return nb_rx;
+}
+
+/* Wait for event with timeout using polling */
+static int
+wait_for_event(RTE_ATOMIC(int) *event_flag, int initial_count, int timeout_us)
+{
+ int elapsed = 0;
+
+ while (elapsed < timeout_us) {
+ if (rte_atomic_load_explicit(event_flag, rte_memory_order_seq_cst) > initial_count)
+ return 0;
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ return -1; /* Timeout */
+}
+
+/* Count open file descriptors */
+static int
+count_open_fds(void)
+{
+ DIR *d;
+ struct dirent *de;
+ int count = 0;
+
+ d = opendir("/proc/self/fd");
+ if (d == NULL)
+ return -1;
+
+ while ((de = readdir(d)) != NULL) {
+ if (de->d_name[0] != '.')
+ count++;
+ }
+ closedir(d);
+ return count - 1; /* Subtract dirfd itself */
+}
+
+/* ========== Test Functions ========== */
+
+static int
+test_ethdev_configure_port(int port)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ if (port_reconfigure(port, 1, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Link get failed for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_get_stats(int port)
+{
+ printf("Testing rtap PMD stats_get port %d\n", port);
+ return reset_and_verify_stats_zero(port);
+}
+
+static int
+test_stats_reset(int port)
+{
+ printf("Testing rtap PMD stats_reset port %d\n", port);
+ return reset_and_verify_stats_zero(port);
+}
+
+static int
+test_dev_info(int port)
+{
+ struct rte_eth_dev_info dev_info;
+
+ printf("Testing rtap PMD dev_info_get port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ if (dev_info.max_rx_queues == 0 || dev_info.max_tx_queues == 0) {
+ printf("Error: invalid max queue values\n");
+ return -1;
+ }
+
+ if (dev_info.max_mac_addrs != 1) {
+ printf("Error: expected max_mac_addrs = 1, got %u\n",
+ dev_info.max_mac_addrs);
+ return -1;
+ }
+
+ printf(" driver_name: %s\n", dev_info.driver_name);
+ printf(" if_index: %u\n", dev_info.if_index);
+ printf(" max_rx_queues: %u\n", dev_info.max_rx_queues);
+ printf(" max_tx_queues: %u\n", dev_info.max_tx_queues);
+
+ return 0;
+}
+
+static int
+test_link_status(int port)
+{
+ struct rte_eth_link link;
+ int ret;
+
+ printf("Testing rtap PMD link status port %d\n", port);
+
+ ret = rte_eth_link_get(port, &link);
+ if (ret < 0) {
+ printf("Error getting link status for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" link_status: %s\n", link.link_status ? "UP" : "DOWN");
+ printf(" link_speed: %u\n", link.link_speed);
+ printf(" link_duplex: %s\n",
+ link.link_duplex ? "full-duplex" : "half-duplex");
+
+ return 0;
+}
+
+static int
+test_set_link_up_down(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD link up/down port %d\n", port);
+
+ ret = rte_eth_dev_set_link_down(port);
+ if (ret < 0) {
+ printf("Error setting link down for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_DOWN) < 0)
+ return -1;
+
+ ret = rte_eth_dev_set_link_up(port);
+ if (ret < 0) {
+ printf("Error setting link up for port %d: %s\n",
+ port, rte_strerror(-ret));
+ return -1;
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0)
+ return -1;
+
+ return 0;
+}
+
+static int
+test_promiscuous_mode(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD promiscuous mode port %d\n", port);
+
+ ret = rte_eth_promiscuous_enable(port);
+ if (ret < 0) {
+ printf("Error enabling promiscuous mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_promiscuous_get(port) != 1) {
+ printf("Error: promiscuous mode should be enabled\n");
+ return -1;
+ }
+
+ ret = rte_eth_promiscuous_disable(port);
+ if (ret < 0) {
+ printf("Error disabling promiscuous mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_promiscuous_get(port) != 0) {
+ printf("Error: promiscuous mode should be disabled\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_allmulticast_mode(int port)
+{
+ int ret;
+
+ printf("Testing rtap PMD allmulticast mode port %d\n", port);
+
+ ret = rte_eth_allmulticast_enable(port);
+ if (ret < 0) {
+ printf("Error enabling allmulticast mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_allmulticast_get(port) != 1) {
+ printf("Error: allmulticast mode should be enabled\n");
+ return -1;
+ }
+
+ ret = rte_eth_allmulticast_disable(port);
+ if (ret < 0) {
+ printf("Error disabling allmulticast mode: %s\n",
+ rte_strerror(-ret));
+ return -1;
+ }
+
+ if (rte_eth_allmulticast_get(port) != 0) {
+ printf("Error: allmulticast mode should be disabled\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_mac_address(int port)
+{
+ struct rte_ether_addr mac_addr;
+ struct rte_ether_addr new_mac = {
+ .addr_bytes = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}
+ };
+ int ret;
+
+ printf("Testing rtap PMD MAC address port %d\n", port);
+
+ ret = rte_eth_macaddr_get(port, &mac_addr);
+ if (ret < 0) {
+ printf("Error getting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Current MAC: " RTE_ETHER_ADDR_PRT_FMT "\n",
+ RTE_ETHER_ADDR_BYTES(&mac_addr));
+
+ ret = rte_eth_dev_default_mac_addr_set(port, &new_mac);
+ if (ret < 0) {
+ printf("Error setting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ ret = rte_eth_macaddr_get(port, &mac_addr);
+ if (ret < 0) {
+ printf("Error getting MAC address: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ if (!rte_is_same_ether_addr(&mac_addr, &new_mac)) {
+ printf("Error: MAC address not set correctly\n");
+ return -1;
+ }
+
+ printf(" New MAC: " RTE_ETHER_ADDR_PRT_FMT "\n",
+ RTE_ETHER_ADDR_BYTES(&mac_addr));
+
+ return 0;
+}
+
+static int
+test_mtu_set(int port)
+{
+ uint16_t orig_mtu;
+ uint16_t new_mtu = 9000;
+ int ret;
+
+ printf("Testing rtap PMD MTU set port %d\n", port);
+
+ ret = rte_eth_dev_get_mtu(port, &orig_mtu);
+ if (ret < 0) {
+ printf("Error getting MTU: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Original MTU: %u\n", orig_mtu);
+
+ ret = rte_eth_dev_set_mtu(port, new_mtu);
+ if (ret < 0) {
+ printf("Warning: setting MTU to %u failed: %s\n",
+ new_mtu, rte_strerror(-ret));
+ return 0;
+ }
+
+ uint16_t current_mtu;
+ ret = rte_eth_dev_get_mtu(port, &current_mtu);
+ if (ret < 0) {
+ printf("Error getting MTU: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" New MTU: %u\n", current_mtu);
+ rte_eth_dev_set_mtu(port, orig_mtu);
+
+ return 0;
+}
+
+static int
+test_queue_reconfigure(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ int ret;
+
+ printf("Testing rtap PMD queue reconfigure port %d\n", port);
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0) {
+ printf("Error getting device info: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ printf(" Configured with %u rx and %u tx queues\n",
+ dev_info.nb_rx_queues, dev_info.nb_tx_queues);
+
+ if (restore_single_queue(port) < 0)
+ return -1;
+
+ return 0;
+}
+
+static int
+test_multiqueue(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ uint16_t queue_counts[] = { 1, 2, MAX_MULTI_QUEUES };
+
+ printf("Testing rtap PMD multi-queue port %d\n", port);
+
+ for (unsigned int t = 0; t < RTE_DIM(queue_counts); t++) {
+ uint16_t nb_queues = queue_counts[t];
+
+ printf(" Configuring %u queue pair(s)\n", nb_queues);
+
+ if (port_reconfigure(port, nb_queues, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ if (dev_info.nb_rx_queues != nb_queues ||
+ dev_info.nb_tx_queues != nb_queues) {
+ printf("Error: queue count mismatch\n");
+ return -1;
+ }
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ /* Verify per-queue xstats are zero */
+ int num_xstats = rte_eth_xstats_get(port, NULL, 0);
+ if (num_xstats > 0) {
+ struct rte_eth_xstat *xstats = malloc(sizeof(*xstats) * num_xstats);
+ struct rte_eth_xstat_name *xstat_names =
+ malloc(sizeof(*xstat_names) * num_xstats);
+
+ if (xstats == NULL || xstat_names == NULL) {
+ free(xstats);
+ free(xstat_names);
+ printf("Error: xstats alloc failed\n");
+ return -1;
+ }
+
+ rte_eth_xstats_get_names(port, xstat_names, num_xstats);
+ rte_eth_xstats_get(port, xstats, num_xstats);
+
+ for (int x = 0; x < num_xstats; x++) {
+ if (strstr(xstat_names[x].name, "_q") != NULL &&
+ xstats[x].value != 0) {
+ printf("Error: xstat %s = %" PRIu64 " not zero\n",
+ xstat_names[x].name, xstats[x].value);
+ free(xstats);
+ free(xstat_names);
+ return -1;
+ }
+ }
+ free(xstats);
+ free(xstat_names);
+ }
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0)
+ return -1;
+
+ printf(" %u queue pair(s): OK\n", nb_queues);
+ }
+
+ if (restore_single_queue(port) < 0) {
+ printf("Error restoring single queue\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_multiqueue_reduce(int port)
+{
+ printf("Testing rtap PMD queue reduction port %d\n", port);
+
+ if (port_reconfigure(port, MAX_MULTI_QUEUES, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ printf(" Started with %d queues, reducing to 2\n", MAX_MULTI_QUEUES);
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ return -1;
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ printf(" Reduced to 2 queues: OK\n");
+ printf(" Reducing to 1 queue\n");
+
+ if (restore_single_queue(port) < 0)
+ return -1;
+
+ if (reset_and_verify_stats_zero(port) < 0)
+ return -1;
+
+ printf(" Reduced to 1 queue: OK\n");
+ return 0;
+}
+
+static int
+test_multiqueue_mismatch(int port)
+{
+ int ret;
+ struct { uint16_t rx; uint16_t tx; } mismatch[] = {
+ { 1, 2 }, { 2, 1 }, { 4, 2 }, { 1, 4 },
+ };
+
+ printf("Testing rtap PMD mismatched queue rejection port %d\n", port);
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0) {
+ printf("Error stopping port: %s\n", rte_strerror(-ret));
+ return -1;
+ }
+
+ for (unsigned int i = 0; i < RTE_DIM(mismatch); i++) {
+ struct rte_eth_conf null_conf;
+ memset(&null_conf, 0, sizeof(null_conf));
+
+ ret = rte_eth_dev_configure(port, mismatch[i].rx,
+ mismatch[i].tx, &null_conf);
+ if (ret == 0) {
+ printf("Error: configure(%u rx, %u tx) should fail\n",
+ mismatch[i].rx, mismatch[i].tx);
+ rte_eth_dev_stop(port);
+ return -1;
+ }
+ printf(" Rejected %u rx / %u tx: OK\n",
+ mismatch[i].rx, mismatch[i].tx);
+ }
+
+ if (restore_single_queue(port) < 0) {
+ printf("Error restoring single queue\n");
+ return -1;
+ }
+
+ return 0;
+}
+
+static int
+test_rx_inject(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+ uint8_t pkt[RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN];
+ int sock = -1;
+ uint16_t nb_rx;
+ int ret = -1;
+
+ printf("Testing rtap PMD RX (inject via AF_PACKET)\n");
+
+ if (restore_single_queue(port) < 0) {
+ printf("Failed to restore single queue config\n");
+ return -1;
+ }
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+ drain_rx_queue(port, 0);
+ rte_eth_stats_reset(port);
+
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto out;
+ }
+
+ nb_rx = receive_test_packets(port, 0, rx_mbufs, 1, sizeof(pkt),
+ TEST_MAGIC_BYTE);
+
+ if (nb_rx == 0) {
+ printf("No packet received after %d us\n", TX_RX_TIMEOUT_US);
+ goto out;
+ }
+
+ uint8_t *rx_data = rte_pktmbuf_mtod(rx_mbufs[0], uint8_t *);
+ if (rx_data[RTE_ETHER_HDR_LEN] != TEST_MAGIC_BYTE) {
+ printf("Payload mismatch\n");
+ goto free_rx;
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.ipackets == 0) {
+ printf("RX stats not updated\n");
+ goto free_rx;
+ }
+
+ printf(" RX inject test PASSED (received %u packets)\n", nb_rx);
+ ret = 0;
+
+free_rx:
+ for (uint16_t i = 0; i < nb_rx; i++)
+ rte_pktmbuf_free(rx_mbufs[i]);
+out:
+ close(sock);
+ return ret;
+}
+
+static int
+test_tx_capture(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *tx_mbuf;
+ struct rte_ether_hdr *eth;
+ uint8_t rx_buf[256];
+ int sock = -1;
+ uint16_t nb_tx;
+ ssize_t rx_len;
+ int ret = -1;
+
+ printf("Testing rtap PMD TX (capture via AF_PACKET)\n");
+
+ if (restore_single_queue(port) < 0) {
+ printf("Failed to restore single queue config\n");
+ return -1;
+ }
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ tx_mbuf = rte_pktmbuf_alloc(mp);
+ if (tx_mbuf == NULL) {
+ printf("Failed to allocate mbuf\n");
+ goto out;
+ }
+
+ eth = rte_pktmbuf_mtod(tx_mbuf, struct rte_ether_hdr *);
+ eth_addr_bcast(&eth->dst_addr);
+ memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ uint8_t *payload = (uint8_t *)(eth + 1);
+ memset(payload, TEST_MAGIC_BYTE, TEST_PKT_PAYLOAD_LEN);
+
+ tx_mbuf->data_len = RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN;
+ tx_mbuf->pkt_len = tx_mbuf->data_len;
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &tx_mbuf, 1);
+ if (nb_tx != 1) {
+ printf("TX failed\n");
+ rte_pktmbuf_free(tx_mbuf);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ TEST_MAGIC_BYTE, RTE_ETHER_HDR_LEN);
+
+ if (rx_len <= 0) {
+ printf("No packet captured after %d us\n", TX_RX_TIMEOUT_US);
+ goto out;
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" TX capture test PASSED (captured %zd bytes)\n", rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define MSEG_NUM_SEGS 3
+#define MSEG_SEG_LEN 40
+
+static int
+test_tx_multiseg(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *head, *seg, *prev;
+ struct rte_ether_hdr *eth;
+ uint8_t rx_buf[512];
+ int sock = -1;
+ uint16_t nb_tx;
+ ssize_t rx_len;
+ int ret = -1;
+ uint16_t total_payload = MSEG_NUM_SEGS * MSEG_SEG_LEN;
+
+ printf("Testing rtap PMD multi-segment TX\n");
+
+ if (restore_single_queue(port) < 0 ||
+ rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to setup test\n");
+ return -1;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ head = rte_pktmbuf_alloc(mp);
+ if (head == NULL) {
+ printf("Failed to allocate head mbuf\n");
+ goto out;
+ }
+
+ eth = rte_pktmbuf_mtod(head, struct rte_ether_hdr *);
+ eth_addr_bcast(&eth->dst_addr);
+ memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ uint8_t *p = (uint8_t *)(eth + 1);
+ memset(p, 0xA0, MSEG_SEG_LEN);
+ head->data_len = RTE_ETHER_HDR_LEN + MSEG_SEG_LEN;
+ head->pkt_len = RTE_ETHER_HDR_LEN + total_payload;
+ head->nb_segs = MSEG_NUM_SEGS;
+
+ prev = head;
+ for (int i = 1; i < MSEG_NUM_SEGS; i++) {
+ seg = rte_pktmbuf_alloc(mp);
+ if (seg == NULL) {
+ printf("Failed to allocate segment %d\n", i);
+ rte_pktmbuf_free(head);
+ goto out;
+ }
+ p = rte_pktmbuf_mtod(seg, uint8_t *);
+ memset(p, 0xA0 + i, MSEG_SEG_LEN);
+ seg->data_len = MSEG_SEG_LEN;
+ prev->next = seg;
+ prev = seg;
+ }
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &head, 1);
+ if (nb_tx != 1) {
+ printf("TX failed for multi-seg packet\n");
+ rte_pktmbuf_free(head);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ 0xA0, RTE_ETHER_HDR_LEN);
+
+ if (rx_len <= 0) {
+ printf("No packet captured\n");
+ goto out;
+ }
+
+ for (int i = 0; i < MSEG_NUM_SEGS; i++) {
+ int off = RTE_ETHER_HDR_LEN + i * MSEG_SEG_LEN;
+ uint8_t expected = 0xA0 + i;
+ if (rx_buf[off] != expected) {
+ printf("Segment %d mismatch\n", i);
+ goto out;
+ }
+ }
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" Multi-seg TX test PASSED (%d segs, captured %zd bytes)\n",
+ MSEG_NUM_SEGS, rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define MSEG_RX_BUF_SIZE 256
+#define MSEG_RX_POOL_SIZE 4096
+#define MSEG_RX_PKT_PAYLOAD 200
+
+static int
+test_rx_multiseg(int port)
+{
+ struct rte_mempool *small_mp = NULL;
+ struct rte_ether_addr mac;
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+ uint8_t pkt[RTE_ETHER_HDR_LEN + MSEG_RX_PKT_PAYLOAD];
+ int sock = -1;
+ uint16_t nb_rx;
+ int ret = -1;
+
+ printf("Testing rtap PMD multi-segment RX\n");
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ small_mp = rte_pktmbuf_pool_create("small_mbuf_pool",
+ MSEG_RX_POOL_SIZE, 32, 0, MSEG_RX_BUF_SIZE,
+ rte_socket_id());
+ if (small_mp == NULL) {
+ printf("Failed to create small mempool\n");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 1, NULL, RING_SIZE, small_mp) < 0)
+ goto free_pool;
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto restore;
+
+ drain_rx_queue(port, 0);
+
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+ memset(pkt + RTE_ETHER_HDR_LEN, 0xDD, MSEG_RX_PKT_PAYLOAD);
+
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto close_sock;
+ }
+
+ nb_rx = receive_test_packets(port, 0, rx_mbufs, 1, sizeof(pkt), 0xDD);
+
+ if (nb_rx == 0) {
+ printf("No packet received\n");
+ goto close_sock;
+ }
+
+ struct rte_mbuf *m = rx_mbufs[0];
+ printf(" Received: pkt_len=%u nb_segs=%u\n", m->pkt_len, m->nb_segs);
+
+	if (m->nb_segs < 2) {
+		printf(" Expected multi-segment mbuf, got %u segments\n",
+		       m->nb_segs);
+		goto free_rx;
+	}
+
+ if (m->pkt_len < sizeof(pkt)) {
+ printf(" Packet too short: %u < %zu\n", m->pkt_len, sizeof(pkt));
+ goto free_rx;
+ }
+
+ /* Verify payload across segments */
+ uint32_t offset = 0;
+ struct rte_mbuf *seg = m;
+ uint32_t seg_off = 0;
+ int payload_ok = 1;
+
+ while (seg != NULL && offset < m->pkt_len) {
+ if (seg_off >= seg->data_len) {
+ seg = seg->next;
+ seg_off = 0;
+ continue;
+ }
+ if (offset >= RTE_ETHER_HDR_LEN &&
+ offset < RTE_ETHER_HDR_LEN + MSEG_RX_PKT_PAYLOAD) {
+ uint8_t *d = rte_pktmbuf_mtod_offset(seg, uint8_t *,
+ seg_off);
+ if (*d != 0xDD) {
+ printf(" Payload mismatch at offset %u\n", offset);
+ payload_ok = 0;
+ break;
+ }
+ }
+ offset++;
+ seg_off++;
+ }
+
+ if (!payload_ok)
+ goto free_rx;
+
+ printf(" Multi-seg RX test PASSED (%u segments)\n", m->nb_segs);
+ ret = 0;
+
+free_rx:
+ for (uint16_t i = 0; i < nb_rx; i++)
+ rte_pktmbuf_free(rx_mbufs[i]);
+
+close_sock:
+ close(sock);
+
+restore:
+ restore_single_queue(port);
+
+free_pool:
+ rte_mempool_free(small_mp);
+ return ret;
+}
+
+static int
+test_offload_config(int port)
+{
+ struct rte_eth_dev_info dev_info;
+ struct rte_eth_conf conf;
+ int ret;
+
+ printf("Testing rtap PMD offload configuration port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ uint64_t expected_tx = RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
+ RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_TSO;
+
+ if ((dev_info.tx_offload_capa & expected_tx) != expected_tx) {
+ printf("Missing TX offload capabilities\n");
+ return -1;
+ }
+
+ printf(" TX offload capa: 0x%" PRIx64 " OK\n",
+ dev_info.tx_offload_capa);
+
+ memset(&conf, 0, sizeof(conf));
+ conf.txmode.offloads = RTE_ETH_TX_OFFLOAD_UDP_CKSUM |
+ RTE_ETH_TX_OFFLOAD_TCP_CKSUM;
+
+ ret = port_reconfigure(port, 1, &conf, RING_SIZE, mp);
+ if (ret < 0) {
+ printf("Configure with TX offloads failed\n");
+ goto restore;
+ }
+
+ printf(" TX offload configuration: OK\n");
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+#define CSUM_PKT_PAYLOAD 32
+
+static int
+test_tx_csum_offload(int port)
+{
+ struct rte_ether_addr mac;
+ struct rte_mbuf *tx_mbuf;
+ uint8_t rx_buf[256];
+ int sock = -1;
+ uint16_t nb_tx, pkt_len;
+ ssize_t rx_len = 0;
+ int ret = -1;
+
+ printf("Testing rtap PMD TX checksum offload\n");
+
+	/* Ensure the port is back in its default single-queue config */
+	if (restore_single_queue(port) < 0 ||
+	    rte_eth_macaddr_get(port, &mac) < 0) {
+		printf("Failed to setup test\n");
+		return -1;
+	}
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ return -1;
+
+ tx_mbuf = rte_pktmbuf_alloc(mp);
+ if (tx_mbuf == NULL) {
+ printf("Failed to allocate mbuf\n");
+ goto out;
+ }
+
+ /* Build Eth + IPv4 + UDP + payload */
+ struct rte_ether_hdr *eth = rte_pktmbuf_mtod(tx_mbuf,
+ struct rte_ether_hdr *);
+	eth_addr_bcast(&eth->dst_addr);
+	memcpy(&eth->src_addr, &mac, RTE_ETHER_ADDR_LEN);
+ eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);
+
+ struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(eth + 1);
+ memset(ip, 0, sizeof(*ip));
+ ip->version_ihl = 0x45;
+ ip->total_length = htons(sizeof(*ip) + sizeof(struct rte_udp_hdr) +
+ CSUM_PKT_PAYLOAD);
+ ip->time_to_live = 64;
+ ip->next_proto_id = IPPROTO_UDP;
+ ip->src_addr = htonl(0x0a000001);
+ ip->dst_addr = htonl(0x0a000002);
+ ip->hdr_checksum = 0;
+ ip->hdr_checksum = rte_ipv4_cksum(ip);
+
+ struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(ip + 1);
+ udp->src_port = htons(1234);
+ udp->dst_port = htons(5678);
+ udp->dgram_len = htons(sizeof(*udp) + CSUM_PKT_PAYLOAD);
+ udp->dgram_cksum = 0;
+
+ uint8_t *payload = (uint8_t *)(udp + 1);
+ memset(payload, 0xCC, CSUM_PKT_PAYLOAD);
+
+ pkt_len = sizeof(*eth) + sizeof(*ip) + sizeof(*udp) + CSUM_PKT_PAYLOAD;
+ tx_mbuf->data_len = pkt_len;
+ tx_mbuf->pkt_len = pkt_len;
+
+ tx_mbuf->ol_flags = RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_UDP_CKSUM;
+ tx_mbuf->l2_len = sizeof(*eth);
+ tx_mbuf->l3_len = sizeof(*ip);
+ udp->dgram_cksum = rte_ipv4_phdr_cksum(ip, tx_mbuf->ol_flags);
+
+ rte_eth_stats_reset(port);
+
+ nb_tx = rte_eth_tx_burst(port, 0, &tx_mbuf, 1);
+ if (nb_tx != 1) {
+ printf("TX failed\n");
+ rte_pktmbuf_free(tx_mbuf);
+ goto out;
+ }
+
+ rx_len = poll_tap_socket(sock, rx_buf, sizeof(rx_buf),
+ 0x45, sizeof(*eth));
+
+ if (rx_len <= 0) {
+ printf("No packet captured\n");
+ goto out;
+ }
+
+ unsigned int cksum_off = sizeof(*eth) + sizeof(*ip) +
+ offsetof(struct rte_udp_hdr, dgram_cksum);
+ uint16_t captured_cksum;
+
+ memcpy(&captured_cksum, &rx_buf[cksum_off], sizeof(captured_cksum));
+
+ if (captured_cksum == 0)
+ printf(" Warning: UDP checksum is zero\n");
+ else
+ printf(" UDP cksum=0x%04x\n", ntohs(captured_cksum));
+
+ struct rte_eth_stats stats;
+ rte_eth_stats_get(port, &stats);
+ if (stats.opackets == 0) {
+ printf("TX stats not updated\n");
+ goto out;
+ }
+
+ printf(" TX csum offload PASSED (captured %zd bytes)\n", rx_len);
+ ret = 0;
+
+out:
+ close(sock);
+ return ret;
+}
+
+#define FLOOD_RING_SIZE 64
+#define FLOOD_NUM_PKTS 1000
+#define FLOOD_PKT_SIZE 128
+
+static int
+test_imissed_counter(int port)
+{
+ struct rte_eth_stats stats_before, stats_after, stats_after_reset;
+ struct rte_ether_addr mac;
+ uint8_t pkt[FLOOD_PKT_SIZE];
+ int sock = -1;
+ int ret = -1;
+
+ printf("Testing rtap PMD imissed counter port %d\n", port);
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 1, NULL, FLOOD_RING_SIZE, mp) < 0)
+ goto restore;
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto restore;
+
+ drain_rx_queue(port, 0);
+
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf("Failed to reset stats: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ ret = rte_eth_stats_get(port, &stats_before);
+ if (ret != 0) {
+ printf("Failed to get baseline stats: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ printf(" Flooding with %d packets (ring size %d)\n",
+ FLOOD_NUM_PKTS, FLOOD_RING_SIZE);
+
+ build_test_packet(pkt, FLOOD_PKT_SIZE, &mac, &mac);
+
+ for (int i = 0; i < FLOOD_NUM_PKTS; i++) {
+ if (send(sock, pkt, FLOOD_PKT_SIZE, 0) < 0) {
+ printf("send() failed after %d packets: %s\n",
+ i, strerror(errno));
+ goto close_sock;
+ }
+ }
+
+ usleep(100000); /* 100ms */
+
+ /* Drain whatever we can receive */
+ {
+ struct rte_mbuf *burst[RX_BURST_MAX];
+ uint16_t total_rx = 0;
+ int attempts = 0;
+
+ while (attempts++ < 100) {
+ uint16_t n = rte_eth_rx_burst(port, 0, burst, RX_BURST_MAX);
+ if (n > 0) {
+ rte_pktmbuf_free_bulk(burst, n);
+ total_rx += n;
+ } else {
+ usleep(1000);
+ }
+ }
+ printf(" Received %u packets out of %d sent\n",
+ total_rx, FLOOD_NUM_PKTS);
+ }
+
+ ret = rte_eth_stats_get(port, &stats_after);
+ if (ret != 0) {
+ printf("Failed to get stats after flood: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ printf(" Stats: ipackets=%"PRIu64" imissed=%"PRIu64"\n",
+ stats_after.ipackets, stats_after.imissed);
+
+ if (stats_after.ipackets == 0) {
+ printf(" ERROR: No packets received\n");
+ goto close_sock;
+ }
+
+ if (stats_after.imissed == 0) {
+ printf(" WARNING: No packets marked as missed\n");
+ } else {
+ printf(" SUCCESS: imissed counter working (%"PRIu64" drops)\n",
+ stats_after.imissed);
+ }
+
+ /* Test stats_reset clears imissed counter */
+ printf(" Testing stats_reset for imissed counter\n");
+ ret = rte_eth_stats_reset(port);
+ if (ret != 0) {
+ printf(" ERROR: stats_reset failed: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ ret = rte_eth_stats_get(port, &stats_after_reset);
+ if (ret != 0) {
+ printf(" ERROR: stats_get after reset failed: %s\n", rte_strerror(-ret));
+ goto close_sock;
+ }
+
+ if (stats_after_reset.imissed != 0 || stats_after_reset.ipackets != 0) {
+ printf(" ERROR: stats not reset properly\n");
+ goto close_sock;
+ }
+
+ printf(" Stats reset: OK (all counters zeroed)\n");
+ printf(" imissed counter test PASSED\n");
+ ret = 0;
+
+close_sock:
+ close(sock);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+#define LSC_TIMEOUT_US 500000 /* 500ms */
+#define LSC_POLL_US 1000 /* 1ms between polls */
+
+#define RXQ_INTR_TIMEOUT_MS 500 /* 500ms */
+
+static RTE_ATOMIC(int) lsc_event_count;
+static RTE_ATOMIC(int) lsc_last_status;
+
+static int
+test_lsc_callback(uint16_t port_id, enum rte_eth_event_type type,
+ void *param __rte_unused, void *ret_param __rte_unused)
+{
+ struct rte_eth_link link;
+
+ if (type != RTE_ETH_EVENT_INTR_LSC)
+ return 0;
+
+ if (rte_eth_link_get_nowait(port_id, &link) < 0) {
+ printf(" Link get nowait failed\n");
+ return 0;
+ }
+
+ rte_atomic_store_explicit(&lsc_last_status, link.link_status, rte_memory_order_relaxed);
+ rte_atomic_fetch_add_explicit(&lsc_event_count, 1, rte_memory_order_seq_cst);
+
+ printf(" LSC event #%d: port %u link %s\n",
+ rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_relaxed),
+ port_id,
+ link.link_status ? "UP" : "DOWN");
+
+ return 0;
+}
+
+static int
+test_lsc_interrupt(int port)
+{
+ struct rte_eth_conf lsc_conf;
+ int initial_count;
+ int ret = -1;
+
+ printf("Testing rtap PMD link state interrupt port %d\n", port);
+
+ memset(&lsc_conf, 0, sizeof(lsc_conf));
+ lsc_conf.intr_conf.lsc = 1;
+
+ if (port_reconfigure(port, 1, &lsc_conf, RING_SIZE, mp) < 0)
+ goto restore;
+
+ ret = rte_eth_dev_callback_register(port, RTE_ETH_EVENT_INTR_LSC,
+ test_lsc_callback, NULL);
+ if (ret < 0) {
+ printf("Failed to register LSC callback: %s\n",
+ rte_strerror(-ret));
+ goto restore;
+ }
+
+ rte_atomic_store_explicit(&lsc_event_count, 0, rte_memory_order_relaxed);
+ rte_atomic_store_explicit(&lsc_last_status, -1, rte_memory_order_relaxed);
+
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0) {
+ ret = -1;
+ goto stop;
+ }
+
+ printf(" Link is UP, setting link DOWN\n");
+ initial_count = rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_seq_cst);
+
+ ret = rte_eth_dev_set_link_down(port);
+ if (ret < 0) {
+ printf("Set link down failed: %s\n", rte_strerror(-ret));
+ goto stop;
+ }
+
+ if (wait_for_event(&lsc_event_count, initial_count, LSC_TIMEOUT_US) < 0) {
+ printf(" No LSC event received for link DOWN after %d us\n",
+ LSC_TIMEOUT_US);
+ if (verify_link_status(port, RTE_ETH_LINK_DOWN) < 0) {
+ ret = -1;
+ goto stop;
+ }
+ printf(" Link status is DOWN (verified via polling)\n");
+ } else {
+ printf(" LSC event received for link DOWN\n");
+ if (rte_atomic_load_explicit(&lsc_last_status, rte_memory_order_seq_cst) != RTE_ETH_LINK_DOWN) {
+ printf(" ERROR: expected DOWN status in callback\n");
+ ret = -1;
+ goto stop;
+ }
+ }
+
+ printf(" Setting link UP\n");
+ initial_count = rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_seq_cst);
+
+ ret = rte_eth_dev_set_link_up(port);
+ if (ret < 0) {
+ printf("Set link up failed: %s\n", rte_strerror(-ret));
+ goto stop;
+ }
+
+ if (wait_for_event(&lsc_event_count, initial_count, LSC_TIMEOUT_US) < 0) {
+ printf(" No LSC event received for link UP after %d us\n",
+ LSC_TIMEOUT_US);
+ if (verify_link_status(port, RTE_ETH_LINK_UP) < 0) {
+ ret = -1;
+ goto stop;
+ }
+ printf(" Link status is UP (verified via polling)\n");
+ } else {
+ printf(" LSC event received for link UP\n");
+ if (rte_atomic_load_explicit(&lsc_last_status, rte_memory_order_seq_cst) != RTE_ETH_LINK_UP) {
+ printf(" ERROR: expected UP status in callback\n");
+ ret = -1;
+ goto stop;
+ }
+ }
+
+ printf(" LSC interrupt test PASSED (total events: %d)\n",
+ rte_atomic_load_explicit(&lsc_event_count, rte_memory_order_relaxed));
+ ret = 0;
+
+stop:
+ rte_eth_dev_stop(port);
+ rte_eth_dev_callback_unregister(port, RTE_ETH_EVENT_INTR_LSC,
+ test_lsc_callback, NULL);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+static int
+test_rxq_interrupt(int port)
+{
+ struct rte_eth_conf rxq_conf;
+ struct rte_ether_addr mac;
+ uint8_t pkt[RTE_ETHER_HDR_LEN + TEST_PKT_PAYLOAD_LEN];
+ int sock = -1;
+ int epfd = -1;
+ int ret = -1;
+
+ printf("Testing rtap PMD RX queue interrupt port %d\n", port);
+
+ if (rte_eth_macaddr_get(port, &mac) < 0) {
+ printf("Failed to get MAC address\n");
+ return -1;
+ }
+
+ memset(&rxq_conf, 0, sizeof(rxq_conf));
+ rxq_conf.intr_conf.rxq = 1;
+
+ if (port_reconfigure(port, 1, &rxq_conf, RING_SIZE, mp) < 0)
+ goto restore;
+
+ /* Enable interrupt for queue 0 */
+ ret = rte_eth_dev_rx_intr_enable(port, 0);
+ if (ret < 0) {
+ printf(" rx_intr_enable failed: %s\n", rte_strerror(-ret));
+ goto restore;
+ }
+
+ /* Add queue 0's eventfd to the per-thread epoll set */
+	ret = rte_eth_dev_rx_intr_ctl_q(port, 0, RTE_EPOLL_PER_THREAD,
+					RTE_INTR_EVENT_ADD, NULL);
+ if (ret < 0) {
+ printf(" rx_intr_ctl_q(ADD) failed: %s\n", rte_strerror(-ret));
+ printf(" (epoll may not be available in this environment)\n");
+ epfd = -1;
+ } else {
+ epfd = RTE_EPOLL_PER_THREAD;
+ }
+
+ sock = setup_tap_socket_nb(TEST_TAP_NAME);
+ if (sock < 0)
+ goto disable_intr;
+
+ drain_rx_queue(port, 0);
+ build_test_packet(pkt, sizeof(pkt), NULL, &mac);
+
+ printf(" Injecting test packet\n");
+ if (send(sock, pkt, sizeof(pkt), 0) < 0) {
+ printf("send() failed: %s\n", strerror(errno));
+ goto close_sock;
+ }
+
+ /* Wait for the Rx interrupt via epoll */
+ if (epfd != -1) {
+ struct rte_epoll_event event;
+ int nfds;
+
+ nfds = rte_epoll_wait(epfd, &event, 1, RXQ_INTR_TIMEOUT_MS);
+ if (nfds < 0) {
+ printf(" rte_epoll_wait failed: %s\n",
+ rte_strerror(-nfds));
+ printf(" (Falling back to polling verification)\n");
+ } else if (nfds == 0) {
+ printf(" WARNING: epoll timeout - no Rx interrupt received\n");
+ printf(" (This may be expected in some test environments)\n");
+ } else {
+ printf(" Rx interrupt received via epoll: OK\n");
+ }
+ }
+
+ /* Verify the packet actually arrived */
+ {
+ struct rte_mbuf *rx_mbufs[RX_BURST_MAX];
+		uint16_t nb_rx = 0;
+ int elapsed = 0;
+
+ while (elapsed < TX_RX_TIMEOUT_US) {
+ nb_rx = rte_eth_rx_burst(port, 0, rx_mbufs, RX_BURST_MAX);
+ if (nb_rx > 0) {
+ rte_pktmbuf_free_bulk(rx_mbufs, nb_rx);
+ printf(" Packet received successfully\n");
+ break;
+ }
+ usleep(TX_RX_POLL_US);
+ elapsed += TX_RX_POLL_US;
+ }
+
+ if (nb_rx == 0) {
+ printf(" ERROR: No packet received\n");
+ ret = -1;
+ goto close_sock;
+ }
+ }
+
+ printf(" RX queue interrupt test PASSED\n");
+ ret = 0;
+
+close_sock:
+ close(sock);
+
+disable_intr:
+ rte_eth_dev_rx_intr_disable(port, 0);
+
+ if (epfd != -1)
+		rte_eth_dev_rx_intr_ctl_q(port, 0, RTE_EPOLL_PER_THREAD,
+					  RTE_INTR_EVENT_DEL, NULL);
+
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+/* ========== Extended Statistics (xstats) Tests ========== */
+
+static int
+test_xstats_get_names(int port)
+{
+ struct rte_eth_xstat_name *names = NULL;
+ int count, ret = -1;
+
+ printf("Testing xstats_get_names for port %d\n", port);
+
+ /* Get number of xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count < 0) {
+ printf("Error: xstats_get_names returned %d\n", count);
+ return -1;
+ }
+
+ printf(" Total xstats count: %d\n", count);
+
+ if (count == 0) {
+ printf(" Warning: no xstats available\n");
+ return 0;
+ }
+
+ /* Allocate and retrieve names */
+ names = calloc(count, sizeof(*names));
+ if (names == NULL) {
+ printf(" Error: failed to allocate memory for xstats names\n");
+ return -1;
+ }
+
+ ret = rte_eth_xstats_get_names(port, names, count);
+	if (ret != count) {
+ printf(" Error: xstats_get_names returned %d, expected %d\n",
+ ret, count);
+ goto cleanup;
+ }
+
+ /* Verify naming convention for a few stats */
+ bool found_rx_size = false;
+ bool found_tx_size = false;
+ bool found_rx_broadcast = false;
+ bool found_tx_broadcast = false;
+
+ for (int i = 0; i < count; i++) {
+ if (strstr(names[i].name, "rx_q") && strstr(names[i].name, "size_"))
+ found_rx_size = true;
+ if (strstr(names[i].name, "tx_q") && strstr(names[i].name, "size_"))
+ found_tx_size = true;
+ if (strstr(names[i].name, "rx_q") && strstr(names[i].name, "broadcast"))
+ found_rx_broadcast = true;
+ if (strstr(names[i].name, "tx_q") && strstr(names[i].name, "broadcast"))
+ found_tx_broadcast = true;
+ }
+
+ if (!found_rx_size || !found_tx_size || !found_rx_broadcast || !found_tx_broadcast) {
+ printf(" Error: expected xstats not found\n");
+ printf(" rx_size=%d tx_size=%d rx_broadcast=%d tx_broadcast=%d\n",
+ found_rx_size, found_tx_size, found_rx_broadcast, found_tx_broadcast);
+ ret = -1;
+ goto cleanup;
+ }
+
+ printf(" xstats naming convention verified: OK\n");
+ ret = 0;
+
+cleanup:
+ free(names);
+ return ret;
+}
+
+static int
+test_xstats_get_values(int port)
+{
+ struct rte_eth_xstat *xstats = NULL;
+ int count, ret = -1;
+
+ printf("Testing xstats_get values for port %d\n", port);
+
+ /* Get number of xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ return -1;
+ }
+
+ /* Allocate and retrieve values */
+ xstats = calloc(count, sizeof(*xstats));
+ if (xstats == NULL) {
+ printf(" Error: failed to allocate memory for xstats\n");
+ return -1;
+ }
+
+ ret = rte_eth_xstats_get(port, xstats, count);
+	if (ret != count) {
+ printf(" Error: xstats_get returned %d, expected %d\n", ret, count);
+ goto cleanup;
+ }
+
+ /* Verify IDs are sequential */
+ for (int i = 0; i < count; i++) {
+ if (xstats[i].id != (uint64_t)i) {
+ printf(" Error: xstats[%d].id=%"PRIu64", expected %d\n",
+ i, xstats[i].id, i);
+ ret = -1;
+ goto cleanup;
+ }
+ }
+
+ printf(" xstats IDs are sequential: OK\n");
+ ret = 0;
+
+cleanup:
+ free(xstats);
+ return ret;
+}
+
+static int
+test_xstats_reset(int port)
+{
+ struct rte_eth_xstat *xstats = NULL;
+ int count, ret = -1;
+
+ printf("Testing xstats_reset for port %d\n", port);
+
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ return -1;
+ }
+
+ xstats = calloc(count, sizeof(*xstats));
+ if (xstats == NULL) {
+ printf(" Error: failed to allocate memory for xstats\n");
+ return -1;
+ }
+
+ /* Reset xstats */
+ ret = rte_eth_xstats_reset(port);
+ if (ret != 0) {
+ printf(" Error: xstats_reset returned %d\n", ret);
+ goto cleanup;
+ }
+
+ /* Verify all xstats are zero */
+ ret = rte_eth_xstats_get(port, xstats, count);
+ if (ret < 0) {
+ printf(" Error: xstats_get after reset returned %d\n", ret);
+ goto cleanup;
+ }
+
+ for (int i = 0; i < count; i++) {
+ if (xstats[i].value != 0) {
+ printf(" Error: xstats[%d].value=%"PRIu64", expected 0\n",
+ i, xstats[i].value);
+ ret = -1;
+ goto cleanup;
+ }
+ }
+
+ printf(" All xstats reset to zero: OK\n");
+ ret = 0;
+
+cleanup:
+ free(xstats);
+ return ret;
+}
+
+static int
+test_xstats_packet_sizes(int port)
+{
+ struct rte_eth_xstat_name *names = NULL;
+ struct rte_eth_xstat *xstats = NULL;
+ struct rte_eth_dev_info dev_info;
+ struct rte_mbuf *tx_mbufs[4];
+	uint16_t pkt_sizes[] = {60, 100, 200, 500}; /* One per bucket: 64, 65-127, 128-255, 256-511 */
+ int count, ret = -1;
+ int sock = -1;
+ char ifname[IFNAMSIZ];
+
+ printf("Testing xstats packet size buckets for port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ snprintf(ifname, sizeof(ifname), "rtap_test%d", port);
+ sock = setup_tap_socket_nb(ifname);
+ if (sock < 0) {
+ printf(" Warning: cannot setup TAP socket, skipping test\n");
+ return 0;
+ }
+
+ /* Reset stats */
+ if (rte_eth_xstats_reset(port) != 0) {
+ printf(" Error: xstats_reset failed\n");
+ goto cleanup;
+ }
+
+ /* Send packets of different sizes */
+ for (unsigned int i = 0; i < RTE_DIM(pkt_sizes); i++) {
+ tx_mbufs[i] = rte_pktmbuf_alloc(mp);
+ if (tx_mbufs[i] == NULL) {
+ printf(" Error: failed to allocate mbuf\n");
+ goto cleanup;
+ }
+
+ uint8_t *pkt = rte_pktmbuf_mtod(tx_mbufs[i], uint8_t *);
+ build_test_packet(pkt, pkt_sizes[i], NULL, NULL);
+ tx_mbufs[i]->pkt_len = pkt_sizes[i];
+ tx_mbufs[i]->data_len = pkt_sizes[i];
+ }
+
+ uint16_t nb_tx = rte_eth_tx_burst(port, 0, tx_mbufs, RTE_DIM(tx_mbufs));
+ if (nb_tx != RTE_DIM(tx_mbufs)) {
+ printf(" Error: sent %u/%zu packets\n", nb_tx, RTE_DIM(tx_mbufs));
+ goto cleanup;
+ }
+
+ usleep(10000); /* Allow packets to be processed */
+
+ /* Get xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ goto cleanup;
+ }
+
+ names = calloc(count, sizeof(*names));
+ xstats = calloc(count, sizeof(*xstats));
+ if (names == NULL || xstats == NULL) {
+ printf(" Error: failed to allocate memory\n");
+ goto cleanup;
+ }
+
+ rte_eth_xstats_get_names(port, names, count);
+ rte_eth_xstats_get(port, xstats, count);
+
+ /* Find and verify size bucket stats */
+ int buckets_found = 0;
+ for (int i = 0; i < count; i++) {
+ if (strstr(names[i].name, "tx_q0_size_64_packets") && xstats[i].value > 0)
+ buckets_found++;
+ else if (strstr(names[i].name, "tx_q0_size_65_to_127_packets") && xstats[i].value > 0)
+ buckets_found++;
+ else if (strstr(names[i].name, "tx_q0_size_128_to_255_packets") && xstats[i].value > 0)
+ buckets_found++;
+ else if (strstr(names[i].name, "tx_q0_size_256_to_511_packets") && xstats[i].value > 0)
+ buckets_found++;
+ else if (strstr(names[i].name, "tx_q0_size_512_to_1023_packets") && xstats[i].value > 0)
+ buckets_found++;
+ }
+
+ if (buckets_found < 4) {
+ printf(" Warning: only found %d size buckets with packets\n", buckets_found);
+ printf(" (This may be expected if packets were coalesced)\n");
+ } else {
+ printf(" Packet size buckets populated: OK\n");
+ }
+
+ ret = 0;
+
+cleanup:
+ free(names);
+ free(xstats);
+ if (sock >= 0)
+ close(sock);
+ return ret;
+}
+
+static int
+test_xstats_packet_types(int port)
+{
+ struct rte_eth_xstat_name *names = NULL;
+ struct rte_eth_xstat *xstats = NULL;
+ struct rte_eth_dev_info dev_info;
+ struct rte_mbuf *tx_mbufs[3];
+ struct rte_ether_addr bcast, mcast, ucast;
+ int count, ret = -1;
+ int sock = -1;
+ char ifname[IFNAMSIZ];
+
+ printf("Testing xstats packet type classification for port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ snprintf(ifname, sizeof(ifname), "rtap_test%d", port);
+ sock = setup_tap_socket_nb(ifname);
+ if (sock < 0) {
+ printf(" Warning: cannot setup TAP socket, skipping test\n");
+ return 0;
+ }
+
+ /* Reset stats */
+ if (rte_eth_xstats_reset(port) != 0) {
+ printf(" Error: xstats_reset failed\n");
+ goto cleanup;
+ }
+
+ /* Prepare addresses */
+ eth_addr_bcast(&bcast);
+ mcast.addr_bytes[0] = 0x01;
+ mcast.addr_bytes[1] = 0x00;
+ mcast.addr_bytes[2] = 0x5e;
+ mcast.addr_bytes[3] = 0x00;
+ mcast.addr_bytes[4] = 0x00;
+ mcast.addr_bytes[5] = 0x01;
+ rte_eth_random_addr(ucast.addr_bytes);
+ ucast.addr_bytes[0] &= 0xfe; /* Clear multicast bit */
+
+ /* Send broadcast, multicast, and unicast packets */
+ for (int i = 0; i < 3; i++) {
+ tx_mbufs[i] = rte_pktmbuf_alloc(mp);
+ if (tx_mbufs[i] == NULL) {
+ printf(" Error: failed to allocate mbuf\n");
+ goto cleanup;
+ }
+
+ uint8_t *pkt = rte_pktmbuf_mtod(tx_mbufs[i], uint8_t *);
+ struct rte_ether_addr *dst = (i == 0) ? &bcast : (i == 1) ? &mcast : &ucast;
+ build_test_packet(pkt, 64, NULL, dst);
+ tx_mbufs[i]->pkt_len = 64;
+ tx_mbufs[i]->data_len = 64;
+ }
+
+ uint16_t nb_tx = rte_eth_tx_burst(port, 0, tx_mbufs, 3);
+ if (nb_tx != 3) {
+ printf(" Error: sent %u/3 packets\n", nb_tx);
+ goto cleanup;
+ }
+
+ usleep(10000);
+
+ /* Get xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ goto cleanup;
+ }
+
+ names = calloc(count, sizeof(*names));
+ xstats = calloc(count, sizeof(*xstats));
+ if (names == NULL || xstats == NULL) {
+ printf(" Error: failed to allocate memory\n");
+ goto cleanup;
+ }
+
+ rte_eth_xstats_get_names(port, names, count);
+ rte_eth_xstats_get(port, xstats, count);
+
+ /* Verify packet type stats */
+ bool found_bcast = false, found_mcast = false, found_ucast = false;
+ for (int i = 0; i < count; i++) {
+ if (strstr(names[i].name, "tx_q0_broadcast_packets") && xstats[i].value > 0)
+ found_bcast = true;
+ else if (strstr(names[i].name, "tx_q0_multicast_packets") && xstats[i].value > 0)
+ found_mcast = true;
+ else if (strstr(names[i].name, "tx_q0_unicast_packets") && xstats[i].value > 0)
+ found_ucast = true;
+ }
+
+ if (!found_bcast || !found_mcast || !found_ucast) {
+ printf(" Error: packet type classification failed\n");
+ printf(" broadcast=%d multicast=%d unicast=%d\n",
+ found_bcast, found_mcast, found_ucast);
+ ret = -1;
+ goto cleanup;
+ }
+
+ printf(" Packet type classification: OK\n");
+ ret = 0;
+
+cleanup:
+ free(names);
+ free(xstats);
+ if (sock >= 0)
+ close(sock);
+ return ret;
+}
+
+static int
+test_xstats_multiqueue(int port)
+{
+ struct rte_eth_xstat_name *names = NULL;
+ struct rte_eth_xstat *xstats = NULL;
+ int count, ret = -1;
+ const uint16_t nb_queues = 2;
+
+ printf("Testing xstats with multiple queues for port %d\n", port);
+
+ /* Reconfigure with 2 queues */
+ if (port_reconfigure(port, nb_queues, NULL, RING_SIZE, mp) < 0) {
+ printf(" Error: failed to reconfigure port with %u queues\n", nb_queues);
+ return -1;
+ }
+
+ /* Reset stats */
+ if (rte_eth_xstats_reset(port) != 0) {
+ printf(" Error: xstats_reset failed\n");
+ goto restore;
+ }
+
+ /* Get xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ goto restore;
+ }
+
+ names = calloc(count, sizeof(*names));
+ xstats = calloc(count, sizeof(*xstats));
+ if (names == NULL || xstats == NULL) {
+ printf(" Error: failed to allocate memory\n");
+ goto restore;
+ }
+
+ rte_eth_xstats_get_names(port, names, count);
+ rte_eth_xstats_get(port, xstats, count);
+
+ /* Verify we have stats for both queues */
+ bool found_q0 = false, found_q1 = false;
+ for (int i = 0; i < count; i++) {
+ if (strstr(names[i].name, "_q0_"))
+ found_q0 = true;
+ if (strstr(names[i].name, "_q1_"))
+ found_q1 = true;
+ }
+
+ if (!found_q0 || !found_q1) {
+ printf(" Error: missing stats for queues (q0=%d, q1=%d)\n",
+ found_q0, found_q1);
+ ret = -1;
+ goto cleanup;
+ }
+
+ printf(" Multi-queue xstats present: OK\n");
+ ret = 0;
+
+cleanup:
+ free(names);
+ free(xstats);
+restore:
+ restore_single_queue(port);
+ return ret;
+}
+
+static int
+test_xstats_offload_stats(int port)
+{
+ struct rte_eth_xstat_name *names = NULL;
+ struct rte_eth_xstat *xstats = NULL;
+ struct rte_eth_dev_info dev_info;
+ struct rte_mbuf *mb;
+ struct rte_ether_hdr *eth;
+ struct rte_ipv4_hdr *ip;
+ struct rte_udp_hdr *udp;
+ int count, ret = -1;
+ int sock = -1;
+ char ifname[IFNAMSIZ];
+
+ printf("Testing xstats offload statistics for port %d\n", port);
+
+ if (get_dev_info(port, &dev_info) < 0)
+ return -1;
+
+ snprintf(ifname, sizeof(ifname), "rtap_test%d", port);
+ sock = setup_tap_socket_nb(ifname);
+ if (sock < 0) {
+ printf(" Warning: cannot setup TAP socket, skipping test\n");
+ return 0;
+ }
+
+ /* Reset stats */
+ if (rte_eth_xstats_reset(port) != 0) {
+ printf(" Error: xstats_reset failed\n");
+ goto cleanup;
+ }
+
+ /* Create packet with checksum offload request */
+ mb = rte_pktmbuf_alloc(mp);
+ if (mb == NULL) {
+ printf(" Error: failed to allocate mbuf\n");
+ goto cleanup;
+ }
+
+ eth = rte_pktmbuf_mtod(mb, struct rte_ether_hdr *);
+ ip = (struct rte_ipv4_hdr *)(eth + 1);
+ udp = (struct rte_udp_hdr *)(ip + 1);
+
+ /* Build packet headers */
+ rte_eth_random_addr(eth->dst_addr.addr_bytes);
+ rte_eth_random_addr(eth->src_addr.addr_bytes);
+ eth->ether_type = rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4);
+
+	memset(ip, 0, sizeof(*ip));
+	ip->version_ihl = 0x45;
+	ip->total_length = rte_cpu_to_be_16(sizeof(*ip) + sizeof(*udp) + 20);
+	ip->time_to_live = 64;
+	ip->next_proto_id = IPPROTO_UDP;
+	ip->src_addr = rte_cpu_to_be_32(0x0a000001);
+	ip->dst_addr = rte_cpu_to_be_32(0x0a000002);
+
+ udp->src_port = rte_cpu_to_be_16(1234);
+ udp->dst_port = rte_cpu_to_be_16(5678);
+ udp->dgram_len = rte_cpu_to_be_16(sizeof(*udp) + 20);
+
+ mb->l2_len = sizeof(*eth);
+ mb->l3_len = sizeof(*ip);
+ mb->l4_len = sizeof(*udp);
+ mb->ol_flags = RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_UDP_CKSUM | RTE_MBUF_F_TX_IP_CKSUM;
+ mb->data_len = sizeof(*eth) + sizeof(*ip) + sizeof(*udp) + 20;
+ mb->pkt_len = mb->data_len;
+
+ uint16_t nb_tx = rte_eth_tx_burst(port, 0, &mb, 1);
+ if (nb_tx != 1) {
+ printf(" Error: failed to transmit packet\n");
+ rte_pktmbuf_free(mb);
+ goto cleanup;
+ }
+
+ usleep(10000);
+
+ /* Get xstats */
+ count = rte_eth_xstats_get_names(port, NULL, 0);
+ if (count <= 0) {
+ printf(" Error: no xstats available\n");
+ goto cleanup;
+ }
+
+ names = calloc(count, sizeof(*names));
+ xstats = calloc(count, sizeof(*xstats));
+ if (names == NULL || xstats == NULL) {
+ printf(" Error: failed to allocate memory\n");
+ goto cleanup;
+ }
+
+ rte_eth_xstats_get_names(port, names, count);
+ rte_eth_xstats_get(port, xstats, count);
+
+ /* Look for checksum offload stat */
+ bool found_cksum_offload = false;
+ for (int i = 0; i < count; i++) {
+ if (strstr(names[i].name, "tx_q0_checksum_offload_packets") &&
+ xstats[i].value > 0) {
+ found_cksum_offload = true;
+ break;
+ }
+ }
+
+ if (!found_cksum_offload) {
+ printf(" Warning: checksum offload stat not incremented\n");
+ printf(" (This may be expected if offload was not used)\n");
+ } else {
+ printf(" Checksum offload stat incremented: OK\n");
+ }
+
+ ret = 0;
+
+cleanup:
+ free(names);
+ free(xstats);
+ if (sock >= 0)
+ close(sock);
+ return ret;
+}
+
+static int
+test_fd_leak(void)
+{
+ int fd_before, fd_after;
+ int port = -1;
+ int ret;
+
+ printf("Testing rtap PMD file descriptor leak\n");
+
+ fd_before = count_open_fds();
+ if (fd_before < 0) {
+ printf("Cannot count open fds\n");
+ return -1;
+ }
+
+ printf(" Open fds before: %d\n", fd_before);
+
+ if (rte_vdev_init("net_rtap_fdtest", "iface=rtap_fdtest") < 0) {
+ printf("Failed to create net_rtap_fdtest\n");
+ return -1;
+ }
+
+ uint16_t p;
+ RTE_ETH_FOREACH_DEV(p) {
+ struct rte_eth_dev_info info;
+ if (rte_eth_dev_info_get(p, &info) != 0)
+ continue;
+ if (p == (uint16_t)rtap_port0 || p == (uint16_t)rtap_port1)
+ continue;
+ if (strstr(info.driver_name, "rtap") != NULL) {
+ port = p;
+ break;
+ }
+ }
+
+ if (port < 0) {
+ printf("Failed to find fd-test port\n");
+ rte_vdev_uninit("net_rtap_fdtest");
+ return -1;
+ }
+
+ if (port_reconfigure(port, 2, NULL, RING_SIZE, mp) < 0)
+ goto cleanup;
+
+ ret = rte_eth_dev_stop(port);
+ if (ret != 0)
+ printf("Warning: stop returned %d\n", ret);
+
+ rte_eth_dev_close(port);
+ rte_vdev_uninit("net_rtap_fdtest");
+
+ fd_after = count_open_fds();
+ printf(" Open fds after: %d\n", fd_after);
+
+ if (fd_after != fd_before) {
+ printf(" ERROR: fd leak detected: %d fds leaked\n",
+ fd_after - fd_before);
+ return -1;
+ }
+
+ printf(" fd leak test PASSED\n");
+ return 0;
+
+cleanup:
+ rte_eth_dev_stop(port);
+ rte_eth_dev_close(port);
+ rte_vdev_uninit("net_rtap_fdtest");
+ return -1;
+}
+
+static void
+test_rtap_cleanup(void)
+{
+ int ret;
+
+ if (rtap_port0 >= 0) {
+ ret = rte_eth_dev_stop(rtap_port0);
+ if (ret != 0)
+ printf("Error: failed to stop port %u: %s\n",
+ rtap_port0, rte_strerror(-ret));
+ rte_eth_dev_close(rtap_port0);
+ }
+
+ if (rtap_port1 >= 0) {
+ ret = rte_eth_dev_stop(rtap_port1);
+ if (ret != 0)
+ printf("Error: failed to stop port %u: %s\n",
+ rtap_port1, rte_strerror(-ret));
+ rte_eth_dev_close(rtap_port1);
+ }
+
+ rte_mempool_free(mp);
+ rte_vdev_uninit("net_rtap0");
+ rte_vdev_uninit("net_rtap1");
+}
+
+static int
+test_pmd_rtap_setup(void)
+{
+ uint16_t nb_ports;
+
+ if (check_rtap_available() < 0) {
+ printf("RTAP not available, skipping tests\n");
+ return TEST_SKIPPED;
+ }
+
+ nb_ports = rte_eth_dev_count_avail();
+ printf("nb_ports before rtap creation=%d\n", (int)nb_ports);
+
+ mp = rte_pktmbuf_pool_create("mbuf_pool", NB_MBUF, 32,
+ 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
+ if (mp == NULL) {
+ printf("Failed to create mempool\n");
+ return TEST_FAILED;
+ }
+
+ if (rte_vdev_init("net_rtap0", "iface=rtap_test0") < 0) {
+ printf("Failed to create net_rtap0\n");
+ rte_mempool_free(mp);
+ return TEST_FAILED;
+ }
+
+ if (rte_vdev_init("net_rtap1", "iface=rtap_test1") < 0) {
+ printf("Failed to create net_rtap1\n");
+ rte_vdev_uninit("net_rtap0");
+ rte_mempool_free(mp);
+ return TEST_FAILED;
+ }
+
+ uint16_t port;
+ RTE_ETH_FOREACH_DEV(port) {
+ struct rte_eth_dev_info dev_info;
+ int ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0)
+ continue;
+
+ if (strstr(dev_info.driver_name, "rtap") != NULL ||
+ strstr(dev_info.driver_name, "RTAP") != NULL) {
+ if (rtap_port0 < 0)
+ rtap_port0 = port;
+ else if (rtap_port1 < 0)
+ rtap_port1 = port;
+ }
+ }
+
+ if (rtap_port0 < 0) {
+ printf("Failed to find rtap ports\n");
+ test_rtap_cleanup();
+ return TEST_FAILED;
+ }
+
+ printf("rtap_port0=%d rtap_port1=%d\n", rtap_port0, rtap_port1);
+ return TEST_SUCCESS;
+}
+
+static int
+test_ethdev_configure_ports(void)
+{
+ TEST_ASSERT((test_ethdev_configure_port(rtap_port0) == 0),
+ "test ethdev configure port rtap_port0 failed");
+
+ if (rtap_port1 >= 0) {
+ TEST_ASSERT((test_ethdev_configure_port(rtap_port1) == 0),
+ "test ethdev configure port rtap_port1 failed");
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_command_line_rtap_port(void)
+{
+ int port, cmdl_port = -1;
+ int ret;
+
+ printf("Testing command line created rtap port\n");
+
+ RTE_ETH_FOREACH_DEV(port) {
+ struct rte_eth_dev_info dev_info;
+
+ ret = rte_eth_dev_info_get(port, &dev_info);
+ if (ret != 0)
+ continue;
+
+ if (port == rtap_port0 || port == rtap_port1)
+ continue;
+
+ if (strstr(dev_info.driver_name, "rtap") != NULL ||
+ strstr(dev_info.driver_name, "RTAP") != NULL) {
+ printf("Found command line rtap port=%d\n", port);
+ cmdl_port = port;
+ break;
+ }
+ }
+
+ if (cmdl_port != -1) {
+ TEST_ASSERT((test_ethdev_configure_port(cmdl_port) == 0),
+ "test ethdev configure cmdl_port failed");
+ TEST_ASSERT((test_stats_reset(cmdl_port) == 0),
+ "test stats reset cmdl_port failed");
+ TEST_ASSERT((test_get_stats(cmdl_port) == 0),
+ "test get stats cmdl_port failed");
+ TEST_ASSERT((rte_eth_dev_stop(cmdl_port) == 0),
+ "test stop cmdl_port failed");
+ }
+
+ return TEST_SUCCESS;
+}
+
+/* Test case wrappers */
+#define TEST_CASE_WRAPPER(name, func) \
+ static int test_##name##_for_port(void) { \
+ TEST_ASSERT(func(rtap_port0) == 0, #name " failed"); \
+ return TEST_SUCCESS; \
+ }
+
+TEST_CASE_WRAPPER(get_stats, test_get_stats)
+TEST_CASE_WRAPPER(stats_reset, test_stats_reset)
+TEST_CASE_WRAPPER(dev_info, test_dev_info)
+TEST_CASE_WRAPPER(link_status, test_link_status)
+TEST_CASE_WRAPPER(link_up_down, test_set_link_up_down)
+TEST_CASE_WRAPPER(promiscuous, test_promiscuous_mode)
+TEST_CASE_WRAPPER(allmulticast, test_allmulticast_mode)
+TEST_CASE_WRAPPER(mac_address, test_mac_address)
+TEST_CASE_WRAPPER(mtu, test_mtu_set)
+TEST_CASE_WRAPPER(multiqueue, test_multiqueue)
+TEST_CASE_WRAPPER(multiqueue_reduce, test_multiqueue_reduce)
+TEST_CASE_WRAPPER(multiqueue_mismatch, test_multiqueue_mismatch)
+TEST_CASE_WRAPPER(queue_reconfigure, test_queue_reconfigure)
+TEST_CASE_WRAPPER(rx_inject, test_rx_inject)
+TEST_CASE_WRAPPER(tx_capture, test_tx_capture)
+TEST_CASE_WRAPPER(tx_multiseg, test_tx_multiseg)
+TEST_CASE_WRAPPER(rx_multiseg, test_rx_multiseg)
+TEST_CASE_WRAPPER(offload_config, test_offload_config)
+TEST_CASE_WRAPPER(tx_csum_offload, test_tx_csum_offload)
+TEST_CASE_WRAPPER(stats_imissed, test_imissed_counter)
+TEST_CASE_WRAPPER(lsc_interrupt, test_lsc_interrupt)
+TEST_CASE_WRAPPER(rxq_interrupt, test_rxq_interrupt)
+TEST_CASE_WRAPPER(xstats_get_names, test_xstats_get_names)
+TEST_CASE_WRAPPER(xstats_get_values, test_xstats_get_values)
+TEST_CASE_WRAPPER(xstats_reset, test_xstats_reset)
+TEST_CASE_WRAPPER(xstats_packet_sizes, test_xstats_packet_sizes)
+TEST_CASE_WRAPPER(xstats_packet_types, test_xstats_packet_types)
+TEST_CASE_WRAPPER(xstats_multiqueue, test_xstats_multiqueue)
+TEST_CASE_WRAPPER(xstats_offload_stats, test_xstats_offload_stats)
+
+static int
+test_fd_leak_for_port(void)
+{
+ TEST_ASSERT(test_fd_leak() == 0, "test fd leak failed");
+ return TEST_SUCCESS;
+}
+
+static struct unit_test_suite test_pmd_rtap_suite = {
+ .setup = test_pmd_rtap_setup,
+ .teardown = test_rtap_cleanup,
+ .suite_name = "Test Pmd RTAP Unit Test Suite",
+ .unit_test_cases = {
+ TEST_CASE(test_ethdev_configure_ports),
+ TEST_CASE(test_dev_info_for_port),
+ TEST_CASE(test_link_status_for_port),
+ TEST_CASE(test_link_up_down_for_port),
+ TEST_CASE(test_get_stats_for_port),
+ TEST_CASE(test_stats_reset_for_port),
+ TEST_CASE(test_stats_imissed_for_port),
+ TEST_CASE(test_promiscuous_for_port),
+ TEST_CASE(test_allmulticast_for_port),
+ TEST_CASE(test_mac_address_for_port),
+ TEST_CASE(test_mtu_for_port),
+ TEST_CASE(test_multiqueue_for_port),
+ TEST_CASE(test_multiqueue_reduce_for_port),
+ TEST_CASE(test_multiqueue_mismatch_for_port),
+ TEST_CASE(test_queue_reconfigure_for_port),
+ TEST_CASE(test_rx_inject_for_port),
+ TEST_CASE(test_tx_capture_for_port),
+ TEST_CASE(test_tx_multiseg_for_port),
+ TEST_CASE(test_rx_multiseg_for_port),
+ TEST_CASE(test_offload_config_for_port),
+ TEST_CASE(test_tx_csum_offload_for_port),
+ TEST_CASE(test_lsc_interrupt_for_port),
+ TEST_CASE(test_rxq_interrupt_for_port),
+ TEST_CASE(test_xstats_get_names_for_port),
+ TEST_CASE(test_xstats_get_values_for_port),
+ TEST_CASE(test_xstats_reset_for_port),
+ TEST_CASE(test_xstats_packet_sizes_for_port),
+ TEST_CASE(test_xstats_packet_types_for_port),
+ TEST_CASE(test_xstats_multiqueue_for_port),
+ TEST_CASE(test_xstats_offload_stats_for_port),
+ TEST_CASE(test_fd_leak_for_port),
+ TEST_CASE(test_command_line_rtap_port),
+ TEST_CASES_END()
+ }
+};
+
+static int
+test_pmd_rtap(void)
+{
+ return unit_test_suite_runner(&test_pmd_rtap_suite);
+}
+
+REGISTER_FAST_TEST(rtap_pmd_autotest, NOHUGE_OK, ASAN_OK, test_pmd_rtap);
--
2.51.0
* RE: [PATCH v6 00/11] net/rtap: add io_uring based TAP driver
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
` (10 preceding siblings ...)
2026-02-14 23:44 ` [PATCH v6 11/11] test: add unit tests for rtap PMD Stephen Hemminger
@ 2026-02-15 8:58 ` Konstantin Ananyev
2026-02-15 17:08 ` Stephen Hemminger
11 siblings, 1 reply; 72+ messages in thread
From: Konstantin Ananyev @ 2026-02-15 8:58 UTC (permalink / raw)
To: Stephen Hemminger, dev@dpdk.org
>
> This series adds net_rtap, an experimental poll mode driver that uses
> Linux io_uring for asynchronous packet I/O with kernel TAP interfaces.
>
> Like net_tap, net_rtap creates a kernel network interface visible to
> standard tools (ip, ethtool) and the Linux TCP/IP stack. From DPDK
> it is an ordinary ethdev.
>
> Motivation
> ----------
>
> This driver started as an experiment to determine whether Linux
> io_uring could deliver better packet I/O performance than the
> traditional read()/write() system calls used by net_tap. By posting
> batches of I/O requests asynchronously, io_uring amortizes system
> call overhead across multiple packets.
Sounds interesting...
Curious, did you make any perf comparisons vs our traditional tap?

> The project also served as a testbed for using AI tooling to help
> build a comprehensive test suite, refactor code, and improve
> documentation. The result is intended as an example for other PMD
> authors: the driver has thorough unit tests covering data path,
> offloads, multi-queue, fd lifecycle, and more, along with detailed
> code comments explaining design choices.
> Why not extend net_tap?
> -----------------------
>
> The existing net_tap driver was designed to provide feature parity
> with mlx5 when used behind the failsafe PMD. That goal led to
> significant complexity: rte_flow support emulated via eBPF programs,
> software GSO implementation, and other features that duplicate in
> user space what the kernel already does.
>
> net_rtap takes the opposite approach -- use the kernel efficiently
> and let it do what it does well. There is no rte_flow support;
> receive queue selection is left to the kernel's native RSS/steering.
> There is no software GSO; the driver passes segmentation requests
> to the kernel via the virtio-net header and lets the kernel handle
> it. The result is a much simpler driver that is easier to maintain
> and reason about.
>
> Given these fundamentally different design goals, a clean
> implementation was more practical than refactoring net_tap.
>
> Acknowledgement
> ---------------
>
> Parts of the test suite, code review, and refactoring were done
> with the assistance of Anthropic Claude (AI). All generated code
> was reviewed and tested by the author.
>
> Requirements:
> - Kernel headers with IORING_ASYNC_CANCEL_ALL (upstream since 5.19)
> - liburing >= 2.0
>
> Should work on distributions: Debian 12+, Ubuntu 24.04+,
> Fedora 37+, SLES 15 SP6+ / openSUSE Tumbleweed.
> RHEL 9 is not supported (io_uring is disabled by default).
>
> v6:
> - fix lots of bugs found doing automated review
> - convert to use ifindex and netlink to avoid rename issues
> - implement xstats
>
> Stephen Hemminger (11):
> net/rtap: add driver skeleton and documentation
> net/rtap: add TAP device creation and queue management
> net/rtap: add Rx/Tx with scatter/gather support
> net/rtap: add statistics and device info
> net/rtap: add link and device management operations
> net/rtap: add checksum and TSO offload support
> net/rtap: add multi-process support
> net/rtap: add link state change interrupt
> net/rtap: add Rx interrupt support
> net/rtap: add extended statistics support
> test: add unit tests for rtap PMD
>
> MAINTAINERS | 7 +
> app/test/meson.build | 1 +
> app/test/test_pmd_rtap.c | 2620 ++++++++++++++++++++++++
> doc/guides/nics/features/rtap.ini | 26 +
> doc/guides/nics/index.rst | 1 +
> doc/guides/nics/rtap.rst | 101 +
> doc/guides/rel_notes/release_26_03.rst | 7 +
> drivers/net/meson.build | 1 +
> drivers/net/rtap/meson.build | 30 +
> drivers/net/rtap/rtap.h | 152 ++
> drivers/net/rtap/rtap_ethdev.c | 864 ++++++++
> drivers/net/rtap/rtap_intr.c | 207 ++
> drivers/net/rtap/rtap_netlink.c | 445 ++++
> drivers/net/rtap/rtap_rxtx.c | 803 ++++++++
> drivers/net/rtap/rtap_xstats.c | 293 +++
> 15 files changed, 5558 insertions(+)
> create mode 100644 app/test/test_pmd_rtap.c
> create mode 100644 doc/guides/nics/features/rtap.ini
> create mode 100644 doc/guides/nics/rtap.rst
> create mode 100644 drivers/net/rtap/meson.build
> create mode 100644 drivers/net/rtap/rtap.h
> create mode 100644 drivers/net/rtap/rtap_ethdev.c
> create mode 100644 drivers/net/rtap/rtap_intr.c
> create mode 100644 drivers/net/rtap/rtap_netlink.c
> create mode 100644 drivers/net/rtap/rtap_rxtx.c
> create mode 100644 drivers/net/rtap/rtap_xstats.c
>
> --
> 2.51.0
* Re: [PATCH v6 00/11] net/rtap: add io_uring based TAP driver
2026-02-15 8:58 ` [PATCH v6 00/11] net/rtap: add io_uring based TAP driver Konstantin Ananyev
@ 2026-02-15 17:08 ` Stephen Hemminger
0 siblings, 0 replies; 72+ messages in thread
From: Stephen Hemminger @ 2026-02-15 17:08 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: dev@dpdk.org
On Sun, 15 Feb 2026 08:58:14 +0000
Konstantin Ananyev <konstantin.ananyev@huawei.com> wrote:
> >
> > This driver started as an experiment to determine whether Linux
> > io_uring could deliver better packet I/O performance than the
> > traditional read()/write() system calls used by net_tap. By posting
> > batches of I/O requests asynchronously, io_uring amortizes system
> > call overhead across multiple packets.
>
> Sounds interesting...
> Curious did you make any perf comparisons vs our traditional tap?
That is the next step. I am trying to set up a testbed that is a realistic configuration.
end of thread, other threads:[~2026-02-15 17:08 UTC | newest]
Thread overview: 72+ messages
2024-12-10 21:23 [RFC 0/8] ioring: network driver Stephen Hemminger
2024-12-10 21:23 ` [RFC 1/8] net/ioring: introduce new driver Stephen Hemminger
2024-12-10 21:23 ` [RFC 2/8] net/ioring: implement link state Stephen Hemminger
2024-12-10 21:23 ` [RFC 3/8] net/ioring: implement control functions Stephen Hemminger
2024-12-10 21:23 ` [RFC 4/8] net/ioring: implement management functions Stephen Hemminger
2024-12-10 21:23 ` [RFC 5/8] net/ioring: implement primary secondary fd passing Stephen Hemminger
2024-12-10 21:23 ` [RFC 6/8] net/ioring: implement receive and transmit Stephen Hemminger
2024-12-10 21:23 ` [RFC 7/8] net/ioring: add VLAN support Stephen Hemminger
2024-12-10 21:23 ` [RFC 8/8] net/ioring: implement statistics Stephen Hemminger
2024-12-11 11:34 ` [RFC 0/8] ioring: network driver Konstantin Ananyev
2024-12-11 15:03 ` Stephen Hemminger
2024-12-12 19:06 ` Konstantin Ananyev
2024-12-19 15:40 ` Morten Brørup
2024-12-20 14:34 ` Konstantin Ananyev
2024-12-20 16:19 ` Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 " Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 1/8] net/ioring: introduce new driver Stephen Hemminger
2024-12-28 16:39 ` Morten Brørup
2024-12-11 16:28 ` [PATCH v2 2/8] net/ioring: implement link state Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 3/8] net/ioring: implement control functions Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 4/8] net/ioring: implement management functions Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 5/8] net/ioring: implement primary secondary fd passing Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 6/8] net/ioring: implement receive and transmit Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 7/8] net/ioring: add VLAN support Stephen Hemminger
2024-12-11 16:28 ` [PATCH v2 8/8] net/ioring: implement statistics Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 0/9] ioring PMD device Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 1/9] net/ioring: introduce new driver Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 2/9] net/ioring: implement link state Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 3/9] net/ioring: implement control functions Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 4/9] net/ioring: implement management functions Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 5/9] net/ioring: implement secondary process support Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 6/9] net/ioring: implement receive and transmit Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 7/9] net/ioring: add VLAN support Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 8/9] net/ioring: implement statistics Stephen Hemminger
2025-03-11 23:51 ` [PATCH v3 9/9] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 00/10] new ioring PMD Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 01/10] net/ioring: introduce new driver Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 02/10] net/ioring: implement link state Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 03/10] net/ioring: implement control functions Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 04/10] net/ioring: implement management functions Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 05/10] net/ioring: implement secondary process support Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 06/10] net/ioring: implement receive and transmit Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 07/10] net/ioring: implement statistics Stephen Hemminger
2025-03-13 21:50 ` [PATCH v4 08/10] net/ioring: support multi-segment Rx and Tx Stephen Hemminger
2025-03-13 21:51 ` [PATCH v4 09/10] net/ioring: support Tx checksum and segment offload Stephen Hemminger
2025-03-13 21:51 ` [PATCH v4 10/10] net/ioring: add support for Rx offload Stephen Hemminger
2026-02-09 18:38 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 01/10] net/rtap: add driver skeleton and documentation Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 02/10] net/rtap: add TAP device creation and queue management Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 03/10] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 04/10] net/rtap: add statistics and device info Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 05/10] net/rtap: add link and device management operations Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 06/10] net/rtap: add checksum and TSO offload support Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 07/10] net/rtap: add link state change interrupt Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 08/10] net/rtap: add multi-process support Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 09/10] net/rtap: add Rx interrupt support Stephen Hemminger
2026-02-09 18:39 ` [PATCH v5 10/10] test: add unit tests for rtap PMD Stephen Hemminger
2026-02-10 9:18 ` [PATCH v5 00/10] net/rtap: add io_uring based TAP driver Morten Brørup
2026-02-14 23:44 ` [PATCH v6 00/11] " Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 01/11] net/rtap: add driver skeleton and documentation Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 02/11] net/rtap: add TAP device creation and queue management Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 03/11] net/rtap: add Rx/Tx with scatter/gather support Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 04/11] net/rtap: add statistics and device info Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 05/11] net/rtap: add link and device management operations Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 06/11] net/rtap: add checksum and TSO offload support Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 07/11] net/rtap: add multi-process support Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 08/11] net/rtap: add link state change interrupt Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 09/11] net/rtap: add Rx interrupt support Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 10/11] net/rtap: add extended statistics support Stephen Hemminger
2026-02-14 23:44 ` [PATCH v6 11/11] test: add unit tests for rtap PMD Stephen Hemminger
2026-02-15 8:58 ` [PATCH v6 00/11] net/rtap: add io_uring based TAP driver Konstantin Ananyev
2026-02-15 17:08 ` Stephen Hemminger