* [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP
@ 2025-10-20 16:23 Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
                   ` (14 more replies)
  0 siblings, 15 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP, which require
reconfiguring/restarting queues in the physical netdev.

This patchset adds the concept of queue peering to virtual netdevs, which
allows containers to use memory providers and AF_XDP at native speed.
These mapped queues are bound to a real queue in a physical netdev and
act as a proxy.

Memory providers and AF_XDP operations take an ifindex and queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a mapped queue, which then gets proxied to the underlying real
queue. Peered queues are created and bound to a real queue atomically
through a generic ynl netdev operation.
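
As an illustrative sketch (not part of this series) of what a container
application then does: bind an AF_XDP socket against the virtual netdev's
ifindex and the mapped queue id. The ifindex 8 and queue id 1 below are
placeholder values matching the examples in the individual patches:

  #include <linux/if_xdp.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static int xsk_bind_mapped_queue(void)
  {
          struct sockaddr_xdp sxdp;
          int fd = socket(AF_XDP, SOCK_RAW, 0);

          if (fd < 0)
                  return -1;
          /* UMEM registration and fill/rx ring setup omitted for
           * brevity; a real socket must set these up before bind().
           */
          memset(&sxdp, 0, sizeof(sxdp));
          sxdp.sxdp_family = AF_XDP;
          sxdp.sxdp_ifindex = 8;          /* virtual netdev, e.g. netkit */
          sxdp.sxdp_queue_id = 1;         /* mapped rxq from bind-queue */
          sxdp.sxdp_flags = XDP_ZEROCOPY;
          if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;
  }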

We have implemented support for this concept in netkit and tested it
against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504 (bnxt_en)
100G NICs. For more details see the individual patches.

v2->v3:
 - Use netdev_ops_assert_locked instead of netdev_assert_locked (syzbot)
 - Add missing netdev_lockdep_set_classes in netkit
v1->v2:
 - Removed bind sample ynl code (Stan)
 - Reworked netdev locking to have consistent order (Stan, Kuba)
 - Return 'not supported' in API patch (Stan)
 - Improved ynl documentation (Kuba)
 - Added 'max: s32-max' in ynl spec for ifindex (Kuba)
 - Also added a queue type in ynl to have the user specify rx
   explicitly (Kuba)
 - Use of netdev_hold (Kuba)
 - Avoid static inlines from another header (Kuba)
 - Squashed some commits (Kuba, Stan)
 - Removed ndo_{peer,unpeer}_queues callback and simplified
   code (Kuba)
 - Improved commit messages (Toke, Kuba, Stan, zf)
 - Got rid of locking genl_sk_priv_get (Stan)
 - Removed af_xdp cleanup churn (Maciej)
 - Added netdev locking asserts (Stan)
 - Reject ethtool ioctl path queue resizing (Kuba)
 - Added kdoc for ndo_queue_create (Stan)
 - Uninvert logic in netkit single dev mode (Jordan)
 - Added binding support for multiple queues

Daniel Borkmann (9):
  net, ethtool: Disallow peered real rxqs to be resized
  xsk: Move NETDEV_XDP_ACT_ZC into generic header
  xsk: Move pool registration into single function
  xsk: Add small helper xp_pool_bindable
  xsk: Change xsk_rcv_check to check netdev/queue_id from pool
  xsk: Proxy pool management for mapped queues
  netkit: Add single device mode for netkit
  netkit: Document fast vs slowpath members via macros
  netkit: Add xsk support for af_xdp applications

David Wei (6):
  net: Add bind-queue operation
  net: Implement netdev_nl_bind_queue_doit
  net: Add peer info to queue-get response
  net: Proxy net_mp_{open,close}_rxq for mapped queues
  netkit: Implement rtnl_link_ops->alloc and ndo_queue_create
  netkit: Add io_uring zero-copy support for TCP

 Documentation/netlink/specs/netdev.yaml |  84 ++++++
 drivers/net/netkit.c                    | 330 ++++++++++++++++++++----
 include/linux/ethtool.h                 |   1 +
 include/net/netdev_queues.h             |   5 +
 include/net/netdev_rx_queue.h           |  39 ++-
 include/net/page_pool/memory_provider.h |   4 +-
 include/net/xdp_sock_drv.h              |   8 +-
 include/uapi/linux/if_link.h            |   6 +
 include/uapi/linux/netdev.h             |  22 ++
 net/core/netdev-genl-gen.c              |  25 ++
 net/core/netdev-genl-gen.h              |   1 +
 net/core/netdev-genl.c                  | 177 ++++++++++++-
 net/core/netdev_rx_queue.c              | 126 +++++++--
 net/ethtool/channels.c                  |  12 +-
 net/ethtool/common.c                    |  10 +-
 net/ethtool/ioctl.c                     |   4 +-
 net/xdp/xsk.c                           |  44 +++-
 net/xdp/xsk.h                           |   5 +-
 net/xdp/xsk_buff_pool.c                 |  18 +-
 tools/include/uapi/linux/netdev.h       |  22 ++
 20 files changed, 830 insertions(+), 113 deletions(-)

-- 
2.43.0



* [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 11:19   ` Nikolay Aleksandrov
  2025-10-24  2:12   ` Jakub Kicinski
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

Add a ynl netdev family operation called bind-queue that creates a new
rx queue in a virtual netdev (e.g. netkit or veth) and binds it to an rx
queue in a real netdev. This forms a queue pair, where the peer queue of
the pair in the virtual netdev acts as a proxy for the peer queue in the
real netdev. Thus, the peer queue in the virtual netdev can be used by
processes running in a container to use both memory providers (io_uring
zero-copy rx and devmem) and AF_XDP. An early implementation had only
driver-specific integration [0], but in order for other virtual devices
to reuse it, it makes sense to have this as a generic API.

src-ifindex and src-queue-id are the real netdev and its rx queue id
respectively. dst-ifindex is the virtual netdev. Note that this op doesn't
take dst-queue-id because a new rx queue is created. The virtual netdev
must have real_num_rx_queues less than num_rx_queues at the time of
calling bind-queue. The queue-type must be rx as only rx queues are
supported for now.
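
For completeness, a hypothetical user-space sketch with libynl follows.
The series does not include regenerated C user code, so the function
names below are assumptions that merely follow the ynl-gen-c naming
conventions for do-operations (includes and error handling omitted):

  struct netdev_bind_queue_req *req;
  struct netdev_bind_queue_rsp *rsp;
  struct ynl_sock *ys;

  ys = ynl_sock_create(&ynl_netdev_family, NULL);
  req = netdev_bind_queue_req_alloc();
  netdev_bind_queue_req_set_queue_type(req, NETDEV_QUEUE_TYPE_RX);
  netdev_bind_queue_req_set_src_ifindex(req, 4);   /* physical netdev */
  netdev_bind_queue_req_set_src_queue_id(req, 15);
  netdev_bind_queue_req_set_dst_ifindex(req, 8);   /* virtual netdev */
  rsp = netdev_bind_queue(ys, req);                /* replies dst-queue-id */
  if (rsp)
          printf("dst-queue-id: %u\n", rsp->dst_queue_id);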

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
---
 Documentation/netlink/specs/netdev.yaml | 60 +++++++++++++++++++++++++
 include/uapi/linux/netdev.h             | 12 +++++
 net/core/netdev-genl-gen.c              | 25 +++++++++++
 net/core/netdev-genl-gen.h              |  1 +
 net/core/netdev-genl.c                  |  5 +++
 tools/include/uapi/linux/netdev.h       | 12 +++++
 6 files changed, 115 insertions(+)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e00d3fa1c152..20bb00b7e9ac 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -561,6 +561,46 @@ attribute-sets:
         type: u32
         checks:
           min: 1
+  -
+    name: queue-pair
+    attributes:
+      -
+        name: queue-type
+        doc: |
+          Queue type, rx or tx, for src-queue-id and dst-queue-id.
+          Currently only pairing queues of type rx is supported.
+        type: u32
+        enum: queue-type
+      -
+        name: src-ifindex
+        doc: |
+          Specifies the netdev ifindex of the physical device to pair
+          src-queue-id from.
+        type: u32
+        checks:
+          min: 1
+          max: s32-max
+      -
+        name: src-queue-id
+        doc: |
+          Specifies the netdev queue id of the physical device with
+          src-ifindex to pair a queue from.
+        type: u32
+      -
+        name: dst-ifindex
+        doc: |
+          Specifies the netdev ifindex of the virtual device to pair
+          a new queue with the src-queue-id from src-ifindex.
+        type: u32
+        checks:
+          min: 1
+          max: s32-max
+      -
+        name: dst-queue-id
+        doc: |
+          Specifies the new netdev queue id of the virtual device after
+          a successful pairing operation.
+        type: u32
 
 operations:
   list:
@@ -772,6 +812,26 @@ operations:
           attributes:
             - id
 
+    -
+      name: bind-queue
+      doc: |
+        Bind a physical netdevice queue to a virtual one. The binding
+        creates a queue pair, where a queue can reference its peer queue.
+        This is useful for memory providers and AF_XDP operations which
+        take an ifindex and queue id to allow such applications to bind
+        against virtual devices in containers.
+      attribute-set: queue-pair
+      do:
+        request:
+          attributes:
+            - queue-type
+            - src-ifindex
+            - src-queue-id
+            - dst-ifindex
+        reply:
+          attributes:
+            - dst-queue-id
+
 kernel-family:
   headers: ["net/netdev_netlink.h"]
   sock-priv: struct netdev_nl_sock
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 48eb49aa03d4..4ef04d0bc412 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -210,6 +210,17 @@ enum {
 	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
 };
 
+enum {
+	NETDEV_A_QUEUE_PAIR_QUEUE_TYPE = 1,
+	NETDEV_A_QUEUE_PAIR_SRC_IFINDEX,
+	NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
+	NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+	NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
+
+	__NETDEV_A_QUEUE_PAIR_MAX,
+	NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -226,6 +237,7 @@ enum {
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
 	NETDEV_CMD_BIND_TX,
+	NETDEV_CMD_BIND_QUEUE,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index e9a2a6f26cb7..69f8126c3e42 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -26,6 +26,16 @@ static const struct netlink_range_validation netdev_a_napi_defer_hard_irqs_range
 	.max	= S32_MAX,
 };
 
+static const struct netlink_range_validation netdev_a_queue_pair_src_ifindex_range = {
+	.min	= 1ULL,
+	.max	= S32_MAX,
+};
+
+static const struct netlink_range_validation netdev_a_queue_pair_dst_ifindex_range = {
+	.min	= 1ULL,
+	.max	= S32_MAX,
+};
+
 /* Common nested types */
 const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1] = {
 	[NETDEV_A_PAGE_POOL_ID] = NLA_POLICY_FULL_RANGE(NLA_UINT, &netdev_a_page_pool_id_range),
@@ -106,6 +116,14 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
 	[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
 };
 
+/* NETDEV_CMD_BIND_QUEUE - do */
+static const struct nla_policy netdev_bind_queue_nl_policy[NETDEV_A_QUEUE_PAIR_DST_IFINDEX + 1] = {
+	[NETDEV_A_QUEUE_PAIR_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
+	[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_queue_pair_src_ifindex_range),
+	[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID] = { .type = NLA_U32, },
+	[NETDEV_A_QUEUE_PAIR_DST_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_queue_pair_dst_ifindex_range),
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -204,6 +222,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_DMABUF_FD,
 		.flags		= GENL_CMD_CAP_DO,
 	},
+	{
+		.cmd		= NETDEV_CMD_BIND_QUEUE,
+		.doit		= netdev_nl_bind_queue_doit,
+		.policy		= netdev_bind_queue_nl_policy,
+		.maxattr	= NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+		.flags		= GENL_CMD_CAP_DO,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index cf3fad74511f..309248fe2b9e 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
 int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 470fabbeacd9..ce1018ea390f 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1120,6 +1120,11 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 	return err;
 }
 
+int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
+{
+	return -EOPNOTSUPP;
+}
+
 void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
 {
 	INIT_LIST_HEAD(&priv->bindings);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 48eb49aa03d4..4ef04d0bc412 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -210,6 +210,17 @@ enum {
 	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
 };
 
+enum {
+	NETDEV_A_QUEUE_PAIR_QUEUE_TYPE = 1,
+	NETDEV_A_QUEUE_PAIR_SRC_IFINDEX,
+	NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
+	NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+	NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
+
+	__NETDEV_A_QUEUE_PAIR_MAX,
+	NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -226,6 +237,7 @@ enum {
 	NETDEV_CMD_BIND_RX,
 	NETDEV_CMD_NAPI_SET,
 	NETDEV_CMD_BIND_TX,
+	NETDEV_CMD_BIND_QUEUE,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.43.0



* [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 11:17   ` Nikolay Aleksandrov
                     ` (4 more replies)
  2025-10-20 16:23 ` [PATCH net-next v3 03/15] net: Add peer info to queue-get response Daniel Borkmann
                   ` (12 subsequent siblings)
  14 siblings, 5 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

Implement netdev_nl_bind_queue_doit() that creates an rx queue in a
virtual netdev and then binds it to an rxq in a real netdev to create
a queue pair.

Example with ynl client:

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do bind-queue \
      --json '{"src-ifindex": 4, "src-queue-id": 15, "dst-ifindex": 8, "queue-type": "rx"}'
  {'dst-queue-id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/net/netdev_queues.h   |   5 ++
 include/net/netdev_rx_queue.h |  36 ++++++++-
 net/core/netdev-genl.c        | 141 +++++++++++++++++++++++++++++++++-
 net/core/netdev_rx_queue.c    |  61 +++++++++++++++
 4 files changed, 240 insertions(+), 3 deletions(-)

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index cd00e0406cf4..286d5edce07d 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -130,6 +130,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
  * @ndo_queue_get_dma_dev: Get dma device for zero-copy operations to be used
  *			   for this queue. Return NULL on error.
  *
+ * @ndo_queue_create: Create a new RX queue which can be bound to another queue.
+ *		      Ops on this queue are redirected to the peer queue e.g.
+ *		      when opening a memory provider.
+ *
  * Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
  * the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
  * be called for an interface which is open.
@@ -149,6 +153,7 @@ struct netdev_queue_mgmt_ops {
 						  int idx);
 	struct device *		(*ndo_queue_get_dma_dev)(struct net_device *dev,
 							 int idx);
+	int			(*ndo_queue_create)(struct net_device *dev);
 };
 
 bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..db3ef94c0744 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -28,6 +28,7 @@ struct netdev_rx_queue {
 #endif
 	struct napi_struct		*napi;
 	struct pp_memory_provider_params mp_params;
+	struct netdev_rx_queue		*peer;
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -56,6 +57,37 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
 	return index;
 }
 
-int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+static inline void __netdev_rx_queue_peer(struct netdev_rx_queue *src_rxq,
+					  struct netdev_rx_queue *dst_rxq)
+{
+	src_rxq->peer = dst_rxq;
+	dst_rxq->peer = src_rxq;
+}
 
-#endif
+static inline void __netdev_rx_queue_unpeer(struct netdev_rx_queue *src_rxq,
+					    struct netdev_rx_queue *dst_rxq)
+{
+	src_rxq->peer = NULL;
+	dst_rxq->peer = NULL;
+}
+
+static inline bool netdev_rx_queue_peered(struct net_device *dev,
+					  u16 queue_id)
+{
+	if (queue_id < dev->real_num_rx_queues)
+		return dev->_rx[queue_id].peer;
+	return false;
+}
+
+void netdev_rx_queue_peer(struct net_device *src_dev,
+			  struct netdev_rx_queue *src_rxq,
+			  struct netdev_rx_queue *dst_rxq);
+void netdev_rx_queue_unpeer(struct net_device *src_dev,
+			    struct netdev_rx_queue *src_rxq,
+			    struct netdev_rx_queue *dst_rxq);
+int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+struct netdev_rx_queue *
+netif_get_rx_queue_peer_locked(struct net_device **dev,
+			       unsigned int *rxq_idx,
+			       bool *needs_unlock);
+#endif /* _LINUX_NETDEV_RX_QUEUE_H */
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index ce1018ea390f..579469abac8c 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1122,7 +1122,146 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
 
 int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
 {
-	return -EOPNOTSUPP;
+	u32 src_ifidx, src_qid, dst_ifidx, dst_qid, q_type;
+	struct netdev_rx_queue *src_rxq, *dst_rxq, *tmp_rxq;
+	struct net_device *src_dev, *dst_dev;
+	struct sk_buff *rsp;
+	int err = 0;
+	void *hdr;
+
+	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_QUEUE_TYPE) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_IFINDEX) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID) ||
+	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_DST_IFINDEX))
+		return -EINVAL;
+
+	src_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX]);
+	src_qid = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID]);
+	dst_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_DST_IFINDEX]);
+	q_type = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_QUEUE_TYPE]);
+
+	if (q_type != NETDEV_QUEUE_TYPE_RX) {
+		NL_SET_ERR_MSG(info->extack, "Only binding of RX queue supported");
+		return -EOPNOTSUPP;
+	}
+	if (dst_ifidx == src_ifidx) {
+		NL_SET_ERR_MSG(info->extack,
+			       "Destination device cannot be the same as source device");
+		return -EOPNOTSUPP;
+	}
+
+	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!rsp)
+		return -ENOMEM;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_genlmsg_free;
+	}
+
+	/* Locking order is always from the virtual to the physical device
+	 * since this is also the same order when applications open the
+	 * memory provider later on.
+	 */
+	dst_dev = netdev_get_by_index_lock(genl_info_net(info), dst_ifidx);
+	if (!dst_dev) {
+		err = -ENODEV;
+		goto err_genlmsg_free;
+	}
+	if (dst_dev->dev.parent) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Destination device is not a virtual device");
+		goto err_unlock_dst_dev;
+	}
+	if (!dst_dev->queue_mgmt_ops ||
+	    !dst_dev->queue_mgmt_ops->ndo_queue_create) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Destination driver does not support queue management operations");
+		goto err_unlock_dst_dev;
+	}
+	if (dst_dev->real_num_rx_queues < 1) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Destination device must have at least one real RX queue");
+		goto err_unlock_dst_dev;
+	}
+
+	src_dev = netdev_get_by_index_lock(genl_info_net(info), src_ifidx);
+	if (!src_dev) {
+		err = -ENODEV;
+		goto err_unlock_dst_dev;
+	}
+	if (!src_dev->dev.parent) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Source device is a virtual device");
+		goto err_unlock_src_dev;
+	}
+	if (!netif_device_present(src_dev)) {
+		err = -ENODEV;
+		NL_SET_ERR_MSG(info->extack,
+			       "Source device has been removed from the system");
+		goto err_unlock_src_dev;
+	}
+	if (!src_dev->queue_mgmt_ops) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Source driver does not support queue management operations");
+		goto err_unlock_src_dev;
+	}
+	if (src_qid >= src_dev->num_rx_queues) {
+		err = -ERANGE;
+		NL_SET_ERR_MSG(info->extack,
+			       "Source device queue is out of range");
+		goto err_unlock_src_dev;
+	}
+
+	src_rxq = __netif_get_rx_queue(src_dev, src_qid);
+	if (src_rxq->peer) {
+		err = -EBUSY;
+		NL_SET_ERR_MSG(info->extack,
+			       "Source device queue is already bound");
+		goto err_unlock_src_dev;
+	}
+
+	tmp_rxq = __netif_get_rx_queue(dst_dev, dst_dev->real_num_rx_queues - 1);
+	if (tmp_rxq->peer && tmp_rxq->peer->dev != src_dev) {
+		err = -EOPNOTSUPP;
+		NL_SET_ERR_MSG(info->extack,
+			       "Binding multiple queues from different source devices not supported");
+		goto err_unlock_src_dev;
+	}
+
+	err = dst_dev->queue_mgmt_ops->ndo_queue_create(dst_dev);
+	if (err <= 0) {
+		NL_SET_ERR_MSG(info->extack,
+			       "Destination device is unable to create a new queue");
+		goto err_unlock_src_dev;
+	}
+
+	dst_qid = err - 1;
+	dst_rxq = __netif_get_rx_queue(dst_dev, dst_qid);
+
+	netdev_rx_queue_peer(src_dev, src_rxq, dst_rxq);
+
+	nla_put_u32(rsp, NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID, dst_qid);
+	genlmsg_end(rsp, hdr);
+
+	netdev_unlock(src_dev);
+	netdev_unlock(dst_dev);
+
+	return genlmsg_reply(rsp, info);
+
+err_unlock_src_dev:
+	netdev_unlock(src_dev);
+err_unlock_dst_dev:
+	netdev_unlock(dst_dev);
+err_genlmsg_free:
+	nlmsg_free(rsp);
+	return err;
 }
 
 void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index c7d9341b7630..916ca8d7ae7c 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -18,6 +18,67 @@ bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx)
 }
 EXPORT_SYMBOL(netif_rxq_has_unreadable_mp);
 
+void netdev_rx_queue_peer(struct net_device *src_dev,
+			  struct netdev_rx_queue *src_rxq,
+			  struct netdev_rx_queue *dst_rxq)
+{
+	netdev_assert_locked(src_dev);
+	netdev_assert_locked(dst_rxq->dev);
+
+	netdev_hold(src_dev, &src_rxq->dev_tracker, GFP_KERNEL);
+	__netdev_rx_queue_peer(src_rxq, dst_rxq);
+}
+
+void netdev_rx_queue_unpeer(struct net_device *src_dev,
+			    struct netdev_rx_queue *src_rxq,
+			    struct netdev_rx_queue *dst_rxq)
+{
+	WARN_ON_ONCE(READ_ONCE(dst_rxq->dev->reg_state) != NETREG_UNREGISTERING);
+
+	netdev_assert_locked(dst_rxq->dev);
+	netdev_assert_locked(src_dev);
+
+	__netdev_rx_queue_unpeer(src_rxq, dst_rxq);
+	netdev_put(src_dev, &src_rxq->dev_tracker);
+}
+
+static struct netdev_rx_queue *
+__netif_get_rx_queue_peer(struct net_device **dev, unsigned int *rxq_idx,
+			  bool virt_to_phys_only)
+{
+	struct net_device *req_dev = *dev;
+	struct netdev_rx_queue *rxq = __netif_get_rx_queue(req_dev, *rxq_idx);
+
+	if (rxq->peer) {
+		if (virt_to_phys_only &&
+		    req_dev->dev.parent)
+			return NULL;
+		rxq = rxq->peer;
+		*rxq_idx = get_netdev_rx_queue_index(rxq);
+		*dev = rxq->dev;
+	}
+	return rxq;
+}
+
+struct netdev_rx_queue *
+netif_get_rx_queue_peer_locked(struct net_device **dev, unsigned int *rxq_idx,
+			       bool *needs_unlock)
+{
+	struct net_device *req_dev = *dev;
+	struct netdev_rx_queue *rxq;
+
+	/* Locking order is always from the virtual to the physical device
+	 * see netdev_nl_bind_queue_doit().
+	 */
+	netdev_ops_assert_locked(req_dev);
+	rxq = __netif_get_rx_queue_peer(dev, rxq_idx, true);
+	if (rxq && req_dev != *dev) {
+		*needs_unlock = true;
+		netdev_lock(*dev);
+	}
+	return rxq;
+}
+
 int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 {
 	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
-- 
2.43.0



* [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 11:23   ` Nikolay Aleksandrov
  2025-10-24  2:33   ` Jakub Kicinski
  2025-10-20 16:23 ` [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized Daniel Borkmann
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

Add a nested peer field to the queue-get response that returns the peered
ifindex and queue id.

Example with ynl client:

  # ip netns exec foo ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 3, "id": 1, "type": "rx"}'
  {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}

Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
lock. For the queue-get we do not lock both devices. When queues get
{un,}peered, both devices are locked, thus if netdev_rx_queue_peered()
returns true, the peer pointer points to a valid device. The netns-id
is fetched via peernet2id_alloc(), similarly to what is done in OVS.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 Documentation/netlink/specs/netdev.yaml | 24 ++++++++++++++++++
 include/net/netdev_rx_queue.h           |  3 +++
 include/uapi/linux/netdev.h             | 10 ++++++++
 net/core/netdev-genl.c                  | 33 +++++++++++++++++++++++--
 net/core/netdev_rx_queue.c              |  8 ++++++
 tools/include/uapi/linux/netdev.h       | 10 ++++++++
 6 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 20bb00b7e9ac..a3c562dfd205 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -297,6 +297,24 @@ attribute-sets:
   -
     name: xsk-info
     attributes: []
+  -
+    name: peer-info
+    attributes:
+      -
+        name: id
+        doc: Queue index of the netdevice to which the peer queue belongs.
+        type: u32
+      -
+        name: ifindex
+        doc: ifindex of the netdevice to which the peer queue belongs.
+        type: u32
+      -
+        name: netns-id
+        doc: |
+          Network namespace of the netdevice to which the peer queue belongs.
+          This is populated if the netdevices are not in the same network
+          namespace.
+        type: s32
   -
     name: queue
     attributes:
@@ -338,6 +356,11 @@ attribute-sets:
         doc: XSK information for this queue, if any.
         type: nest
         nested-attributes: xsk-info
+      -
+        name: peer
+        doc: Peer queue information, present if this queue is bound to another queue.
+        type: nest
+        nested-attributes: peer-info
   -
     name: qstats
     doc: |
@@ -723,6 +746,7 @@ operations:
             - dmabuf
             - io-uring
             - xsk
+            - peer
       dump:
         request:
           attributes:
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index db3ef94c0744..ea23cca947bb 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -90,4 +90,7 @@ struct netdev_rx_queue *
 netif_get_rx_queue_peer_locked(struct net_device **dev,
 			       unsigned int *rxq_idx,
 			       bool *needs_unlock);
+struct netdev_rx_queue *
+netif_get_rx_queue_peer_any(struct net_device **dev,
+			    unsigned int *rxq_idx);
 #endif /* _LINUX_NETDEV_RX_QUEUE_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 4ef04d0bc412..d4d5d9f86eee 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -150,6 +150,15 @@ enum {
 	NETDEV_A_XSK_INFO_MAX = (__NETDEV_A_XSK_INFO_MAX - 1)
 };
 
+enum {
+	NETDEV_A_PEER_INFO_ID = 1,
+	NETDEV_A_PEER_INFO_IFINDEX,
+	NETDEV_A_PEER_INFO_NETNS_ID,
+
+	__NETDEV_A_PEER_INFO_MAX,
+	NETDEV_A_PEER_INFO_MAX = (__NETDEV_A_PEER_INFO_MAX - 1)
+};
+
 enum {
 	NETDEV_A_QUEUE_ID = 1,
 	NETDEV_A_QUEUE_IFINDEX,
@@ -158,6 +167,7 @@ enum {
 	NETDEV_A_QUEUE_DMABUF,
 	NETDEV_A_QUEUE_IO_URING,
 	NETDEV_A_QUEUE_XSK,
+	NETDEV_A_QUEUE_PEER,
 
 	__NETDEV_A_QUEUE_MAX,
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 579469abac8c..28658b5cd7a4 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -393,6 +393,7 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 	struct pp_memory_provider_params *params;
 	struct netdev_rx_queue *rxq;
 	struct netdev_queue *txq;
+	struct nlattr *nest;
 	void *hdr;
 
 	hdr = genlmsg_iput(rsp, info);
@@ -410,6 +411,34 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 		if (nla_put_napi_id(rsp, rxq->napi))
 			goto nla_put_failure;
 
+		if (netdev_rx_queue_peered(netdev, q_idx)) {
+			struct net_device *p_netdev = netdev;
+			struct net *net, *p_net;
+			u32 p_q_idx = q_idx;
+
+			nest = nla_nest_start(rsp, NETDEV_A_QUEUE_PEER);
+			if (!nest)
+				goto nla_put_failure;
+
+			netif_get_rx_queue_peer_any(&p_netdev, &p_q_idx);
+			if (nla_put_u32(rsp, NETDEV_A_PEER_INFO_ID, p_q_idx) ||
+			    nla_put_u32(rsp, NETDEV_A_PEER_INFO_IFINDEX,
+					READ_ONCE(p_netdev->ifindex)))
+				goto nla_put_failure;
+
+			rcu_read_lock();
+			p_net = dev_net_rcu(p_netdev);
+			net = dev_net_rcu(netdev);
+			if (!net_eq(net, p_net)) {
+				s32 id = peernet2id_alloc(net, p_net, GFP_ATOMIC);
+
+				if (nla_put_s32(rsp, NETDEV_A_PEER_INFO_NETNS_ID, id))
+					goto nla_put_failure_unlock;
+			}
+			rcu_read_unlock();
+			nla_nest_end(rsp, nest);
+		}
+
 		params = &rxq->mp_params;
 		if (params->mp_ops &&
 		    params->mp_ops->nl_fill(params->mp_priv, rsp, rxq))
@@ -419,7 +448,6 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 			if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK))
 				goto nla_put_failure;
 #endif
-
 		break;
 	case NETDEV_QUEUE_TYPE_TX:
 		txq = netdev_get_tx_queue(netdev, q_idx);
@@ -434,9 +462,10 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
 	}
 
 	genlmsg_end(rsp, hdr);
-
 	return 0;
 
+nla_put_failure_unlock:
+	rcu_read_unlock();
 nla_put_failure:
 	genlmsg_cancel(rsp, hdr);
 	return -EMSGSIZE;
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 916ca8d7ae7c..8ee289316c06 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -79,6 +79,14 @@ netif_get_rx_queue_peer_locked(struct net_device **dev, unsigned int *rxq_idx,
 	return rxq;
 }
 
+struct netdev_rx_queue *
+netif_get_rx_queue_peer_any(struct net_device **dev, unsigned int *rxq_idx)
+{
+	netdev_ops_assert_locked(*dev);
+	/* Retrieves both virt-to-phys and phys-to-virt peering. */
+	return __netif_get_rx_queue_peer(dev, rxq_idx, false);
+}
+
 int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
 {
 	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 4ef04d0bc412..d4d5d9f86eee 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -150,6 +150,15 @@ enum {
 	NETDEV_A_XSK_INFO_MAX = (__NETDEV_A_XSK_INFO_MAX - 1)
 };
 
+enum {
+	NETDEV_A_PEER_INFO_ID = 1,
+	NETDEV_A_PEER_INFO_IFINDEX,
+	NETDEV_A_PEER_INFO_NETNS_ID,
+
+	__NETDEV_A_PEER_INFO_MAX,
+	NETDEV_A_PEER_INFO_MAX = (__NETDEV_A_PEER_INFO_MAX - 1)
+};
+
 enum {
 	NETDEV_A_QUEUE_ID = 1,
 	NETDEV_A_QUEUE_IFINDEX,
@@ -158,6 +167,7 @@ enum {
 	NETDEV_A_QUEUE_DMABUF,
 	NETDEV_A_QUEUE_IO_URING,
 	NETDEV_A_QUEUE_XSK,
+	NETDEV_A_QUEUE_PEER,
 
 	__NETDEV_A_QUEUE_MAX,
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
-- 
2.43.0



* [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (2 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 03/15] net: Add peer info to queue-get response Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 11:25   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Similar to AF_XDP, do not allow queues in a physical netdev to be
resized by ethtool -L when they are peered.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 include/linux/ethtool.h |  1 +
 net/ethtool/channels.c  | 12 ++++++------
 net/ethtool/common.c    | 10 +++++++++-
 net/ethtool/ioctl.c     |  4 ++--
 4 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index c2d8b4ec62eb..151fc920234d 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -1481,4 +1481,5 @@ struct ethtool_forced_speed_map {
 
 void
 ethtool_forced_speed_maps_init(struct ethtool_forced_speed_map *maps, u32 size);
+bool ethtool_channel_busy(struct net_device *dev, u32 channel);
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/ethtool/channels.c b/net/ethtool/channels.c
index ca4f80282448..b3de8064275c 100644
--- a/net/ethtool/channels.c
+++ b/net/ethtool/channels.c
@@ -1,7 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
-#include <net/xdp_sock_drv.h>
-
 #include "netlink.h"
 #include "common.h"
 
@@ -169,14 +167,16 @@ ethnl_set_channels(struct ethnl_req_info *req_info, struct genl_info *info)
 	if (ret)
 		return ret;
 
-	/* Disabling channels, query zero-copy AF_XDP sockets */
+	/* ensure channels are not busy at the moment */
 	from_channel = channels.combined_count +
 		       min(channels.rx_count, channels.tx_count);
-	for (i = from_channel; i < old_total; i++)
-		if (xsk_get_pool_from_qid(dev, i)) {
-			GENL_SET_ERR_MSG(info, "requested channel counts are too low for existing zerocopy AF_XDP sockets");
+	for (i = from_channel; i < old_total; i++) {
+		if (ethtool_channel_busy(dev, i)) {
+			GENL_SET_ERR_MSG(info,
+					 "requested channel counts are too low due to busy queues (AF_XDP or queue peering)");
 			return -EINVAL;
 		}
+	}
 
 	ret = dev->ethtool_ops->set_channels(dev, &channels);
 	return ret < 0 ? ret : 1;
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 55223ebc2a7e..a67382c2208b 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -6,13 +6,15 @@
 #include <linux/rtnetlink.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/phy_link_topology.h>
+
 #include <net/netdev_queues.h>
+#include <net/netdev_rx_queue.h>
+#include <net/xdp_sock_drv.h>
 
 #include "netlink.h"
 #include "common.h"
 #include "../core/dev.h"
 
-
 const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
 	[NETIF_F_SG_BIT] =               "tx-scatter-gather",
 	[NETIF_F_IP_CSUM_BIT] =          "tx-checksum-ipv4",
@@ -1101,6 +1103,12 @@ EXPORT_SYMBOL(ethtool_get_ts_info_by_layer);
 
 const struct ethtool_phy_ops *ethtool_phy_ops;
 
+bool ethtool_channel_busy(struct net_device *dev, u32 channel)
+{
+	return netdev_rx_queue_peered(dev, channel) ||
+	       xsk_get_pool_from_qid(dev, channel);
+}
+
 void ethtool_set_ethtool_phy_ops(const struct ethtool_phy_ops *ops)
 {
 	ASSERT_RTNL();
diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index fa83ddade4f8..9ed87a18e48a 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -2282,12 +2282,12 @@ static noinline_for_stack int ethtool_set_channels(struct net_device *dev,
 	if (ret)
 		return ret;
 
-	/* Disabling channels, query zero-copy AF_XDP sockets */
+	/* Disabling channels, query busy queues (AF_XDP, queue peering) */
 	from_channel = channels.combined_count +
 		min(channels.rx_count, channels.tx_count);
 	to_channel = curr.combined_count + max(curr.rx_count, curr.tx_count);
 	for (i = from_channel; i < to_channel; i++)
-		if (xsk_get_pool_from_qid(dev, i))
+		if (ethtool_channel_busy(dev, i))
 			return -EINVAL;
 
 	ret = dev->ethtool_ops->set_channels(dev, &channels);
-- 
2.43.0



* [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (3 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 12:50   ` Nikolay Aleksandrov
  2025-10-24 18:36   ` Stanislav Fomichev
  2025-10-20 16:23 ` [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header Daniel Borkmann
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

When a process in a container wants to set up a memory provider, it will
use the virtual netdev and a mapped rxq, and call net_mp_{open,close}_rxq
to try to restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA-mapped memory from a process within a container.
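
As a rough container-side sketch of what triggers this path, based on
the io_uring zero-copy rx UAPI (the field values are placeholders; ring,
area and region setup are omitted, see the upstream io_uring_zcrx_ifq_reg
definition for the full layout):

  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx     = 8,        /* virtual netdev (netkit) in container */
          .if_rxq     = 1,        /* mapped rxq created via bind-queue */
          .rq_entries = 4096,
          /* .area_ptr, .region_ptr, .offsets setup omitted */
  };

  /* syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_ZCRX_IFQ,
   * &reg, 1) ends up in __net_mp_open_rxq(), which now resolves the
   * peered queue and restarts the real rxq in the physical netdev.
   */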

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/net/page_pool/memory_provider.h |  4 +-
 net/core/netdev_rx_queue.c              | 57 +++++++++++++++++--------
 2 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index ada4f968960a..b6f811c3416b 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -23,12 +23,12 @@ bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
 void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
 void net_mp_niov_clear_page_pool(struct net_iov *niov);
 
-int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
+int net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 		    struct pp_memory_provider_params *p);
 int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 		      const struct pp_memory_provider_params *p,
 		      struct netlink_ext_ack *extack);
-void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
+void net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
 		      struct pp_memory_provider_params *old_p);
 void __net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
 			const struct pp_memory_provider_params *old_p);
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 8ee289316c06..b4ff3497e086 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -170,48 +170,63 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 		      struct netlink_ext_ack *extack)
 {
 	struct netdev_rx_queue *rxq;
+	bool needs_unlock = false;
 	int ret;
 
 	if (!netdev_need_ops_lock(dev))
 		return -EOPNOTSUPP;
-
 	if (rxq_idx >= dev->real_num_rx_queues) {
 		NL_SET_ERR_MSG(extack, "rx queue index out of range");
 		return -ERANGE;
 	}
-	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
 
+	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
+	rxq = netif_get_rx_queue_peer_locked(&dev, &rxq_idx, &needs_unlock);
+	if (!rxq) {
+		NL_SET_ERR_MSG(extack, "rx queue peered to a virtual netdev");
+		return -EBUSY;
+	}
+	if (!dev->dev.parent) {
+		NL_SET_ERR_MSG(extack, "rx queue is mapped to a virtual netdev");
+		ret = -EBUSY;
+		goto out;
+	}
 	if (dev->cfg->hds_config != ETHTOOL_TCP_DATA_SPLIT_ENABLED) {
 		NL_SET_ERR_MSG(extack, "tcp-data-split is disabled");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 	if (dev->cfg->hds_thresh) {
 		NL_SET_ERR_MSG(extack, "hds-thresh is not zero");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 	if (dev_xdp_prog_count(dev)) {
 		NL_SET_ERR_MSG(extack, "unable to custom memory provider to device with XDP program attached");
-		return -EEXIST;
+		ret = -EEXIST;
+		goto out;
 	}
-
-	rxq = __netif_get_rx_queue(dev, rxq_idx);
 	if (rxq->mp_params.mp_ops) {
 		NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
-		return -EEXIST;
+		ret = -EEXIST;
+		goto out;
 	}
 #ifdef CONFIG_XDP_SOCKETS
 	if (rxq->pool) {
 		NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
-		return -EBUSY;
+		ret = -EBUSY;
+		goto out;
 	}
 #endif
-
 	rxq->mp_params = *p;
 	ret = netdev_rx_queue_restart(dev, rxq_idx);
 	if (ret) {
 		rxq->mp_params.mp_ops = NULL;
 		rxq->mp_params.mp_priv = NULL;
 	}
+out:
+	if (needs_unlock)
+		netdev_unlock(dev);
 	return ret;
 }
 
@@ -226,38 +241,44 @@ int net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
 	return ret;
 }
 
-void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
+void __net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
 			const struct pp_memory_provider_params *old_p)
 {
 	struct netdev_rx_queue *rxq;
+	bool needs_unlock = false;
 	int err;
 
-	if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
+	if (WARN_ON_ONCE(rxq_idx >= dev->real_num_rx_queues))
 		return;
 
-	rxq = __netif_get_rx_queue(dev, ifq_idx);
+	rxq = netif_get_rx_queue_peer_locked(&dev, &rxq_idx, &needs_unlock);
+	if (WARN_ON_ONCE(!rxq))
+		return;
 
 	/* Callers holding a netdev ref may get here after we already
 	 * went thru shutdown via dev_memory_provider_uninstall().
 	 */
 	if (dev->reg_state > NETREG_REGISTERED &&
 	    !rxq->mp_params.mp_ops)
-		return;
+		goto out;
 
 	if (WARN_ON_ONCE(rxq->mp_params.mp_ops != old_p->mp_ops ||
 			 rxq->mp_params.mp_priv != old_p->mp_priv))
-		return;
+		goto out;
 
 	rxq->mp_params.mp_ops = NULL;
 	rxq->mp_params.mp_priv = NULL;
-	err = netdev_rx_queue_restart(dev, ifq_idx);
+	err = netdev_rx_queue_restart(dev, rxq_idx);
 	WARN_ON(err && err != -ENETDOWN);
+out:
+	if (needs_unlock)
+		netdev_unlock(dev);
 }
 
-void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
+void net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
 		      struct pp_memory_provider_params *old_p)
 {
 	netdev_lock(dev);
-	__net_mp_close_rxq(dev, ifq_idx, old_p);
+	__net_mp_close_rxq(dev, rxq_idx, old_p);
 	netdev_unlock(dev);
 }
-- 
2.43.0



* [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (4 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 12:51   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 07/15] xsk: Move pool registration into single function Daniel Borkmann
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Move NETDEV_XDP_ACT_ZC into the xdp_sock_drv.h header such that external
code can reuse it, and rename it to the more generic NETDEV_XDP_ACT_XSK.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xdp_sock_drv.h | 4 ++++
 net/xdp/xsk_buff_pool.c    | 6 +-----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4f2d3268a676..242e34f771cc 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -12,6 +12,10 @@
 #define XDP_UMEM_MIN_CHUNK_SHIFT 11
 #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
 
+#define NETDEV_XDP_ACT_XSK	(NETDEV_XDP_ACT_BASIC |		\
+				 NETDEV_XDP_ACT_REDIRECT |	\
+				 NETDEV_XDP_ACT_XSK_ZEROCOPY)
+
 struct xsk_cb_desc {
 	void *src;
 	u8 off;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index aa9788f20d0d..26165baf99f4 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -158,10 +158,6 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 	}
 }
 
-#define NETDEV_XDP_ACT_ZC	(NETDEV_XDP_ACT_BASIC |		\
-				 NETDEV_XDP_ACT_REDIRECT |	\
-				 NETDEV_XDP_ACT_XSK_ZEROCOPY)
-
 int xp_assign_dev(struct xsk_buff_pool *pool,
 		  struct net_device *netdev, u16 queue_id, u16 flags)
 {
@@ -203,7 +199,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 		/* For copy-mode, we are done. */
 		return 0;
 
-	if ((netdev->xdp_features & NETDEV_XDP_ACT_ZC) != NETDEV_XDP_ACT_ZC) {
+	if ((netdev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK) {
 		err = -EOPNOTSUPP;
 		goto err_unreg_pool;
 	}
-- 
2.43.0



* [PATCH net-next v3 07/15] xsk: Move pool registration into single function
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (5 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 12:52   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable Daniel Borkmann
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Small refactor to move the pool registration into xsk_reg_pool_at_qid,
such that the netdev and queue_id can be registered there. No change
in functionality.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 net/xdp/xsk.c           | 5 +++++
 net/xdp/xsk_buff_pool.c | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7b0c68a70888..0e9a385f5680 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -141,6 +141,11 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
 			      dev->real_num_rx_queues,
 			      dev->real_num_tx_queues))
 		return -EINVAL;
+	if (xsk_get_pool_from_qid(dev, queue_id))
+		return -EBUSY;
+
+	pool->netdev = dev;
+	pool->queue_id = queue_id;
 
 	if (queue_id < dev->real_num_rx_queues)
 		dev->_rx[queue_id].pool = pool;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 26165baf99f4..62a176996f02 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -173,11 +173,6 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 	if (force_zc && force_copy)
 		return -EINVAL;
 
-	if (xsk_get_pool_from_qid(netdev, queue_id))
-		return -EBUSY;
-
-	pool->netdev = netdev;
-	pool->queue_id = queue_id;
 	err = xsk_reg_pool_at_qid(netdev, pool, queue_id);
 	if (err)
 		return err;
-- 
2.43.0



* [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (6 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 07/15] xsk: Move pool registration into single function Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 12:52   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 09/15] xsk: Change xsk_rcv_check to check netdev/queue_id from pool Daniel Borkmann
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Add another small helper called xp_pool_bindable and move the current
dev_get_min_mp_channel_count test into this helper. Pass in the pool
object, such that we derive the netdev from the previously registered pool.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 net/xdp/xsk_buff_pool.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 62a176996f02..701be6a5b074 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -54,6 +54,11 @@ int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 	return 0;
 }
 
+static bool xp_pool_bindable(struct xsk_buff_pool *pool)
+{
+	return dev_get_min_mp_channel_count(pool->netdev) == 0;
+}
+
 struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 						struct xdp_umem *umem)
 {
@@ -204,7 +209,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
 		goto err_unreg_pool;
 	}
 
-	if (dev_get_min_mp_channel_count(netdev)) {
+	if (!xp_pool_bindable(pool)) {
 		err = -EBUSY;
 		goto err_unreg_pool;
 	}
-- 
2.43.0



* [PATCH net-next v3 09/15] xsk: Change xsk_rcv_check to check netdev/queue_id from pool
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (7 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 10/15] xsk: Proxy pool management for mapped queues Daniel Borkmann
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Change the xsk_rcv_check test for inbound packets to use the bound
socket's xs->pool->netdev and xs->pool->queue_id rather than xs->dev and
xs->queue_id, since the latter could point to a virtual device with a
mapped rxq rather than to the pool's physical backing device.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 net/xdp/xsk.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0e9a385f5680..985e0cac965d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -340,15 +340,13 @@ static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
 	if (!xsk_is_bound(xs))
 		return -ENXIO;
-
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+	if (xs->pool->netdev   != xdp->rxq->dev ||
+	    xs->pool->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
-
 	if (len > xsk_pool_get_rx_frame_size(xs->pool) && !xs->sg) {
 		xs->rx_dropped++;
 		return -ENOSPC;
 	}
-
 	return 0;
 }
 
-- 
2.43.0



* [PATCH net-next v3 10/15] xsk: Proxy pool management for mapped queues
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (8 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 09/15] xsk: Change xsk_rcv_check to check netdev/queue_id from pool Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 11/15] netkit: Add single device mode for netkit Daniel Borkmann
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Similarly to what we do in net_mp_{open,close}_rxq for mapped queues,
also proxy xsk_{reg,clear}_pool_at_qid via __netif_get_rx_queue_peer
such that when a virtual netdev picks a mapped rxq, the request gets
through to the real rxq in the physical netdev.

Change the function signatures for queue_id to unsigned int in order
to pass the queue_id parameter into __netif_get_rx_queue_peer. The
proxying is only relevant for queue_id < dev->real_num_rx_queues since
right now it's only supported for rxqs.
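
For orientation, the rough call path for an AF_XDP bind against a mapped
rxq after this change (a summary of the code below, not new code):

  xsk_bind()
    xp_assign_dev()
      xsk_reg_pool_at_qid(dev = virtual netdev, pool, queue_id)
        netif_get_rx_queue_peer_locked(&dev, &queue_id, &needs_unlock)
          /* dev/queue_id now refer to the physical netdev's real rxq */
        pool->netdev = dev; pool->queue_id = queue_id;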

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 include/net/xdp_sock_drv.h |  4 ++--
 net/xdp/xsk.c              | 33 ++++++++++++++++++++++++++++-----
 net/xdp/xsk.h              |  5 ++---
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 242e34f771cc..25c37fab00bc 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -29,7 +29,7 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
 void xsk_tx_release(struct xsk_buff_pool *pool);
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
-					    u16 queue_id);
+					    unsigned int queue_id);
 void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool);
 void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool);
 void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool);
@@ -296,7 +296,7 @@ static inline void xsk_tx_release(struct xsk_buff_pool *pool)
 }
 
 static inline struct xsk_buff_pool *
-xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id)
+xsk_get_pool_from_qid(struct net_device *dev, unsigned int queue_id)
 {
 	return NULL;
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 985e0cac965d..9e55ea0f5fde 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -23,6 +23,8 @@
 #include <linux/netdevice.h>
 #include <linux/rculist.h>
 #include <linux/vmalloc.h>
+
+#include <net/netdev_queues.h>
 #include <net/xdp_sock_drv.h>
 #include <net/busy_poll.h>
 #include <net/netdev_lock.h>
@@ -111,7 +113,7 @@ bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool)
 EXPORT_SYMBOL(xsk_uses_need_wakeup);
 
 struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
-					    u16 queue_id)
+					    unsigned int queue_id)
 {
 	if (queue_id < dev->real_num_rx_queues)
 		return dev->_rx[queue_id].pool;
@@ -122,12 +124,19 @@ struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
 }
 EXPORT_SYMBOL(xsk_get_pool_from_qid);
 
-void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
+void xsk_clear_pool_at_qid(struct net_device *dev, unsigned int queue_id)
 {
+	bool needs_unlock = false;
+
+	if (queue_id < dev->real_num_rx_queues)
+		WARN_ON_ONCE(!netif_get_rx_queue_peer_locked(&dev, &queue_id,
+							     &needs_unlock));
 	if (queue_id < dev->num_rx_queues)
 		dev->_rx[queue_id].pool = NULL;
 	if (queue_id < dev->num_tx_queues)
 		dev->_tx[queue_id].pool = NULL;
+	if (needs_unlock)
+		netdev_unlock(dev);
 }
 
 /* The buffer pool is stored both in the _rx struct and the _tx struct as we do
@@ -135,14 +144,26 @@ void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
  * This might also change during run time.
  */
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
-			u16 queue_id)
+			unsigned int queue_id)
 {
+	bool needs_unlock = false;
+	int ret = 0;
+
 	if (queue_id >= max_t(unsigned int,
 			      dev->real_num_rx_queues,
 			      dev->real_num_tx_queues))
 		return -EINVAL;
 	if (xsk_get_pool_from_qid(dev, queue_id))
 		return -EBUSY;
+	if (queue_id < dev->real_num_rx_queues) {
+		if (!netif_get_rx_queue_peer_locked(&dev, &queue_id,
+						    &needs_unlock))
+			return -EBUSY;
+	}
+	if (xsk_get_pool_from_qid(dev, queue_id)) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	pool->netdev = dev;
 	pool->queue_id = queue_id;
@@ -151,8 +172,10 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
 		dev->_rx[queue_id].pool = pool;
 	if (queue_id < dev->real_num_tx_queues)
 		dev->_tx[queue_id].pool = pool;
-
-	return 0;
+out:
+	if (needs_unlock)
+		netdev_unlock(dev);
+	return ret;
 }
 
 static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff_xsk *xskb, u32 len,
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index a4bc4749faac..54d9a7736fd2 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,8 +41,7 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 			     struct xdp_sock __rcu **map_entry);
-void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
+void xsk_clear_pool_at_qid(struct net_device *dev, unsigned int queue_id);
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
-			u16 queue_id);
-
+			unsigned int queue_id);
 #endif /* XSK_H_ */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH net-next v3 11/15] netkit: Add single device mode for netkit
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (9 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 10/15] xsk: Proxy pool management for mapped queues Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 13:13   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros Daniel Borkmann
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Add a single device mode for netkit instead of netkit pairs. The primary
target for the paired devices is to connect network namespaces, of course,
and support has been implemented in projects like Cilium [0]. For the rxq
binding the plan is to support two main scenarios related to single device
mode:

* For the use-case of io_uring zero-copy, the control plane can either
  set up a netkit pair where the peer device can perform rxq binding which
  is then tied to the lifetime of the peer device, or the control plane
  can use a regular netkit pair to connect the hostns to a Pod/container
  and dynamically add/remove rxq bindings through a single device without
  having to interrupt the device pair. In the case of io_uring, the memory
  pool is used as skb non-linear pages, and thus the skb will go its way
  through the regular stack into netkit. Things like the netkit policy when
  no BPF is attached or skb scrubbing etc apply as-is in case the paired
  devices are used, or if the backend memory is tied to the single device
  and traffic goes through a paired device.

* For the use-case of AF_XDP, the control plane needs to use netkit in the
  single device mode. The single device mode currently enforces only a
  pass policy when no BPF is attached, and does not yet support BPF link
  attachments for AF_XDP. skbs sent to that device get dropped at the
  moment. Given AF_XDP operates at a lower layer of the stack tying this
  to the netkit pair did not make sense. In future, the plan is to allow
  BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
  application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
  to push selected egress traffic up to the single netkit device to the
  local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
  single netkit into the AF_XDP application (e.g. DHCP replies). Also,
  the control-plane can dynamically add/remove rxq bindings for the single
  netkit device without having to interrupt (e.g. down/up cycle) the main
  netkit pair for the Pod which has traffic going in and out.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
---
 drivers/net/netkit.c         | 108 ++++++++++++++++++++++-------------
 include/uapi/linux/if_link.h |   6 ++
 2 files changed, 74 insertions(+), 40 deletions(-)

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 492be60f2e70..e3a2445d83fc 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -25,6 +25,7 @@ struct netkit {
 
 	/* Needed in slow-path */
 	enum netkit_mode mode;
+	enum netkit_pairing pair;
 	bool primary;
 	u32 headroom;
 };
@@ -133,6 +134,10 @@ static int netkit_open(struct net_device *dev)
 	struct netkit *nk = netkit_priv(dev);
 	struct net_device *peer = rtnl_dereference(nk->peer);
 
+	if (nk->pair == NETKIT_DEVICE_SINGLE) {
+		netif_carrier_on(dev);
+		return 0;
+	}
 	if (!peer)
 		return -ENOTCONN;
 	if (peer->flags & IFF_UP) {
@@ -333,6 +338,7 @@ static int netkit_new_link(struct net_device *dev,
 	enum netkit_scrub scrub_prim = NETKIT_SCRUB_DEFAULT;
 	enum netkit_scrub scrub_peer = NETKIT_SCRUB_DEFAULT;
 	struct nlattr *peer_tb[IFLA_MAX + 1], **tbp, *attr;
+	enum netkit_pairing pair = NETKIT_DEVICE_PAIR;
 	enum netkit_action policy_prim = NETKIT_PASS;
 	enum netkit_action policy_peer = NETKIT_PASS;
 	struct nlattr **data = params->data;
@@ -341,7 +347,7 @@ static int netkit_new_link(struct net_device *dev,
 	struct nlattr **tb = params->tb;
 	u16 headroom = 0, tailroom = 0;
 	struct ifinfomsg *ifmp = NULL;
-	struct net_device *peer;
+	struct net_device *peer = NULL;
 	char ifname[IFNAMSIZ];
 	struct netkit *nk;
 	int err;
@@ -378,6 +384,8 @@ static int netkit_new_link(struct net_device *dev,
 			headroom = nla_get_u16(data[IFLA_NETKIT_HEADROOM]);
 		if (data[IFLA_NETKIT_TAILROOM])
 			tailroom = nla_get_u16(data[IFLA_NETKIT_TAILROOM]);
+		if (data[IFLA_NETKIT_PAIRING])
+			pair = nla_get_u32(data[IFLA_NETKIT_PAIRING]);
 	}
 
 	if (ifmp && tbp[IFLA_IFNAME]) {
@@ -390,45 +398,49 @@ static int netkit_new_link(struct net_device *dev,
 	if (mode != NETKIT_L2 &&
 	    (tb[IFLA_ADDRESS] || tbp[IFLA_ADDRESS]))
 		return -EOPNOTSUPP;
+	if (pair == NETKIT_DEVICE_SINGLE &&
+	    (tb != tbp ||
+	     tb[IFLA_NETKIT_PEER_POLICY] ||
+	     tb[IFLA_NETKIT_PEER_SCRUB] ||
+	     policy_prim != NETKIT_PASS))
+		return -EOPNOTSUPP;
 
-	peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
-				&netkit_link_ops, tbp, extack);
-	if (IS_ERR(peer))
-		return PTR_ERR(peer);
-
-	netif_inherit_tso_max(peer, dev);
-	if (headroom) {
-		peer->needed_headroom = headroom;
-		dev->needed_headroom = headroom;
-	}
-	if (tailroom) {
-		peer->needed_tailroom = tailroom;
-		dev->needed_tailroom = tailroom;
-	}
-
-	if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
-		eth_hw_addr_random(peer);
-	if (ifmp && dev->ifindex)
-		peer->ifindex = ifmp->ifi_index;
-
-	nk = netkit_priv(peer);
-	nk->primary = false;
-	nk->policy = policy_peer;
-	nk->scrub = scrub_peer;
-	nk->mode = mode;
-	nk->headroom = headroom;
-	bpf_mprog_bundle_init(&nk->bundle);
+	if (pair == NETKIT_DEVICE_PAIR) {
+		peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
+					&netkit_link_ops, tbp, extack);
+		if (IS_ERR(peer))
+			return PTR_ERR(peer);
+
+		netif_inherit_tso_max(peer, dev);
+		if (headroom)
+			peer->needed_headroom = headroom;
+		if (tailroom)
+			peer->needed_tailroom = tailroom;
+		if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
+			eth_hw_addr_random(peer);
+		if (ifmp && dev->ifindex)
+			peer->ifindex = ifmp->ifi_index;
 
-	err = register_netdevice(peer);
-	if (err < 0)
-		goto err_register_peer;
-	netif_carrier_off(peer);
-	if (mode == NETKIT_L2)
-		dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
+		nk = netkit_priv(peer);
+		nk->primary = false;
+		nk->policy = policy_peer;
+		nk->scrub = scrub_peer;
+		nk->mode = mode;
+		nk->pair = pair;
+		nk->headroom = headroom;
+		bpf_mprog_bundle_init(&nk->bundle);
+
+		err = register_netdevice(peer);
+		if (err < 0)
+			goto err_register_peer;
+		netif_carrier_off(peer);
+		if (mode == NETKIT_L2)
+			dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
 
-	err = rtnl_configure_link(peer, NULL, 0, NULL);
-	if (err < 0)
-		goto err_configure_peer;
+		err = rtnl_configure_link(peer, NULL, 0, NULL);
+		if (err < 0)
+			goto err_configure_peer;
+	}
 
 	if (mode == NETKIT_L2 && !tb[IFLA_ADDRESS])
 		eth_hw_addr_random(dev);
@@ -436,12 +448,17 @@ static int netkit_new_link(struct net_device *dev,
 		nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
 	else
 		strscpy(dev->name, "nk%d", IFNAMSIZ);
+	if (headroom)
+		dev->needed_headroom = headroom;
+	if (tailroom)
+		dev->needed_tailroom = tailroom;
 
 	nk = netkit_priv(dev);
 	nk->primary = true;
 	nk->policy = policy_prim;
 	nk->scrub = scrub_prim;
 	nk->mode = mode;
+	nk->pair = pair;
 	nk->headroom = headroom;
 	bpf_mprog_bundle_init(&nk->bundle);
 
@@ -453,10 +470,12 @@ static int netkit_new_link(struct net_device *dev,
 		dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
 
 	rcu_assign_pointer(netkit_priv(dev)->peer, peer);
-	rcu_assign_pointer(netkit_priv(peer)->peer, dev);
+	if (peer)
+		rcu_assign_pointer(netkit_priv(peer)->peer, dev);
 	return 0;
 err_configure_peer:
-	unregister_netdevice(peer);
+	if (peer)
+		unregister_netdevice(peer);
 	return err;
 err_register_peer:
 	free_netdev(peer);
@@ -516,6 +535,8 @@ static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 whi
 	nk = netkit_priv(dev);
 	if (!nk->primary)
 		return ERR_PTR(-EACCES);
+	if (nk->pair == NETKIT_DEVICE_SINGLE)
+		return ERR_PTR(-EOPNOTSUPP);
 	if (which == BPF_NETKIT_PEER) {
 		dev = rcu_dereference_rtnl(nk->peer);
 		if (!dev)
@@ -877,6 +898,7 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
 		{ IFLA_NETKIT_PEER_INFO,  "peer info" },
 		{ IFLA_NETKIT_HEADROOM,   "headroom" },
 		{ IFLA_NETKIT_TAILROOM,   "tailroom" },
+		{ IFLA_NETKIT_PAIRING,    "pairing" },
 	};
 
 	if (!nk->primary) {
@@ -896,9 +918,11 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
 	}
 
 	if (data[IFLA_NETKIT_POLICY]) {
+		err = -EOPNOTSUPP;
 		attr = data[IFLA_NETKIT_POLICY];
 		policy = nla_get_u32(attr);
-		err = netkit_check_policy(policy, attr, extack);
+		if (nk->pair == NETKIT_DEVICE_PAIR)
+			err = netkit_check_policy(policy, attr, extack);
 		if (err)
 			return err;
 		WRITE_ONCE(nk->policy, policy);
@@ -929,6 +953,7 @@ static size_t netkit_get_size(const struct net_device *dev)
 	       nla_total_size(sizeof(u8))  + /* IFLA_NETKIT_PRIMARY */
 	       nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_HEADROOM */
 	       nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_TAILROOM */
+	       nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PAIRING */
 	       0;
 }
 
@@ -949,6 +974,8 @@ static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
 		return -EMSGSIZE;
 	if (nla_put_u16(skb, IFLA_NETKIT_TAILROOM, dev->needed_tailroom))
 		return -EMSGSIZE;
+	if (nla_put_u32(skb, IFLA_NETKIT_PAIRING, nk->pair))
+		return -EMSGSIZE;
 
 	if (peer) {
 		nk = netkit_priv(peer);
@@ -970,6 +997,7 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
 	[IFLA_NETKIT_TAILROOM]		= { .type = NLA_U16 },
 	[IFLA_NETKIT_SCRUB]		= NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
 	[IFLA_NETKIT_PEER_SCRUB]	= NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
+	[IFLA_NETKIT_PAIRING]		= NLA_POLICY_MAX(NLA_U32, NETKIT_DEVICE_SINGLE),
 	[IFLA_NETKIT_PRIMARY]		= { .type = NLA_REJECT,
 					    .reject_message = "Primary attribute is read-only" },
 };
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 3b491d96e52e..bbd565757298 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1296,6 +1296,11 @@ enum netkit_mode {
 	NETKIT_L3,
 };
 
+enum netkit_pairing {
+	NETKIT_DEVICE_PAIR,
+	NETKIT_DEVICE_SINGLE,
+};
+
 /* NETKIT_SCRUB_NONE leaves clearing skb->{mark,priority} up to
  * the BPF program if attached. This also means the latter can
  * consume the two fields if they were populated earlier.
@@ -1320,6 +1325,7 @@ enum {
 	IFLA_NETKIT_PEER_SCRUB,
 	IFLA_NETKIT_HEADROOM,
 	IFLA_NETKIT_TAILROOM,
+	IFLA_NETKIT_PAIRING,
 	__IFLA_NETKIT_MAX,
 };
 #define IFLA_NETKIT_MAX	(__IFLA_NETKIT_MAX - 1)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (10 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 11/15] netkit: Add single device mode for netkit Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 13:02   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create Daniel Borkmann
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Instead of a comment, just use two cachline groups to document the intent
for members often accessed in fast or slow path.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 drivers/net/netkit.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index e3a2445d83fc..96734828bfb8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -16,18 +16,20 @@
 #define DRV_NAME "netkit"
 
 struct netkit {
-	/* Needed in fast-path */
+	__cacheline_group_begin(netkit_fastpath);
 	struct net_device __rcu *peer;
 	struct bpf_mprog_entry __rcu *active;
 	enum netkit_action policy;
 	enum netkit_scrub scrub;
 	struct bpf_mprog_bundle	bundle;
+	__cacheline_group_end(netkit_fastpath);
 
-	/* Needed in slow-path */
+	__cacheline_group_begin(netkit_slowpath);
 	enum netkit_mode mode;
 	enum netkit_pairing pair;
 	bool primary;
 	u32 headroom;
+	__cacheline_group_end(netkit_slowpath);
 };
 
 struct netkit_link {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (11 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 13:00   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
  2025-10-20 16:23 ` [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications Daniel Borkmann
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

Implement rtnl_link_ops->alloc that allows the number of rx queues to be
set when netkit is created. By default, netkit has only a single rxq (and
single txq). The number of queues is deliberately not allowed to be changed
via ethtool -L and is fixed for the lifetime of a netkit instance.

For netkit device creation, numrxqueues with larger than one rxq can be
specified. These rxqs are then mappable to real rxqs in physical netdevs:

  ip link add type netkit peer numrxqueues 64      # for device pair
  ip link add numrxqueues 64 type netkit single    # for single device

The limit of numrxqueues for netkit is currently set to 256, which allows
binding multiple real rxqs from physical netdevs.

The implementation of ndo_queue_create() adds a new rxq during the bind
queue operation. We allow to create queues either in single device mode or
for the case of dual device mode for the netkit peer device which gets
placed into the target network namespace. For dual device mode the bind
against the primary device does not make sense for the targeted use cases,
and therefore gets rejected.

We also need to add a lockdep class for netkit, such that lockdep does
not trip over us, similarly done as in commit 0bef512012b1 ("net: add
netdev_lockdep_set_classes() to virtual drivers").

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/netkit.c | 129 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 117 insertions(+), 12 deletions(-)

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 96734828bfb8..75b57496b72e 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -9,11 +9,20 @@
 #include <linux/bpf_mprog.h>
 #include <linux/indirect_call_wrapper.h>
 
+#include <net/netdev_lock.h>
+#include <net/netdev_queues.h>
+#include <net/netdev_rx_queue.h>
 #include <net/netkit.h>
 #include <net/dst.h>
 #include <net/tcx.h>
 
-#define DRV_NAME "netkit"
+#define NETKIT_DRV_NAME	"netkit"
+
+#define NETKIT_NUM_RX_QUEUES_MAX  256
+#define NETKIT_NUM_TX_QUEUES_MAX  1
+
+#define NETKIT_NUM_RX_QUEUES_REAL 1
+#define NETKIT_NUM_TX_QUEUES_REAL 1
 
 struct netkit {
 	__cacheline_group_begin(netkit_fastpath);
@@ -37,6 +46,8 @@ struct netkit_link {
 	struct net_device *dev;
 };
 
+static struct rtnl_link_ops netkit_link_ops;
+
 static __always_inline int
 netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
 	   enum netkit_action ret)
@@ -224,9 +235,16 @@ static void netkit_get_stats(struct net_device *dev,
 	stats->tx_dropped = DEV_STATS_READ(dev, tx_dropped);
 }
 
+static int netkit_init(struct net_device *dev)
+{
+	netdev_lockdep_set_classes(dev);
+	return 0;
+}
+
 static void netkit_uninit(struct net_device *dev);
 
 static const struct net_device_ops netkit_netdev_ops = {
+	.ndo_init		= netkit_init,
 	.ndo_open		= netkit_open,
 	.ndo_stop		= netkit_close,
 	.ndo_start_xmit		= netkit_xmit,
@@ -243,13 +261,99 @@ static const struct net_device_ops netkit_netdev_ops = {
 static void netkit_get_drvinfo(struct net_device *dev,
 			       struct ethtool_drvinfo *info)
 {
-	strscpy(info->driver, DRV_NAME, sizeof(info->driver));
+	strscpy(info->driver, NETKIT_DRV_NAME, sizeof(info->driver));
+}
+
+static void netkit_get_channels(struct net_device *dev,
+				struct ethtool_channels *channels)
+{
+	channels->max_rx = dev->num_rx_queues;
+	channels->max_tx = dev->num_tx_queues;
+	channels->max_other = 0;
+	channels->max_combined = 1;
+	channels->rx_count = dev->real_num_rx_queues;
+	channels->tx_count = dev->real_num_tx_queues;
+	channels->other_count = 0;
+	channels->combined_count = 0;
 }
 
 static const struct ethtool_ops netkit_ethtool_ops = {
 	.get_drvinfo		= netkit_get_drvinfo,
+	.get_channels		= netkit_get_channels,
 };
 
+static int netkit_queue_create(struct net_device *dev)
+{
+	struct netkit *nk = netkit_priv(dev);
+	u32 rxq_count_old, rxq_count_new;
+	int err;
+
+	rxq_count_old = dev->real_num_rx_queues;
+	rxq_count_new = rxq_count_old + 1;
+
+	/* Only allow to bind in single device mode or to bind against
+	 * the peer device which then ends up in the target netns.
+	 */
+	if (nk->pair == NETKIT_DEVICE_PAIR && nk->primary)
+		return -EOPNOTSUPP;
+
+	if (netif_running(dev))
+		netif_carrier_off(dev);
+	err = netif_set_real_num_rx_queues(dev, rxq_count_new);
+	if (netif_running(dev))
+		netif_carrier_on(dev);
+
+	return err ? err : rxq_count_new;
+}
+
+static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
+	.ndo_queue_create = netkit_queue_create,
+};
+
+static struct net_device *netkit_alloc(struct nlattr *tb[],
+				       const char *ifname,
+				       unsigned char name_assign_type,
+				       unsigned int num_tx_queues,
+				       unsigned int num_rx_queues)
+{
+	const struct rtnl_link_ops *ops = &netkit_link_ops;
+	struct net_device *dev;
+
+	if (num_tx_queues > NETKIT_NUM_TX_QUEUES_MAX ||
+	    num_rx_queues > NETKIT_NUM_RX_QUEUES_MAX)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	dev = alloc_netdev_mqs(ops->priv_size, ifname,
+			       name_assign_type, ops->setup,
+			       num_tx_queues, num_rx_queues);
+	if (dev) {
+		dev->real_num_tx_queues = NETKIT_NUM_TX_QUEUES_REAL;
+		dev->real_num_rx_queues = NETKIT_NUM_RX_QUEUES_REAL;
+	}
+	return dev;
+}
+
+static void netkit_queue_unpeer(struct net_device *dev)
+{
+	struct netdev_rx_queue *src_rxq, *dst_rxq;
+	struct net_device *src_dev;
+	int i;
+
+	if (dev->real_num_rx_queues == 1)
+		return;
+	netdev_lock(dev);
+	for (i = 1; i < dev->real_num_rx_queues; i++) {
+		dst_rxq = __netif_get_rx_queue(dev, i);
+		src_rxq = dst_rxq->peer;
+		src_dev = src_rxq->dev;
+
+		netdev_lock(src_dev);
+		netdev_rx_queue_unpeer(src_dev, src_rxq, dst_rxq);
+		netdev_unlock(src_dev);
+	}
+	netdev_unlock(dev);
+}
+
 static void netkit_setup(struct net_device *dev)
 {
 	static const netdev_features_t netkit_features_hw_vlan =
@@ -280,8 +384,9 @@ static void netkit_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_DISABLE_NETPOLL;
 	dev->lltx = true;
 
-	dev->ethtool_ops = &netkit_ethtool_ops;
-	dev->netdev_ops  = &netkit_netdev_ops;
+	dev->netdev_ops     = &netkit_netdev_ops;
+	dev->ethtool_ops    = &netkit_ethtool_ops;
+	dev->queue_mgmt_ops = &netkit_queue_mgmt_ops;
 
 	dev->features |= netkit_features;
 	dev->hw_features = netkit_features;
@@ -330,8 +435,6 @@ static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
 	return 0;
 }
 
-static struct rtnl_link_ops netkit_link_ops;
-
 static int netkit_new_link(struct net_device *dev,
 			   struct rtnl_newlink_params *params,
 			   struct netlink_ext_ack *extack)
@@ -865,6 +968,7 @@ static void netkit_release_all(struct net_device *dev)
 static void netkit_uninit(struct net_device *dev)
 {
 	netkit_release_all(dev);
+	netkit_queue_unpeer(dev);
 }
 
 static void netkit_del_link(struct net_device *dev, struct list_head *head)
@@ -1005,8 +1109,9 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
 };
 
 static struct rtnl_link_ops netkit_link_ops = {
-	.kind		= DRV_NAME,
+	.kind		= NETKIT_DRV_NAME,
 	.priv_size	= sizeof(struct netkit),
+	.alloc		= netkit_alloc,
 	.setup		= netkit_setup,
 	.newlink	= netkit_new_link,
 	.dellink	= netkit_del_link,
@@ -1020,7 +1125,7 @@ static struct rtnl_link_ops netkit_link_ops = {
 	.maxtype	= IFLA_NETKIT_MAX,
 };
 
-static __init int netkit_init(void)
+static __init int netkit_mod_init(void)
 {
 	BUILD_BUG_ON((int)NETKIT_NEXT != (int)TCX_NEXT ||
 		     (int)NETKIT_PASS != (int)TCX_PASS ||
@@ -1030,16 +1135,16 @@ static __init int netkit_init(void)
 	return rtnl_link_register(&netkit_link_ops);
 }
 
-static __exit void netkit_exit(void)
+static __exit void netkit_mod_exit(void)
 {
 	rtnl_link_unregister(&netkit_link_ops);
 }
 
-module_init(netkit_init);
-module_exit(netkit_exit);
+module_init(netkit_mod_init);
+module_exit(netkit_mod_exit);
 
 MODULE_DESCRIPTION("BPF-programmable network device");
 MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
 MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
 MODULE_LICENSE("GPL");
-MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+MODULE_ALIAS_RTNL_LINK(NETKIT_DRV_NAME);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (12 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 13:12   ` Nikolay Aleksandrov
  2025-10-20 16:23 ` [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications Daniel Borkmann
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

From: David Wei <dw@davidwei.uk>

This adds the last missing bit to netkit for supporting io_uring with
zero-copy mode [0]. Up until this point it was not possible to consume
the latter out of containers or Kubernetes Pods where applications are
in their own network namespace.

Thus, as a last missing bit, implement ndo_queue_get_dma_dev() in netkit
to return the physical device of the real rxq for DMA. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
mapped rxq in netkit.

io_uring example with eth0 being a physical device with 16 queues where
netkit is bound to the last queue, iou-zcrx.c is binary from selftests.
Flow steering to that queue is based on the service VIP:port of the
server utilizing io_uring:

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
  # ip netns add foo
  # ip link add type netkit peer numrxqueues 2
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do bind-queue \
                   --json "{"src-ifindex": $(ifindex eth0), "src-queue-id": 15, \
                            "dst-ifindex": $(ifindex nk0), "queue-type": "rx"}"
  {'dst-queue-id': 1}
  # ip link set nk0 netns foo
  # ip link set nk1 up
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk0 up
  # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
  [ ... setup routing etc to get external traffic into the netns ... ]
  # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1

Remote io_uring client:

  # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536

We have tested the above against a Broadcom BCM957504 (bnxt_en)
100G NIC, supporting TCP header/data split.

Similarly, this also works for devmem which we tested using ncdevmem:

  # ip netns exec foo ./ncdevmem -s 1.2.3.4 -l -p 5000 -f nk0 -t 1 -q 1

And on the remote client:

  # ./ncdevmem -s 1.2.3.4 -p 5000 -f eth0

For Cilium, the plan is to open up support for the various memory providers
for regular Kubernetes Pods when Cilium is configured with netkit datapath
mode.

Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring [0]
---
 drivers/net/netkit.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 75b57496b72e..a281b39a1047 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -282,6 +282,21 @@ static const struct ethtool_ops netkit_ethtool_ops = {
 	.get_channels		= netkit_get_channels,
 };
 
+static struct device *netkit_queue_get_dma_dev(struct net_device *dev, int idx)
+{
+	struct netdev_rx_queue *rxq, *peer_rxq;
+	unsigned int peer_idx;
+
+	rxq = __netif_get_rx_queue(dev, idx);
+	if (!rxq->peer)
+		return NULL;
+
+	peer_rxq = rxq->peer;
+	peer_idx = get_netdev_rx_queue_index(peer_rxq);
+
+	return netdev_queue_get_dma_dev(peer_rxq->dev, peer_idx);
+}
+
 static int netkit_queue_create(struct net_device *dev)
 {
 	struct netkit *nk = netkit_priv(dev);
@@ -307,7 +322,8 @@ static int netkit_queue_create(struct net_device *dev)
 }
 
 static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
-	.ndo_queue_create = netkit_queue_create,
+	.ndo_queue_get_dma_dev		= netkit_queue_get_dma_dev,
+	.ndo_queue_create		= netkit_queue_create,
 };
 
 static struct net_device *netkit_alloc(struct nlattr *tb[],
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications
  2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
                   ` (13 preceding siblings ...)
  2025-10-20 16:23 ` [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
@ 2025-10-20 16:23 ` Daniel Borkmann
  2025-10-22 14:27   ` Nikolay Aleksandrov
  14 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-20 16:23 UTC (permalink / raw)
  To: netdev
  Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP
from network namespaces. The use-case from Cilium side is to support
Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
virtual machine management add-on for Kubernetes which aims to provide
a common ground for virtualization. KubeVirt spawns the VMs inside
Kubernetes Pods which reside in their own network namespace just like
regular Pods.

Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues where netkit is bound to the last queue (for multi-queue RSS
context can be used if supported by the driver):

  # ethtool -X eth0 start 0 equal 15
  # ethtool -X eth0 start 15 equal 1 context new
  # ethtool --config-ntuple eth0 flow-type ether \
            src 00:00:00:00:00:00 \
            src-mask ff:ff:ff:ff:ff:ff \
            dst $mac dst-mask 00:00:00:00:00:00 \
            proto 0 proto-mask 0xffff action 15
  [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
  # ip netns add foo
  # ip link add numrxqueues 2 nk type netkit single
  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
                   --do bind-queue \
                   --json "{"src-ifindex": $(ifindex eth0), "src-queue-id": 15, \
                            "dst-ifindex": $(ifindex nk), "queue-type": "rx"}"
  {'dst-queue-id': 1}
  # ip link set nk netns foo
  # ip netns exec foo ip link set lo up
  # ip netns exec foo ip link set nk up
  # ip netns exec foo qemu-system-x86_64 \
          -kernel $kernel \
          -drive file=${image_name},index=0,media=disk,format=raw \
          -append "root=/dev/sda rw console=ttyS0" \
          -cpu host \
          -m $memory \
          -enable-kvm \
          -device virtio-net-pci,netdev=net0,mac=$mac \
          -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
          -nographic

We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0].

For getting to a first starting point to connect all things with
KubeVirt, bind mounting the xsk map from Cilium into the VM launcher
Pod which acts as a regular Kubernetes Pod while not perfect, is not
a big problem given its out of reach from the application sitting
inside the VM (and some of the control plane aspects are baked in
the launcher Pod already), so the isolation barrier is still the VM.
Eventually the goal is to have a XDP/XSK redirect extension where
there is no need to have the xsk map, and the BPF program can just
derive the target xsk through the queue where traffic was received
on.

The exposure through netkit is because Cilium should not act as a
proxy handing out xsk sockets. Existing applications expect a netdev
from kernel side and should not need to rewrite just to implement
against a CNI's protocol. Also, all the memory should not be accounted
against Cilium but rather the application Pod itself which is consuming
AF_XDP. Further, on up/downgrades we expect the data plane to being
completely decoupled from the control plane; if Cilium would own the
sockets that would be disruptive. Another use-case which opens up and
is regularly asked from users would be to have DPDK applications on
top of AF_XDP in regular Kubernetes Pods.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
---
 drivers/net/netkit.c | 71 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index a281b39a1047..f69abe5ec4cd 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -12,6 +12,7 @@
 #include <net/netdev_lock.h>
 #include <net/netdev_queues.h>
 #include <net/netdev_rx_queue.h>
+#include <net/xdp_sock_drv.h>
 #include <net/netkit.h>
 #include <net/dst.h>
 #include <net/tcx.h>
@@ -235,6 +236,71 @@ static void netkit_get_stats(struct net_device *dev,
 	stats->tx_dropped = DEV_STATS_READ(dev, tx_dropped);
 }
 
+static bool netkit_xsk_supported_at_phys(const struct net_device *dev)
+{
+	if (!dev->netdev_ops->ndo_bpf ||
+	    !dev->netdev_ops->ndo_xdp_xmit ||
+	    !dev->netdev_ops->ndo_xsk_wakeup)
+		return false;
+	if ((dev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK)
+		return false;
+	return true;
+}
+
+static int netkit_xsk(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	struct netkit *nk = netkit_priv(dev);
+	struct netdev_bpf xdp_lower;
+	struct netdev_rx_queue *rxq;
+	struct net_device *phys;
+
+	switch (xdp->command) {
+	case XDP_SETUP_XSK_POOL:
+		if (nk->pair == NETKIT_DEVICE_PAIR)
+			return -EOPNOTSUPP;
+		if (xdp->xsk.queue_id >= dev->real_num_rx_queues)
+			return -EINVAL;
+
+		rxq = __netif_get_rx_queue(dev, xdp->xsk.queue_id);
+		if (!rxq->peer)
+			return -EOPNOTSUPP;
+
+		phys = rxq->peer->dev;
+		if (!netkit_xsk_supported_at_phys(phys))
+			return -EOPNOTSUPP;
+
+		memcpy(&xdp_lower, xdp, sizeof(xdp_lower));
+		xdp_lower.xsk.queue_id = get_netdev_rx_queue_index(rxq->peer);
+		break;
+	case XDP_SETUP_PROG:
+		return -EPERM;
+	default:
+		return -EINVAL;
+	}
+
+	return phys->netdev_ops->ndo_bpf(phys, &xdp_lower);
+}
+
+static int netkit_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
+{
+	struct netdev_rx_queue *rxq;
+	struct net_device *phys;
+
+	if (queue_id >= dev->real_num_rx_queues)
+		return -EINVAL;
+
+	rxq = __netif_get_rx_queue(dev, queue_id);
+	if (!rxq->peer)
+		return -EOPNOTSUPP;
+
+	phys = rxq->peer->dev;
+	if (!netkit_xsk_supported_at_phys(phys))
+		return -EOPNOTSUPP;
+
+	return phys->netdev_ops->ndo_xsk_wakeup(phys,
+			get_netdev_rx_queue_index(rxq->peer), flags);
+}
+
 static int netkit_init(struct net_device *dev)
 {
 	netdev_lockdep_set_classes(dev);
@@ -255,6 +321,8 @@ static const struct net_device_ops netkit_netdev_ops = {
 	.ndo_get_peer_dev	= netkit_peer_dev,
 	.ndo_get_stats64	= netkit_get_stats,
 	.ndo_uninit		= netkit_uninit,
+	.ndo_bpf		= netkit_xsk,
+	.ndo_xsk_wakeup		= netkit_xsk_wakeup,
 	.ndo_features_check	= passthru_features_check,
 };
 
@@ -409,10 +477,11 @@ static void netkit_setup(struct net_device *dev)
 	dev->hw_enc_features = netkit_features;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 	dev->vlan_features = dev->features & ~netkit_features_hw_vlan;
-
 	dev->needs_free_netdev = true;
 
 	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
+
+	xdp_set_features_flag(dev, NETDEV_XDP_ACT_XSK);
 }
 
 static struct net *netkit_get_link_net(const struct net_device *dev)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
@ 2025-10-22 11:17   ` Nikolay Aleksandrov
  2025-10-22 11:26     ` Daniel Borkmann
  2025-10-23 10:17   ` Paolo Abeni
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 11:17 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> Implement netdev_nl_bind_queue_doit() that creates an rx queue in a
> virtual netdev and then binds it to an rxq in a real netdev to create
> a queue pair.
> 
> Example with ynl client:
> 
>   # ./pyynl/cli.py \
>       --spec ~/netlink/specs/netdev.yaml \
>       --do bind-queue \
>       --json '{"src-ifindex": 4, "src-queue-id": 15, "dst-ifindex": 8, "queue-type": "rx"}'
>   {'dst-queue-id': 1}
> 
> Note that the netdevice locking order is always from the virtual to
> the physical device.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/net/netdev_queues.h   |   5 ++
>  include/net/netdev_rx_queue.h |  36 ++++++++-
>  net/core/netdev-genl.c        | 141 +++++++++++++++++++++++++++++++++-
>  net/core/netdev_rx_queue.c    |  61 +++++++++++++++
>  4 files changed, 240 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index cd00e0406cf4..286d5edce07d 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -130,6 +130,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   * @ndo_queue_get_dma_dev: Get dma device for zero-copy operations to be used
>   *			   for this queue. Return NULL on error.
>   *
> + * @ndo_queue_create: Create a new RX queue which can be bound to another queue.
> + *		      Ops on this queue are redirected to the peer queue e.g.
> + *		      when opening a memory provider.
> + *

It'd be nice to mention what the expected return value can be. See more below.

>   * Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
>   * the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
>   * be called for an interface which is open.
> @@ -149,6 +153,7 @@ struct netdev_queue_mgmt_ops {
>  						  int idx);
>  	struct device *		(*ndo_queue_get_dma_dev)(struct net_device *dev,
>  							 int idx);
> +	int			(*ndo_queue_create)(struct net_device *dev);
>  };
>  
>  bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
> diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
> index 8cdcd138b33f..db3ef94c0744 100644
> --- a/include/net/netdev_rx_queue.h
> +++ b/include/net/netdev_rx_queue.h
> @@ -28,6 +28,7 @@ struct netdev_rx_queue {
>  #endif
>  	struct napi_struct		*napi;
>  	struct pp_memory_provider_params mp_params;
> +	struct netdev_rx_queue		*peer;
>  } ____cacheline_aligned_in_smp;
>  
>  /*
> @@ -56,6 +57,37 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
>  	return index;
>  }
>  
> -int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
> +static inline void __netdev_rx_queue_peer(struct netdev_rx_queue *src_rxq,
> +					  struct netdev_rx_queue *dst_rxq)
> +{
> +	src_rxq->peer = dst_rxq;
> +	dst_rxq->peer = src_rxq;
> +}
>  
> -#endif
> +static inline void __netdev_rx_queue_unpeer(struct netdev_rx_queue *src_rxq,
> +					    struct netdev_rx_queue *dst_rxq)
> +{
> +	src_rxq->peer = NULL;
> +	dst_rxq->peer = NULL;
> +}
> +
> +static inline bool netdev_rx_queue_peered(struct net_device *dev,
> +					  u16 queue_id)
> +{
> +	if (queue_id < dev->real_num_rx_queues)
> +		return dev->_rx[queue_id].peer;
> +	return false;
> +}
> +
> +void netdev_rx_queue_peer(struct net_device *src_dev,
> +			  struct netdev_rx_queue *src_rxq,
> +			  struct netdev_rx_queue *dst_rxq);
> +void netdev_rx_queue_unpeer(struct net_device *src_dev,
> +			    struct netdev_rx_queue *src_rxq,
> +			    struct netdev_rx_queue *dst_rxq);
> +int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
> +struct netdev_rx_queue *
> +netif_get_rx_queue_peer_locked(struct net_device **dev,
> +			       unsigned int *rxq_idx,
> +			       bool *needs_unlock);
> +#endif /* _LINUX_NETDEV_RX_QUEUE_H */
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index ce1018ea390f..579469abac8c 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -1122,7 +1122,146 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>  
>  int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
>  {
> -	return -EOPNOTSUPP;
> +	u32 src_ifidx, src_qid, dst_ifidx, dst_qid, q_type;
> +	struct netdev_rx_queue *src_rxq, *dst_rxq, *tmp_rxq;
> +	struct net_device *src_dev, *dst_dev;
> +	struct sk_buff *rsp;
> +	int err = 0;
> +	void *hdr;

nit: reverse xmas tree order

> +
> +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_QUEUE_TYPE) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_IFINDEX) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_DST_IFINDEX))
> +		return -EINVAL;
> +
> +	src_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX]);
> +	src_qid = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID]);
> +	dst_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_DST_IFINDEX]);
> +	q_type = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_QUEUE_TYPE]);
> +
> +	if (q_type != NETDEV_QUEUE_TYPE_RX) {
> +		NL_SET_ERR_MSG(info->extack, "Only binding of RX queue supported");
> +		return -EOPNOTSUPP;
> +	}
> +	if (dst_ifidx == src_ifidx) {
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination driver cannot be same as source driver");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!rsp)
> +		return -ENOMEM;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr) {
> +		err = -EMSGSIZE;
> +		goto err_genlmsg_free;
> +	}
> +
> +	/* Locking order is always from the virtual to the physical device
> +	 * since this is also the same order when applications open the
> +	 * memory provider later on.
> +	 */
> +	dst_dev = netdev_get_by_index_lock(genl_info_net(info), dst_ifidx);
> +	if (!dst_dev) {
> +		err = -ENODEV;
> +		goto err_genlmsg_free;
> +	}
> +	if (dst_dev->dev.parent) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination device is not a virtual device");
> +		goto err_unlock_dst_dev;
> +	}
> +	if (!dst_dev->queue_mgmt_ops ||
> +	    !dst_dev->queue_mgmt_ops->ndo_queue_create) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination driver does not support queue management operations");
> +		goto err_unlock_dst_dev;
> +	}
> +	if (dst_dev->real_num_rx_queues < 1) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination device must have at least one real RX queue");
> +		goto err_unlock_dst_dev;
> +	}
> +
> +	src_dev = netdev_get_by_index_lock(genl_info_net(info), src_ifidx);
> +	if (!src_dev) {
> +		err = -ENODEV;
> +		goto err_unlock_dst_dev;
> +	}
> +	if (!src_dev->dev.parent) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source device is a virtual device");
> +		goto err_unlock_src_dev;
> +	}
> +	if (!netif_device_present(src_dev)) {
> +		err = -ENODEV;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source device has been removed from the system");
> +		goto err_unlock_src_dev;
> +	}
> +	if (!src_dev->queue_mgmt_ops) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source driver does not support queue management operations");
> +		goto err_unlock_src_dev;
> +	}
> +	if (src_qid >= src_dev->num_rx_queues) {
> +		err = -ERANGE;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source device queue is out of range");
> +		goto err_unlock_src_dev;
> +	}
> +
> +	src_rxq = __netif_get_rx_queue(src_dev, src_qid);
> +	if (src_rxq->peer) {
> +		err = -EBUSY;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source device queue is already bound");
> +		goto err_unlock_src_dev;
> +	}
> +
> +	tmp_rxq = __netif_get_rx_queue(dst_dev, dst_dev->real_num_rx_queues - 1);
> +	if (tmp_rxq->peer && tmp_rxq->peer->dev != src_dev) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Binding multiple queues from difference source devices not supported");

s/difference/different/

> +		goto err_unlock_src_dev;
> +	}
> +
> +	err = dst_dev->queue_mgmt_ops->ndo_queue_create(dst_dev);
> +	if (err <= 0) {

<= 0 is a bit weird, if 0 signals an error perhaps "err" must be set?

Maybe directly use dst_qid above and set "err" appropriately to better
demonstrate what's expected?

> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination device is unable to create a new queue");
> +		goto err_unlock_src_dev;
> +	}
> +
> +	dst_qid = err - 1;
> +	dst_rxq = __netif_get_rx_queue(dst_dev, dst_qid);
> +
> +	netdev_rx_queue_peer(src_dev, src_rxq, dst_rxq);
> +
> +	nla_put_u32(rsp, NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID, dst_qid);
> +	genlmsg_end(rsp, hdr);
> +
> +	netdev_unlock(src_dev);
> +	netdev_unlock(dst_dev);
> +
> +	return genlmsg_reply(rsp, info);
> +
> +err_unlock_src_dev:
> +	netdev_unlock(src_dev);
> +err_unlock_dst_dev:
> +	netdev_unlock(dst_dev);
> +err_genlmsg_free:
> +	nlmsg_free(rsp);
> +	return err;
>  }
>  
>  void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index c7d9341b7630..916ca8d7ae7c 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -18,6 +18,67 @@ bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx)
>  }
>  EXPORT_SYMBOL(netif_rxq_has_unreadable_mp);
>  
> +void netdev_rx_queue_peer(struct net_device *src_dev,
> +			  struct netdev_rx_queue *src_rxq,
> +			  struct netdev_rx_queue *dst_rxq)
> +{
> +	netdev_assert_locked(src_dev);
> +	netdev_assert_locked(dst_rxq->dev);
> +
> +	netdev_hold(src_dev, &src_rxq->dev_tracker, GFP_KERNEL);
> +	__netdev_rx_queue_peer(src_rxq, dst_rxq);
> +}
> +
> +void netdev_rx_queue_unpeer(struct net_device *src_dev,
> +			    struct netdev_rx_queue *src_rxq,
> +			    struct netdev_rx_queue *dst_rxq)
> +{
> +	WARN_ON_ONCE(READ_ONCE(dst_rxq->dev->reg_state) != NETREG_UNREGISTERING);
> +
> +	netdev_assert_locked(dst_rxq->dev);
> +	netdev_assert_locked(src_dev);
> +
> +	__netdev_rx_queue_unpeer(src_rxq, dst_rxq);
> +	netdev_put(src_dev, &src_rxq->dev_tracker);
> +}
> +
> +static struct netdev_rx_queue *
> +__netif_get_rx_queue_peer(struct net_device **dev, unsigned int *rxq_idx,
> +			  bool virt_to_phys_only)
> +{
> +	struct net_device *req_dev = *dev;
> +	struct netdev_rx_queue *rxq = __netif_get_rx_queue(req_dev, *rxq_idx);
> +
> +	if (rxq->peer) {
> +		if (virt_to_phys_only &&
> +		    req_dev->dev.parent)
> +			return NULL;
> +		rxq = rxq->peer;
> +		*rxq_idx = get_netdev_rx_queue_index(rxq);
> +		*dev = rxq->dev;
> +	}
> +	return rxq;
> +}
> +
> +struct netdev_rx_queue *
> +netif_get_rx_queue_peer_locked(struct net_device **dev, unsigned int *rxq_idx,
> +			       bool *needs_unlock)
> +{
> +	struct net_device *req_dev = *dev;
> +	struct netdev_rx_queue *rxq;
> +
> +	/* Locking order is always from the virtual to the physical device
> +	 * see netdev_nl_bind_queue_doit().
> +	 */
> +	netdev_ops_assert_locked(req_dev);
> +	rxq = __netif_get_rx_queue_peer(dev, rxq_idx, true);
> +	if (rxq && req_dev != *dev) {
> +		*needs_unlock = true;
> +		netdev_lock(*dev);
> +	}
> +	return rxq;
> +}
> +>  int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
>  {
>  	struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
@ 2025-10-22 11:19   ` Nikolay Aleksandrov
  2025-10-24  2:12   ` Jakub Kicinski
  1 sibling, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 11:19 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> Add a ynl netdev family operation called bind-queue that creates a new
> rx queue in a virtual netdev (i.e. netkit or veth) and binds it to an rx
> queue in a real netdev. This forms a queue pair, where the peer queue of
> the pair in the virtual netdev acts as a proxy for the peer queue in the
> real netdev. Thus, the peer queue in the virtual netdev can be used by
> processes running in a container to use both memory providers (io_uring
> zero-copy rx and devmem) and AF_XDP. An early implementation had only
> driver-specific integration [0], but in order for other virtual devices
> to reuse, it makes sense to have this as a generic API.
> 
> src-ifindex and src-queue-id is the real netdev and its rx queue id
> respectively. dst-ifindex is the virtual netdev. Note that this op doesn't
> take dst-queue-id because a new rx queue is created. The virtual netdev
> must have real_num_rx_queues less than num_rx_queues at the time of
> calling bind-queue. The queue-type must be rx as only rx queues are
> supported for now.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
> ---
>  Documentation/netlink/specs/netdev.yaml | 60 +++++++++++++++++++++++++
>  include/uapi/linux/netdev.h             | 12 +++++
>  net/core/netdev-genl-gen.c              | 25 +++++++++++
>  net/core/netdev-genl-gen.h              |  1 +
>  net/core/netdev-genl.c                  |  5 +++
>  tools/include/uapi/linux/netdev.h       | 12 +++++
>  6 files changed, 115 insertions(+)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index e00d3fa1c152..20bb00b7e9ac 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -561,6 +561,46 @@ attribute-sets:
>          type: u32
>          checks:
>            min: 1
> +  -
> +    name: queue-pair
> +    attributes:
> +      -
> +        name: queue-type
> +        doc: |
> +          Queue type as rx, tx, for src-queue-id and dst-queue-id.
> +          Currently only pairing queues of type rx is supported.
> +        type: u32
> +        enum: queue-type
> +      -
> +        name: src-ifindex
> +        doc: |
> +          Specifies the netdev ifindex of the physical device to pair
> +          src-queue-id from.
> +        type: u32
> +        checks:
> +          min: 1
> +          max: s32-max
> +      -
> +        name: src-queue-id
> +        doc: |
> +          Specifies the netdev queue id of the physical device with
> +          src-ifindex to pair a queue from.
> +        type: u32
> +      -
> +        name: dst-ifindex
> +        doc: |
> +          Specifies the netdev ifindex of the virtual device to pair
> +          a new queue with the src-queue-id from src-ifindex.
> +        type: u32
> +        checks:
> +          min: 1
> +          max: s32-max
> +      -
> +        name: dst-queue-id
> +        doc: |
> +          Specifies the new netdev queue id of the virtual device after
> +          a successful pairing operation.
> +        type: u32
>  
>  operations:
>    list:
> @@ -772,6 +812,26 @@ operations:
>            attributes:
>              - id
>  
> +    -
> +      name: bind-queue
> +      doc: |
> +        Bind a physical netdevice queue to a virtual one. The binding
> +        creates a queue pair, where a queue can reference its peer queue.
> +        This is useful for memory providers and AF_XDP operations which
> +        take an ifindex and queue id to allow auch applications to bind
> +        against virtual devices in containers.
> +      attribute-set: queue-pair
> +      do:
> +        request:
> +          attributes:
> +            - queue-type
> +            - src-ifindex
> +            - src-queue-id
> +            - dst-ifindex
> +        reply:
> +          attributes:
> +            - dst-queue-id
> +
>  kernel-family:
>    headers: ["net/netdev_netlink.h"]
>    sock-priv: struct netdev_nl_sock
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 48eb49aa03d4..4ef04d0bc412 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -210,6 +210,17 @@ enum {
>  	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
>  };
>  
> +enum {
> +	NETDEV_A_QUEUE_PAIR_QUEUE_TYPE = 1,
> +	NETDEV_A_QUEUE_PAIR_SRC_IFINDEX,
> +	NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
> +	NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
> +	NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
> +
> +	__NETDEV_A_QUEUE_PAIR_MAX,
> +	NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
> +};
> +
>  enum {
>  	NETDEV_CMD_DEV_GET = 1,
>  	NETDEV_CMD_DEV_ADD_NTF,
> @@ -226,6 +237,7 @@ enum {
>  	NETDEV_CMD_BIND_RX,
>  	NETDEV_CMD_NAPI_SET,
>  	NETDEV_CMD_BIND_TX,
> +	NETDEV_CMD_BIND_QUEUE,
>  
>  	__NETDEV_CMD_MAX,
>  	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index e9a2a6f26cb7..69f8126c3e42 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -26,6 +26,16 @@ static const struct netlink_range_validation netdev_a_napi_defer_hard_irqs_range
>  	.max	= S32_MAX,
>  };
>  
> +static const struct netlink_range_validation netdev_a_queue_pair_src_ifindex_range = {
> +	.min	= 1ULL,
> +	.max	= S32_MAX,
> +};
> +
> +static const struct netlink_range_validation netdev_a_queue_pair_dst_ifindex_range = {
> +	.min	= 1ULL,
> +	.max	= S32_MAX,
> +};
> +
>  /* Common nested types */
>  const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1] = {
>  	[NETDEV_A_PAGE_POOL_ID] = NLA_POLICY_FULL_RANGE(NLA_UINT, &netdev_a_page_pool_id_range),
> @@ -106,6 +116,14 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
>  	[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
>  };
>  
> +/* NETDEV_CMD_BIND_QUEUE - do */
> +static const struct nla_policy netdev_bind_queue_nl_policy[NETDEV_A_QUEUE_PAIR_DST_IFINDEX + 1] = {
> +	[NETDEV_A_QUEUE_PAIR_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1),
> +	[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_queue_pair_src_ifindex_range),
> +	[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID] = { .type = NLA_U32, },
> +	[NETDEV_A_QUEUE_PAIR_DST_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_queue_pair_dst_ifindex_range),
> +};
> +
>  /* Ops table for netdev */
>  static const struct genl_split_ops netdev_nl_ops[] = {
>  	{
> @@ -204,6 +222,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
>  		.maxattr	= NETDEV_A_DMABUF_FD,
>  		.flags		= GENL_CMD_CAP_DO,
>  	},
> +	{
> +		.cmd		= NETDEV_CMD_BIND_QUEUE,
> +		.doit		= netdev_nl_bind_queue_doit,
> +		.policy		= netdev_bind_queue_nl_policy,
> +		.maxattr	= NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
> +		.flags		= GENL_CMD_CAP_DO,
> +	},
>  };
>  
>  static const struct genl_multicast_group netdev_nl_mcgrps[] = {
> diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
> index cf3fad74511f..309248fe2b9e 100644
> --- a/net/core/netdev-genl-gen.h
> +++ b/net/core/netdev-genl-gen.h
> @@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
>  int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
>  int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
>  int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
> +int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info);
>  
>  enum {
>  	NETDEV_NLGRP_MGMT,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 470fabbeacd9..ce1018ea390f 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -1120,6 +1120,11 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>  	return err;
>  }
>  
> +int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>  void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
>  {
>  	INIT_LIST_HEAD(&priv->bindings);
> diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
> index 48eb49aa03d4..4ef04d0bc412 100644
> --- a/tools/include/uapi/linux/netdev.h
> +++ b/tools/include/uapi/linux/netdev.h
> @@ -210,6 +210,17 @@ enum {
>  	NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
>  };
>  
> +enum {
> +	NETDEV_A_QUEUE_PAIR_QUEUE_TYPE = 1,
> +	NETDEV_A_QUEUE_PAIR_SRC_IFINDEX,
> +	NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
> +	NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
> +	NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
> +
> +	__NETDEV_A_QUEUE_PAIR_MAX,
> +	NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
> +};
> +
>  enum {
>  	NETDEV_CMD_DEV_GET = 1,
>  	NETDEV_CMD_DEV_ADD_NTF,
> @@ -226,6 +237,7 @@ enum {
>  	NETDEV_CMD_BIND_RX,
>  	NETDEV_CMD_NAPI_SET,
>  	NETDEV_CMD_BIND_TX,
> +	NETDEV_CMD_BIND_QUEUE,
>  
>  	__NETDEV_CMD_MAX,
>  	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-20 16:23 ` [PATCH net-next v3 03/15] net: Add peer info to queue-get response Daniel Borkmann
@ 2025-10-22 11:23   ` Nikolay Aleksandrov
  2025-10-24  2:33   ` Jakub Kicinski
  1 sibling, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 11:23 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> Add a nested peer field to the queue-get response that returns the peered
> ifindex and queue id.
> 
> Example with ynl client:
> 
>   # ip netns exec foo ./pyynl/cli.py \
>       --spec ~/netlink/specs/netdev.yaml \
>       --do queue-get \
>       --json '{"ifindex": 3, "id": 1, "type": "rx"}'
>   {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}
> 
> Note that the caller of netdev_nl_queue_fill_one() holds the netdevice
> lock. For the queue-get we do not lock both devices. When queues get
> {un,}peered, both devices are locked, thus if netdev_rx_queue_peered()
> returns true, the peer pointer points to a valid device. The netns-id
> is fetched via peernet2id_alloc() similarly as done in OVS.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  Documentation/netlink/specs/netdev.yaml | 24 ++++++++++++++++++
>  include/net/netdev_rx_queue.h           |  3 +++
>  include/uapi/linux/netdev.h             | 10 ++++++++
>  net/core/netdev-genl.c                  | 33 +++++++++++++++++++++++--
>  net/core/netdev_rx_queue.c              |  8 ++++++
>  tools/include/uapi/linux/netdev.h       | 10 ++++++++
>  6 files changed, 86 insertions(+), 2 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized
  2025-10-20 16:23 ` [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized Daniel Borkmann
@ 2025-10-22 11:25   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 11:25 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Similar to AF_XDP, do not allow queues in a physical netdev to be
> resized by ethtool -L when they are peered.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  include/linux/ethtool.h |  1 +
>  net/ethtool/channels.c  | 12 ++++++------
>  net/ethtool/common.c    | 10 +++++++++-
>  net/ethtool/ioctl.c     |  4 ++--
>  4 files changed, 18 insertions(+), 9 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-22 11:17   ` Nikolay Aleksandrov
@ 2025-10-22 11:26     ` Daniel Borkmann
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-22 11:26 UTC (permalink / raw)
  To: Nikolay Aleksandrov, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/22/25 1:17 PM, Nikolay Aleksandrov wrote:
> On 10/20/25 19:23, Daniel Borkmann wrote:
[...]
>> +		goto err_unlock_src_dev;
>> +	}
>> +
>> +	err = dst_dev->queue_mgmt_ops->ndo_queue_create(dst_dev);
>> +	if (err <= 0) {
> 
> <= 0 is a bit weird, if 0 signals an error perhaps "err" must be set?
> 
> Maybe directly use dst_qid above and set "err" appropriately to better
> demonstrate what's expected?
Ok, yeah makes sense, we can pass the queue id as a param which the
ndo callback needs to fill out on success and then the return is either
error or no error. Seems simpler, will address in v4 then.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues
  2025-10-20 16:23 ` [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
@ 2025-10-22 12:50   ` Nikolay Aleksandrov
  2025-10-24 18:36   ` Stanislav Fomichev
  1 sibling, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 12:50 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> When a process in a container wants to setup a memory provider, it will
> use the virtual netdev and a mapped rxq, and call net_mp_{open,close}_rxq
> to try and restart the queue. At this point, proxy the queue restart on
> the real rxq in the physical netdev.
> 
> For memory providers (io_uring zero-copy rx and devmem), it causes the
> real rxq in the physical netdev to be filled from a memory provider that
> has DMA mapped memory from a process within a container.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/net/page_pool/memory_provider.h |  4 +-
>  net/core/netdev_rx_queue.c              | 57 +++++++++++++++++--------
>  2 files changed, 41 insertions(+), 20 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header
  2025-10-20 16:23 ` [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header Daniel Borkmann
@ 2025-10-22 12:51   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 12:51 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Move NETDEV_XDP_ACT_ZC into xdp_sock_drv.h header such that external code
> can reuse it, and rename it into more generic NETDEV_XDP_ACT_XSK.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
>  include/net/xdp_sock_drv.h | 4 ++++
>  net/xdp/xsk_buff_pool.c    | 6 +-----
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 4f2d3268a676..242e34f771cc 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -12,6 +12,10 @@
>  #define XDP_UMEM_MIN_CHUNK_SHIFT 11
>  #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
>  
> +#define NETDEV_XDP_ACT_XSK	(NETDEV_XDP_ACT_BASIC |		\
> +				 NETDEV_XDP_ACT_REDIRECT |	\
> +				 NETDEV_XDP_ACT_XSK_ZEROCOPY)
> +
>  struct xsk_cb_desc {
>  	void *src;
>  	u8 off;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index aa9788f20d0d..26165baf99f4 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -158,10 +158,6 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  	}
>  }
>  
> -#define NETDEV_XDP_ACT_ZC	(NETDEV_XDP_ACT_BASIC |		\
> -				 NETDEV_XDP_ACT_REDIRECT |	\
> -				 NETDEV_XDP_ACT_XSK_ZEROCOPY)
> -
>  int xp_assign_dev(struct xsk_buff_pool *pool,
>  		  struct net_device *netdev, u16 queue_id, u16 flags)
>  {
> @@ -203,7 +199,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		/* For copy-mode, we are done. */
>  		return 0;
>  
> -	if ((netdev->xdp_features & NETDEV_XDP_ACT_ZC) != NETDEV_XDP_ACT_ZC) {
> +	if ((netdev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK) {
>  		err = -EOPNOTSUPP;
>  		goto err_unreg_pool;
>  	}

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 07/15] xsk: Move pool registration into single function
  2025-10-20 16:23 ` [PATCH net-next v3 07/15] xsk: Move pool registration into single function Daniel Borkmann
@ 2025-10-22 12:52   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 12:52 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Small refactor to move the pool registration into xsk_reg_pool_at_qid,
> such that the netdev and queue_id can be registered there. No change
> in functionality.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  net/xdp/xsk.c           | 5 +++++
>  net/xdp/xsk_buff_pool.c | 5 -----
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7b0c68a70888..0e9a385f5680 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -141,6 +141,11 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
>  			      dev->real_num_rx_queues,
>  			      dev->real_num_tx_queues))
>  		return -EINVAL;
> +	if (xsk_get_pool_from_qid(dev, queue_id))
> +		return -EBUSY;
> +
> +	pool->netdev = dev;
> +	pool->queue_id = queue_id;
>  
>  	if (queue_id < dev->real_num_rx_queues)
>  		dev->_rx[queue_id].pool = pool;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 26165baf99f4..62a176996f02 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -173,11 +173,6 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  	if (force_zc && force_copy)
>  		return -EINVAL;
>  
> -	if (xsk_get_pool_from_qid(netdev, queue_id))
> -		return -EBUSY;
> -
> -	pool->netdev = netdev;
> -	pool->queue_id = queue_id;
>  	err = xsk_reg_pool_at_qid(netdev, pool, queue_id);
>  	if (err)
>  		return err;

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable
  2025-10-20 16:23 ` [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable Daniel Borkmann
@ 2025-10-22 12:52   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 12:52 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Add another small helper called xp_pool_bindable and move the current
> dev_get_min_mp_channel_count test into this helper. Pass in the pool
> object, such that we derive the netdev from the prior registered pool.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  net/xdp/xsk_buff_pool.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 62a176996f02..701be6a5b074 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -54,6 +54,11 @@ int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
>  	return 0;
>  }
>  
> +static bool xp_pool_bindable(struct xsk_buff_pool *pool)
> +{
> +	return dev_get_min_mp_channel_count(pool->netdev) == 0;
> +}
> +
>  struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>  						struct xdp_umem *umem)
>  {
> @@ -204,7 +209,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>  		goto err_unreg_pool;
>  	}
>  
> -	if (dev_get_min_mp_channel_count(netdev)) {
> +	if (!xp_pool_bindable(pool)) {
>  		err = -EBUSY;
>  		goto err_unreg_pool;
>  	}

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create
  2025-10-20 16:23 ` [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create Daniel Borkmann
@ 2025-10-22 13:00   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 13:00 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> Implement rtnl_link_ops->alloc that allows the number of rx queues to be
> set when netkit is created. By default, netkit has only a single rxq (and
> single txq). The number of queues is deliberately not allowed to be changed
> via ethtool -L and is fixed for the lifetime of a netkit instance.
> 
> For netkit device creation, numrxqueues with larger than one rxq can be
> specified. These rxqs are then mappable to real rxqs in physical netdevs:
> 
>   ip link add type netkit peer numrxqueues 64      # for device pair
>   ip link add numrxqueues 64 type netkit single    # for single device
> 
> The limit of numrxqueues for netkit is currently set to 256, which allows
> binding multiple real rxqs from physical netdevs.
> 
> The implementation of ndo_queue_create() adds a new rxq during the bind
> queue operation. We allow to create queues either in single device mode or
> for the case of dual device mode for the netkit peer device which gets
> placed into the target network namespace. For dual device mode the bind
> against the primary device does not make sense for the targeted use cases,
> and therefore gets rejected.
> 
> We also need to add a lockdep class for netkit, such that lockdep does
> not trip over us, similarly done as in commit 0bef512012b1 ("net: add
> netdev_lockdep_set_classes() to virtual drivers").
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  drivers/net/netkit.c | 129 +++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 117 insertions(+), 12 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros
  2025-10-20 16:23 ` [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros Daniel Borkmann
@ 2025-10-22 13:02   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 13:02 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Instead of a comment, just use two cachline groups to document the intent
> for members often accessed in fast or slow path.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
>  drivers/net/netkit.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index e3a2445d83fc..96734828bfb8 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -16,18 +16,20 @@
>  #define DRV_NAME "netkit"
>  
>  struct netkit {
> -	/* Needed in fast-path */
> +	__cacheline_group_begin(netkit_fastpath);
>  	struct net_device __rcu *peer;
>  	struct bpf_mprog_entry __rcu *active;
>  	enum netkit_action policy;
>  	enum netkit_scrub scrub;
>  	struct bpf_mprog_bundle	bundle;
> +	__cacheline_group_end(netkit_fastpath);
>  
> -	/* Needed in slow-path */
> +	__cacheline_group_begin(netkit_slowpath);
>  	enum netkit_mode mode;
>  	enum netkit_pairing pair;
>  	bool primary;
>  	u32 headroom;
> +	__cacheline_group_end(netkit_slowpath);
>  };
>  
>  struct netkit_link {

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP
  2025-10-20 16:23 ` [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
@ 2025-10-22 13:12   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 13:12 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> This adds the last missing bit to netkit for supporting io_uring with
> zero-copy mode [0]. Up until this point it was not possible to consume
> the latter out of containers or Kubernetes Pods where applications are
> in their own network namespace.
> 
> Thus, as a last missing bit, implement ndo_queue_get_dma_dev() in netkit
> to return the physical device of the real rxq for DMA. This allows memory
> providers like io_uring zero-copy or devmem to bind to the physically
> mapped rxq in netkit.
> 
> io_uring example with eth0 being a physical device with 16 queues where
> netkit is bound to the last queue, iou-zcrx.c is binary from selftests.
> Flow steering to that queue is based on the service VIP:port of the
> server utilizing io_uring:
> 
>   # ethtool -X eth0 start 0 equal 15
>   # ethtool -X eth0 start 15 equal 1 context new
>   # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
>   # ip netns add foo
>   # ip link add type netkit peer numrxqueues 2
>   # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
>                    --do bind-queue \
>                    --json "{"src-ifindex": $(ifindex eth0), "src-queue-id": 15, \
>                             "dst-ifindex": $(ifindex nk0), "queue-type": "rx"}"
>   {'dst-queue-id': 1}
>   # ip link set nk0 netns foo
>   # ip link set nk1 up
>   # ip netns exec foo ip link set lo up
>   # ip netns exec foo ip link set nk0 up
>   # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
>   [ ... setup routing etc to get external traffic into the netns ... ]
>   # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1
> 
> Remote io_uring client:
> 
>   # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536
> 
> We have tested the above against a Broadcom BCM957504 (bnxt_en)
> 100G NIC, supporting TCP header/data split.
> 
> Similarly, this also works for devmem which we tested using ncdevmem:
> 
>   # ip netns exec foo ./ncdevmem -s 1.2.3.4 -l -p 5000 -f nk0 -t 1 -q 1
> 
> And on the remote client:
> 
>   # ./ncdevmem -s 1.2.3.4 -p 5000 -f eth0
> 
> For Cilium, the plan is to open up support for the various memory providers
> for regular Kubernetes Pods when Cilium is configured with netkit datapath
> mode.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring [0]
> ---
>  drivers/net/netkit.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 75b57496b72e..a281b39a1047 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -282,6 +282,21 @@ static const struct ethtool_ops netkit_ethtool_ops = {
>  	.get_channels		= netkit_get_channels,
>  };
>  
> +static struct device *netkit_queue_get_dma_dev(struct net_device *dev, int idx)
> +{
> +	struct netdev_rx_queue *rxq, *peer_rxq;
> +	unsigned int peer_idx;
> +
> +	rxq = __netif_get_rx_queue(dev, idx);
> +	if (!rxq->peer)
> +		return NULL;
> +
> +	peer_rxq = rxq->peer;
> +	peer_idx = get_netdev_rx_queue_index(peer_rxq);
> +
> +	return netdev_queue_get_dma_dev(peer_rxq->dev, peer_idx);
> +}
> +
>  static int netkit_queue_create(struct net_device *dev)
>  {
>  	struct netkit *nk = netkit_priv(dev);
> @@ -307,7 +322,8 @@ static int netkit_queue_create(struct net_device *dev)
>  }
>  
>  static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
> -	.ndo_queue_create = netkit_queue_create,
> +	.ndo_queue_get_dma_dev		= netkit_queue_get_dma_dev,
> +	.ndo_queue_create		= netkit_queue_create,
>  };
>  
>  static struct net_device *netkit_alloc(struct nlattr *tb[],

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 11/15] netkit: Add single device mode for netkit
  2025-10-20 16:23 ` [PATCH net-next v3 11/15] netkit: Add single device mode for netkit Daniel Borkmann
@ 2025-10-22 13:13   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 13:13 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Add a single device mode for netkit instead of netkit pairs. The primary
> target for the paired devices is to connect network namespaces, of course,
> and support has been implemented in projects like Cilium [0]. For the rxq
> binding the plan is to support two main scenarios related to single device
> mode:
> 
> * For the use-case of io_uring zero-copy, the control plane can either
>   set up a netkit pair where the peer device can perform rxq binding which
>   is then tied to the lifetime of the peer device, or the control plane
>   can use a regular netkit pair to connect the hostns to a Pod/container
>   and dynamically add/remove rxq bindings through a single device without
>   having to interrupt the device pair. In the case of io_uring, the memory
>   pool is used as skb non-linear pages, and thus the skb will go its way
>   through the regular stack into netkit. Things like the netkit policy when
>   no BPF is attached or skb scrubbing etc apply as-is in case the paired
>   devices are used, or if the backend memory is tied to the single device
>   and traffic goes through a paired device.
> 
> * For the use-case of AF_XDP, the control plane needs to use netkit in the
>   single device mode. The single device mode currently enforces only a
>   pass policy when no BPF is attached, and does not yet support BPF link
>   attachments for AF_XDP. skbs sent to that device get dropped at the
>   moment. Given AF_XDP operates at a lower layer of the stack tying this
>   to the netkit pair did not make sense. In future, the plan is to allow
>   BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
>   application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
>   to push selected egress traffic up to the single netkit device to the
>   local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
>   single netkit into the AF_XDP application (e.g. DHCP replies). Also,
>   the control-plane can dynamically add/remove rxq bindings for the single
>   netkit device without having to interrupt (e.g. down/up cycle) the main
>   netkit pair for the Pod which has traffic going in and out.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Reviewed-by: Jordan Rife <jordan@jrife.io>
> Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
> ---
>  drivers/net/netkit.c         | 108 ++++++++++++++++++++++-------------
>  include/uapi/linux/if_link.h |   6 ++
>  2 files changed, 74 insertions(+), 40 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications
  2025-10-20 16:23 ` [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications Daniel Borkmann
@ 2025-10-22 14:27   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 54+ messages in thread
From: Nikolay Aleksandrov @ 2025-10-22 14:27 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/20/25 19:23, Daniel Borkmann wrote:
> Enable support for AF_XDP applications to operate on a netkit device.
> The goal is that AF_XDP applications can natively consume AF_XDP
> from network namespaces. The use-case from Cilium side is to support
> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
> virtual machine management add-on for Kubernetes which aims to provide
> a common ground for virtualization. KubeVirt spawns the VMs inside
> Kubernetes Pods which reside in their own network namespace just like
> regular Pods.
> 
> Raw QEMU AF_XDP backend example with eth0 being a physical device with
> 16 queues where netkit is bound to the last queue (for multi-queue RSS
> context can be used if supported by the driver):
> 
>   # ethtool -X eth0 start 0 equal 15
>   # ethtool -X eth0 start 15 equal 1 context new
>   # ethtool --config-ntuple eth0 flow-type ether \
>             src 00:00:00:00:00:00 \
>             src-mask ff:ff:ff:ff:ff:ff \
>             dst $mac dst-mask 00:00:00:00:00:00 \
>             proto 0 proto-mask 0xffff action 15
>   [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
>   # ip netns add foo
>   # ip link add numrxqueues 2 nk type netkit single
>   # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
>                    --do bind-queue \
>                    --json "{"src-ifindex": $(ifindex eth0), "src-queue-id": 15, \
>                             "dst-ifindex": $(ifindex nk), "queue-type": "rx"}"
>   {'dst-queue-id': 1}
>   # ip link set nk netns foo
>   # ip netns exec foo ip link set lo up
>   # ip netns exec foo ip link set nk up
>   # ip netns exec foo qemu-system-x86_64 \
>           -kernel $kernel \
>           -drive file=${image_name},index=0,media=disk,format=raw \
>           -append "root=/dev/sda rw console=ttyS0" \
>           -cpu host \
>           -m $memory \
>           -enable-kvm \
>           -device virtio-net-pci,netdev=net0,mac=$mac \
>           -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
>           -nographic
> 
> We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
> 100G NIC with successful network connectivity out of QEMU. An earlier
> iteration of this work was presented at LSF/MM/BPF [0].
> 
> For getting to a first starting point to connect all things with
> KubeVirt, bind mounting the xsk map from Cilium into the VM launcher
> Pod which acts as a regular Kubernetes Pod while not perfect, is not
> a big problem given its out of reach from the application sitting
> inside the VM (and some of the control plane aspects are baked in
> the launcher Pod already), so the isolation barrier is still the VM.
> Eventually the goal is to have a XDP/XSK redirect extension where
> there is no need to have the xsk map, and the BPF program can just
> derive the target xsk through the queue where traffic was received
> on.
> 
> The exposure through netkit is because Cilium should not act as a
> proxy handing out xsk sockets. Existing applications expect a netdev
> from kernel side and should not need to rewrite just to implement
> against a CNI's protocol. Also, all the memory should not be accounted
> against Cilium but rather the application Pod itself which is consuming
> AF_XDP. Further, on up/downgrades we expect the data plane to being
> completely decoupled from the control plane; if Cilium would own the
> sockets that would be disruptive. Another use-case which opens up and
> is regularly asked from users would be to have DPDK applications on
> top of AF_XDP in regular Kubernetes Pods.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
> ---
>  drivers/net/netkit.c | 71 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index a281b39a1047..f69abe5ec4cd 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -12,6 +12,7 @@
>  #include <net/netdev_lock.h>
>  #include <net/netdev_queues.h>
>  #include <net/netdev_rx_queue.h>
> +#include <net/xdp_sock_drv.h>
>  #include <net/netkit.h>
>  #include <net/dst.h>
>  #include <net/tcx.h>
> @@ -235,6 +236,71 @@ static void netkit_get_stats(struct net_device *dev,
>  	stats->tx_dropped = DEV_STATS_READ(dev, tx_dropped);
>  }
>  
> +static bool netkit_xsk_supported_at_phys(const struct net_device *dev)
> +{
> +	if (!dev->netdev_ops->ndo_bpf ||
> +	    !dev->netdev_ops->ndo_xdp_xmit ||
> +	    !dev->netdev_ops->ndo_xsk_wakeup)
> +		return false;
> +	if ((dev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK)
> +		return false;
> +	return true;
> +}
> +
> +static int netkit_xsk(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct netdev_bpf xdp_lower;
> +	struct netdev_rx_queue *rxq;
> +	struct net_device *phys;
> +
> +	switch (xdp->command) {
> +	case XDP_SETUP_XSK_POOL:
> +		if (nk->pair == NETKIT_DEVICE_PAIR)
> +			return -EOPNOTSUPP;
> +		if (xdp->xsk.queue_id >= dev->real_num_rx_queues)
> +			return -EINVAL;
> +
> +		rxq = __netif_get_rx_queue(dev, xdp->xsk.queue_id);
> +		if (!rxq->peer)
> +			return -EOPNOTSUPP;
> +
> +		phys = rxq->peer->dev;
> +		if (!netkit_xsk_supported_at_phys(phys))
> +			return -EOPNOTSUPP;
> +
> +		memcpy(&xdp_lower, xdp, sizeof(xdp_lower));
> +		xdp_lower.xsk.queue_id = get_netdev_rx_queue_index(rxq->peer);
> +		break;
> +	case XDP_SETUP_PROG:
> +		return -EPERM;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	return phys->netdev_ops->ndo_bpf(phys, &xdp_lower);
> +}
> +
> +static int netkit_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
> +{
> +	struct netdev_rx_queue *rxq;
> +	struct net_device *phys;
> +
> +	if (queue_id >= dev->real_num_rx_queues)
> +		return -EINVAL;
> +
> +	rxq = __netif_get_rx_queue(dev, queue_id);
> +	if (!rxq->peer)
> +		return -EOPNOTSUPP;
> +
> +	phys = rxq->peer->dev;
> +	if (!netkit_xsk_supported_at_phys(phys))
> +		return -EOPNOTSUPP;
> +
> +	return phys->netdev_ops->ndo_xsk_wakeup(phys,
> +			get_netdev_rx_queue_index(rxq->peer), flags);
> +}
> +
>  static int netkit_init(struct net_device *dev)
>  {
>  	netdev_lockdep_set_classes(dev);
> @@ -255,6 +321,8 @@ static const struct net_device_ops netkit_netdev_ops = {
>  	.ndo_get_peer_dev	= netkit_peer_dev,
>  	.ndo_get_stats64	= netkit_get_stats,
>  	.ndo_uninit		= netkit_uninit,
> +	.ndo_bpf		= netkit_xsk,
> +	.ndo_xsk_wakeup		= netkit_xsk_wakeup,
>  	.ndo_features_check	= passthru_features_check,
>  };
>  
> @@ -409,10 +477,11 @@ static void netkit_setup(struct net_device *dev)
>  	dev->hw_enc_features = netkit_features;
>  	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
>  	dev->vlan_features = dev->features & ~netkit_features_hw_vlan;
> -
>  	dev->needs_free_netdev = true;
>  
>  	netif_set_tso_max_size(dev, GSO_MAX_SIZE);
> +
> +	xdp_set_features_flag(dev, NETDEV_XDP_ACT_XSK);
>  }
>  
>  static struct net *netkit_get_link_net(const struct net_device *dev)

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
  2025-10-22 11:17   ` Nikolay Aleksandrov
@ 2025-10-23 10:17   ` Paolo Abeni
  2025-10-23 12:46     ` Daniel Borkmann
  2025-10-23 10:27   ` Paolo Abeni
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 54+ messages in thread
From: Paolo Abeni @ 2025-10-23 10:17 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, razor, willemb, sdf, john.fastabend, martin.lau,
	jordan, maciej.fijalkowski, magnus.karlsson, dw, toke, yangzhenze,
	wangdongdong.6

On 10/20/25 6:23 PM, Daniel Borkmann wrote:
> +	tmp_rxq = __netif_get_rx_queue(dst_dev, dst_dev->real_num_rx_queues - 1);
> +	if (tmp_rxq->peer && tmp_rxq->peer->dev != src_dev) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Binding multiple queues from difference source devices not supported");
> +		goto err_unlock_src_dev;
> +	}

Why checking a single queue on dst/virtual device? Should the above
check be repeated for all the real_num_rx_queues?

/P


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
  2025-10-22 11:17   ` Nikolay Aleksandrov
  2025-10-23 10:17   ` Paolo Abeni
@ 2025-10-23 10:27   ` Paolo Abeni
  2025-10-23 12:48     ` Daniel Borkmann
  2025-10-24  2:28   ` Jakub Kicinski
  2025-10-24 18:20   ` Stanislav Fomichev
  4 siblings, 1 reply; 54+ messages in thread
From: Paolo Abeni @ 2025-10-23 10:27 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: bpf, kuba, davem, razor, willemb, sdf, john.fastabend, martin.lau,
	jordan, maciej.fijalkowski, magnus.karlsson, dw, toke, yangzhenze,
	wangdongdong.6

On 10/20/25 6:23 PM, Daniel Borkmann wrote:
> +	if (!src_dev->dev.parent) {
> +		err = -EOPNOTSUPP;
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Source device is a virtual device");
> +		goto err_unlock_src_dev;
> +	}

Is this check strictly needed? I think that if we relax it, it could be
simpler to create all-virtual selftests.

/P


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-23 10:17   ` Paolo Abeni
@ 2025-10-23 12:46     ` Daniel Borkmann
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-23 12:46 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: bpf, kuba, davem, razor, willemb, sdf, john.fastabend, martin.lau,
	jordan, maciej.fijalkowski, magnus.karlsson, dw, toke, yangzhenze,
	wangdongdong.6

On 10/23/25 12:17 PM, Paolo Abeni wrote:
> On 10/20/25 6:23 PM, Daniel Borkmann wrote:
>> +	tmp_rxq = __netif_get_rx_queue(dst_dev, dst_dev->real_num_rx_queues - 1);
>> +	if (tmp_rxq->peer && tmp_rxq->peer->dev != src_dev) {
>> +		err = -EOPNOTSUPP;
>> +		NL_SET_ERR_MSG(info->extack,
>> +			       "Binding multiple queues from difference source devices not supported");
>> +		goto err_unlock_src_dev;
>> +	}
> 
> Why checking a single queue on dst/virtual device? Should the above
> check be repeated for all the real_num_rx_queues?
We could, but this is actually already enough. For example, initially the device
has no phys binding at all (tmp_rxq->peer is NULL). Then we bind the first time.
In the second request, we reject tmp_rxq->peer->dev != src_dev, and enforce
binding to the same device again as previously, and then it repeats for subsequent
bind requests.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-23 10:27   ` Paolo Abeni
@ 2025-10-23 12:48     ` Daniel Borkmann
  2025-10-24  2:08       ` Jakub Kicinski
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-23 12:48 UTC (permalink / raw)
  To: Paolo Abeni, netdev
  Cc: bpf, kuba, davem, razor, willemb, sdf, john.fastabend, martin.lau,
	jordan, maciej.fijalkowski, magnus.karlsson, dw, toke, yangzhenze,
	wangdongdong.6

On 10/23/25 12:27 PM, Paolo Abeni wrote:
> On 10/20/25 6:23 PM, Daniel Borkmann wrote:
>> +	if (!src_dev->dev.parent) {
>> +		err = -EOPNOTSUPP;
>> +		NL_SET_ERR_MSG(info->extack,
>> +			       "Source device is a virtual device");
>> +		goto err_unlock_src_dev;
>> +	}
> 
> Is this check strictly needed? I think that if we relax it, it could be
> simpler to create all-virtual selftests.
It is needed given we need to always ensure lock ordering for the two devices,
that is, the order is always from the virtual to the physical device.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-23 12:48     ` Daniel Borkmann
@ 2025-10-24  2:08       ` Jakub Kicinski
  2025-10-28 21:59         ` David Wei
  0 siblings, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-24  2:08 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Paolo Abeni, netdev, bpf, davem, razor, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On Thu, 23 Oct 2025 14:48:15 +0200 Daniel Borkmann wrote:
> On 10/23/25 12:27 PM, Paolo Abeni wrote:
> > On 10/20/25 6:23 PM, Daniel Borkmann wrote:  
> >> +	if (!src_dev->dev.parent) {
> >> +		err = -EOPNOTSUPP;
> >> +		NL_SET_ERR_MSG(info->extack,
> >> +			       "Source device is a virtual device");
> >> +		goto err_unlock_src_dev;
> >> +	}  
> > 
> > Is this check strictly needed? I think that if we relax it, it could be
> > simpler to create all-virtual selftests.  
> It is needed given we need to always ensure lock ordering for the two devices,
> that is, the order is always from the virtual to the physical device.

You do seem to be taking the lock before you check if the device was
the type you expected tho.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
  2025-10-22 11:19   ` Nikolay Aleksandrov
@ 2025-10-24  2:12   ` Jakub Kicinski
  2025-10-24 10:15     ` Daniel Borkmann
  1 sibling, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-24  2:12 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On Mon, 20 Oct 2025 18:23:41 +0200 Daniel Borkmann wrote:
> +      name: bind-queue
> +      doc: |
> +        Bind a physical netdevice queue to a virtual one. The binding
> +        creates a queue pair, where a queue can reference its peer queue.
> +        This is useful for memory providers and AF_XDP operations which
> +        take an ifindex and queue id to allow auch applications to bind
> +        against virtual devices in containers.
> +      attribute-set: queue-pair

      flags: [admin-perm]

right?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
                     ` (2 preceding siblings ...)
  2025-10-23 10:27   ` Paolo Abeni
@ 2025-10-24  2:28   ` Jakub Kicinski
  2025-10-28 22:41     ` David Wei
  2025-10-24 18:20   ` Stanislav Fomichev
  4 siblings, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-24  2:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On Mon, 20 Oct 2025 18:23:42 +0200 Daniel Borkmann wrote:
> +void netdev_rx_queue_peer(struct net_device *src_dev,
> +			  struct netdev_rx_queue *src_rxq,
> +			  struct netdev_rx_queue *dst_rxq)
> +{
> +	netdev_assert_locked(src_dev);
> +	netdev_assert_locked(dst_rxq->dev);
> +
> +	netdev_hold(src_dev, &src_rxq->dev_tracker, GFP_KERNEL);

Isn't ->dev_tracker already used by sysfs?

Are you handling the underlying device going away?

> +	__netdev_rx_queue_peer(src_rxq, dst_rxq);
> +}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-20 16:23 ` [PATCH net-next v3 03/15] net: Add peer info to queue-get response Daniel Borkmann
  2025-10-22 11:23   ` Nikolay Aleksandrov
@ 2025-10-24  2:33   ` Jakub Kicinski
  2025-10-24 12:59     ` Daniel Borkmann
  1 sibling, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-24  2:33 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On Mon, 20 Oct 2025 18:23:43 +0200 Daniel Borkmann wrote:
> Add a nested peer field to the queue-get response that returns the peered
> ifindex and queue id.
> 
> Example with ynl client:
> 
>   # ip netns exec foo ./pyynl/cli.py \
>       --spec ~/netlink/specs/netdev.yaml \
>       --do queue-get \
>       --json '{"ifindex": 3, "id": 1, "type": "rx"}'
>   {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}

I'm struggling with the roles of what is src and dst and peer :(
No great suggestion off the top of my head but better terms would 
make this much easier to review.

The example seems to be from the container side. Do we need to show peer
info on the container side? Not just on the host side?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-24  2:12   ` Jakub Kicinski
@ 2025-10-24 10:15     ` Daniel Borkmann
  2025-10-24 18:11       ` Stanislav Fomichev
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-24 10:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/24/25 4:12 AM, Jakub Kicinski wrote:
> On Mon, 20 Oct 2025 18:23:41 +0200 Daniel Borkmann wrote:
>> +      name: bind-queue
>> +      doc: |
>> +        Bind a physical netdevice queue to a virtual one. The binding
>> +        creates a queue pair, where a queue can reference its peer queue.
>> +        This is useful for memory providers and AF_XDP operations which
>> +        take an ifindex and queue id to allow auch applications to bind
>> +        against virtual devices in containers.
>> +      attribute-set: queue-pair
> 
>        flags: [admin-perm]
> 
> right?
Oh, yes good catch! I've just checked for other instances in that file, don't
we also need the same flag for bind-tx? bind-rx for example has it, only the
info dumps don't. I can cook a patch for net

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e1735b486222..59c71a76b26f 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -827,6 +827,7 @@ operations:
        name: bind-tx
        doc: Bind dmabuf to netdev for TX
        attribute-set: dmabuf
+      flags: [admin-perm]
        do:
          request:
            attributes:
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index 8a973bc5588a..3f044b864ff7 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -220,7 +220,7 @@ static const struct genl_split_ops netdev_nl_ops[] = {
                 .doit           = netdev_nl_bind_tx_doit,
                 .policy         = netdev_bind_tx_nl_policy,
                 .maxattr        = NETDEV_A_DMABUF_FD,
-               .flags          = GENL_CMD_CAP_DO,
+               .flags          = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
         },
         {
                 .cmd            = NETDEV_CMD_BIND_QUEUE,


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-24  2:33   ` Jakub Kicinski
@ 2025-10-24 12:59     ` Daniel Borkmann
  2025-10-24 23:18       ` Jakub Kicinski
  0 siblings, 1 reply; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-24 12:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On 10/24/25 4:33 AM, Jakub Kicinski wrote:
> On Mon, 20 Oct 2025 18:23:43 +0200 Daniel Borkmann wrote:
>> Add a nested peer field to the queue-get response that returns the peered
>> ifindex and queue id.
>>
>> Example with ynl client:
>>
>>    # ip netns exec foo ./pyynl/cli.py \
>>        --spec ~/netlink/specs/netdev.yaml \
>>        --do queue-get \
>>        --json '{"ifindex": 3, "id": 1, "type": "rx"}'
>>    {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}
> 
> I'm struggling with the roles of what is src and dst and peer :(
> No great suggestion off the top of my head but better terms would
> make this much easier to review.
> 
> The example seems to be from the container side. Do we need to show peer
> info on the container side? Not just on the host side?

I think up to us which side we want to show. My thinking was to allow user
introspection from both, but we don't have to. Right now the above example
was from the container side, but technically it could be either side depending
in which netns the phys dev would be located.

The user knows which is which based on the ifindex passed to the queue-get
query: if the ifindex is from a virtual device (e.g. netkit type), then the
'peer' section shows the phys dev, and vice versa, if the ifindex is from a
phys device (say, mlx5), then the 'peer' section shows the virtual one.

Maybe I'll provide a better more in-depth example with both sides and above
explanation in the commit msg for v4..

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-24 10:15     ` Daniel Borkmann
@ 2025-10-24 18:11       ` Stanislav Fomichev
  2025-10-24 19:17         ` Daniel Borkmann
  0 siblings, 1 reply; 54+ messages in thread
From: Stanislav Fomichev @ 2025-10-24 18:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jakub Kicinski, netdev, bpf, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On 10/24, Daniel Borkmann wrote:
> On 10/24/25 4:12 AM, Jakub Kicinski wrote:
> > On Mon, 20 Oct 2025 18:23:41 +0200 Daniel Borkmann wrote:
> > > +      name: bind-queue
> > > +      doc: |
> > > +        Bind a physical netdevice queue to a virtual one. The binding
> > > +        creates a queue pair, where a queue can reference its peer queue.
> > > +        This is useful for memory providers and AF_XDP operations which
> > > +        take an ifindex and queue id to allow auch applications to bind
> > > +        against virtual devices in containers.
> > > +      attribute-set: queue-pair
> > 
> >        flags: [admin-perm]
> > 
> > right?
> Oh, yes good catch! I've just checked for other instances in that file, don't
> we also need the same flag for bind-tx? bind-rx for example has it, only the
> info dumps don't. I can cook a patch for net

IIRC, TX side was non-admin-perm by design (because it only references the
binding for tx and doesn't need any heavy device setup).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
                     ` (3 preceding siblings ...)
  2025-10-24  2:28   ` Jakub Kicinski
@ 2025-10-24 18:20   ` Stanislav Fomichev
  2025-10-24 19:15     ` Daniel Borkmann
  4 siblings, 1 reply; 54+ messages in thread
From: Stanislav Fomichev @ 2025-10-24 18:20 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On 10/20, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> Implement netdev_nl_bind_queue_doit() that creates an rx queue in a
> virtual netdev and then binds it to an rxq in a real netdev to create
> a queue pair.
> 
> Example with ynl client:
> 
>   # ./pyynl/cli.py \
>       --spec ~/netlink/specs/netdev.yaml \
>       --do bind-queue \
>       --json '{"src-ifindex": 4, "src-queue-id": 15, "dst-ifindex": 8, "queue-type": "rx"}'
>   {'dst-queue-id': 1}
> 
> Note that the netdevice locking order is always from the virtual to
> the physical device.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/net/netdev_queues.h   |   5 ++
>  include/net/netdev_rx_queue.h |  36 ++++++++-
>  net/core/netdev-genl.c        | 141 +++++++++++++++++++++++++++++++++-
>  net/core/netdev_rx_queue.c    |  61 +++++++++++++++
>  4 files changed, 240 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index cd00e0406cf4..286d5edce07d 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -130,6 +130,10 @@ void netdev_stat_queue_sum(struct net_device *netdev,
>   * @ndo_queue_get_dma_dev: Get dma device for zero-copy operations to be used
>   *			   for this queue. Return NULL on error.
>   *
> + * @ndo_queue_create: Create a new RX queue which can be bound to another queue.
> + *		      Ops on this queue are redirected to the peer queue e.g.
> + *		      when opening a memory provider.
> + *
>   * Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
>   * the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
>   * be called for an interface which is open.
> @@ -149,6 +153,7 @@ struct netdev_queue_mgmt_ops {
>  						  int idx);
>  	struct device *		(*ndo_queue_get_dma_dev)(struct net_device *dev,
>  							 int idx);
> +	int			(*ndo_queue_create)(struct net_device *dev);
>  };
>  
>  bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
> diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
> index 8cdcd138b33f..db3ef94c0744 100644
> --- a/include/net/netdev_rx_queue.h
> +++ b/include/net/netdev_rx_queue.h
> @@ -28,6 +28,7 @@ struct netdev_rx_queue {
>  #endif
>  	struct napi_struct		*napi;
>  	struct pp_memory_provider_params mp_params;
> +	struct netdev_rx_queue		*peer;
>  } ____cacheline_aligned_in_smp;
>  
>  /*
> @@ -56,6 +57,37 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
>  	return index;
>  }
>  
> -int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
> +static inline void __netdev_rx_queue_peer(struct netdev_rx_queue *src_rxq,
> +					  struct netdev_rx_queue *dst_rxq)
> +{
> +	src_rxq->peer = dst_rxq;
> +	dst_rxq->peer = src_rxq;
> +}
>  
> -#endif
> +static inline void __netdev_rx_queue_unpeer(struct netdev_rx_queue *src_rxq,
> +					    struct netdev_rx_queue *dst_rxq)
> +{
> +	src_rxq->peer = NULL;
> +	dst_rxq->peer = NULL;
> +}
> +
> +static inline bool netdev_rx_queue_peered(struct net_device *dev,
> +					  u16 queue_id)
> +{
> +	if (queue_id < dev->real_num_rx_queues)
> +		return dev->_rx[queue_id].peer;
> +	return false;
> +}
> +
> +void netdev_rx_queue_peer(struct net_device *src_dev,
> +			  struct netdev_rx_queue *src_rxq,
> +			  struct netdev_rx_queue *dst_rxq);
> +void netdev_rx_queue_unpeer(struct net_device *src_dev,
> +			    struct netdev_rx_queue *src_rxq,
> +			    struct netdev_rx_queue *dst_rxq);
> +int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
> +struct netdev_rx_queue *
> +netif_get_rx_queue_peer_locked(struct net_device **dev,
> +			       unsigned int *rxq_idx,
> +			       bool *needs_unlock);
> +#endif /* _LINUX_NETDEV_RX_QUEUE_H */
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index ce1018ea390f..579469abac8c 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -1122,7 +1122,146 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>  
>  int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
>  {
> -	return -EOPNOTSUPP;
> +	u32 src_ifidx, src_qid, dst_ifidx, dst_qid, q_type;
> +	struct netdev_rx_queue *src_rxq, *dst_rxq, *tmp_rxq;
> +	struct net_device *src_dev, *dst_dev;
> +	struct sk_buff *rsp;
> +	int err = 0;
> +	void *hdr;
> +
> +	if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_QUEUE_TYPE) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_IFINDEX) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID) ||
> +	    GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_DST_IFINDEX))
> +		return -EINVAL;
> +
> +	src_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX]);
> +	src_qid = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID]);
> +	dst_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_DST_IFINDEX]);
> +	q_type = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_QUEUE_TYPE]);
> +
> +	if (q_type != NETDEV_QUEUE_TYPE_RX) {
> +		NL_SET_ERR_MSG(info->extack, "Only binding of RX queue supported");
> +		return -EOPNOTSUPP;
> +	}
> +	if (dst_ifidx == src_ifidx) {
> +		NL_SET_ERR_MSG(info->extack,
> +			       "Destination device cannot be the same as source device");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!rsp)
> +		return -ENOMEM;
> +
> +	hdr = genlmsg_iput(rsp, info);
> +	if (!hdr) {
> +		err = -EMSGSIZE;
> +		goto err_genlmsg_free;
> +	}

[..]

> +	/* Locking order is always from the virtual to the physical device
> +	 * since this is also the same order when applications open the
> +	 * memory provider later on.
> +	 */
> +	dst_dev = netdev_get_by_index_lock(genl_info_net(info), dst_ifidx);
> +	if (!dst_dev) {
> +		err = -ENODEV;
> +		goto err_genlmsg_free;
> +	}

...

> +	src_dev = netdev_get_by_index_lock(genl_info_net(info), src_ifidx);
> +	if (!src_dev) {
> +		err = -ENODEV;
> +		goto err_unlock_dst_dev;
> +	}

But isn't the above susceptible to ABBA exploitation from the userspace?
I can try to concurrently do two requests, the second one being with
dst_dev and src_dev swapped. Or do we assume that we exit earlier for
the swapped case based on some other condition?

* Re: [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues
  2025-10-20 16:23 ` [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
  2025-10-22 12:50   ` Nikolay Aleksandrov
@ 2025-10-24 18:36   ` Stanislav Fomichev
  2025-10-29  2:07     ` David Wei
  1 sibling, 1 reply; 54+ messages in thread
From: Stanislav Fomichev @ 2025-10-24 18:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On 10/20, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
> 
> When a process in a container wants to set up a memory provider, it will
> use the virtual netdev and a mapped rxq, and call net_mp_{open,close}_rxq
> to try and restart the queue. At this point, proxy the queue restart on
> the real rxq in the physical netdev.
> 
> For memory providers (io_uring zero-copy rx and devmem), it causes the
> real rxq in the physical netdev to be filled from a memory provider that
> has DMA mapped memory from a process within a container.
> 
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/net/page_pool/memory_provider.h |  4 +-
>  net/core/netdev_rx_queue.c              | 57 +++++++++++++++++--------
>  2 files changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
> index ada4f968960a..b6f811c3416b 100644
> --- a/include/net/page_pool/memory_provider.h
> +++ b/include/net/page_pool/memory_provider.h
> @@ -23,12 +23,12 @@ bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
>  void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
>  void net_mp_niov_clear_page_pool(struct net_iov *niov);
>  
> -int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
> +int net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>  		    struct pp_memory_provider_params *p);
>  int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>  		      const struct pp_memory_provider_params *p,
>  		      struct netlink_ext_ack *extack);
> -void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
> +void net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
>  		      struct pp_memory_provider_params *old_p);
>  void __net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
>  			const struct pp_memory_provider_params *old_p);
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index 8ee289316c06..b4ff3497e086 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -170,48 +170,63 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>  		      struct netlink_ext_ack *extack)
>  {
>  	struct netdev_rx_queue *rxq;
> +	bool needs_unlock = false;
>  	int ret;
>  
>  	if (!netdev_need_ops_lock(dev))
>  		return -EOPNOTSUPP;
> -
>  	if (rxq_idx >= dev->real_num_rx_queues) {
>  		NL_SET_ERR_MSG(extack, "rx queue index out of range");
>  		return -ERANGE;
>  	}
> -	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
>  
> +	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
> +	rxq = netif_get_rx_queue_peer_locked(&dev, &rxq_idx, &needs_unlock);
> +	if (!rxq) {
> +		NL_SET_ERR_MSG(extack, "rx queue peered to a virtual netdev");
> +		return -EBUSY;
> +	}
> +	if (!dev->dev.parent) {
> +		NL_SET_ERR_MSG(extack, "rx queue is mapped to a virtual netdev");
> +		ret = -EBUSY;
> +		goto out;
> +	}
>  	if (dev->cfg->hds_config != ETHTOOL_TCP_DATA_SPLIT_ENABLED) {
>  		NL_SET_ERR_MSG(extack, "tcp-data-split is disabled");
> -		return -EINVAL;
> +		ret = -EINVAL;
> +		goto out;
>  	}
>  	if (dev->cfg->hds_thresh) {
>  		NL_SET_ERR_MSG(extack, "hds-thresh is not zero");
> -		return -EINVAL;
> +		ret = -EINVAL;
> +		goto out;
>  	}
>  	if (dev_xdp_prog_count(dev)) {
>  		NL_SET_ERR_MSG(extack, "unable to custom memory provider to device with XDP program attached");
> -		return -EEXIST;
> +		ret = -EEXIST;
> +		goto out;
>  	}
> -
> -	rxq = __netif_get_rx_queue(dev, rxq_idx);
>  	if (rxq->mp_params.mp_ops) {
>  		NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
> -		return -EEXIST;
> +		ret = -EEXIST;
> +		goto out;
>  	}
>  #ifdef CONFIG_XDP_SOCKETS
>  	if (rxq->pool) {
>  		NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
> -		return -EBUSY;
> +		ret = -EBUSY;
> +		goto out;
>  	}
>  #endif
> -
>  	rxq->mp_params = *p;
>  	ret = netdev_rx_queue_restart(dev, rxq_idx);
>  	if (ret) {
>  		rxq->mp_params.mp_ops = NULL;
>  		rxq->mp_params.mp_priv = NULL;
>  	}
> +out:
> +	if (needs_unlock)
> +		netdev_unlock(dev);

Can we do something better than needs_unlock flag? Maybe something like the
following?

netif_put_rx_queue_peer_locked(orig_dev, dev)
{
	if (orig_dev != dev)
		netdev_unlock(dev);
}

Then we can do:

orig_dev = dev;
rxq = netif_get_rx_queue_peer_locked(&dev, &rx_idx);
...
netif_put_rx_queue_peer_locked(orig_dev, dev);

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-24 18:20   ` Stanislav Fomichev
@ 2025-10-24 19:15     ` Daniel Borkmann
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-24 19:15 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On 10/24/25 8:20 PM, Stanislav Fomichev wrote:
[...]
>> +	/* Locking order is always from the virtual to the physical device
>> +	 * since this is also the same order when applications open the
>> +	 * memory provider later on.
>> +	 */
>> +	dst_dev = netdev_get_by_index_lock(genl_info_net(info), dst_ifidx);
>> +	if (!dst_dev) {
>> +		err = -ENODEV;
>> +		goto err_genlmsg_free;
>> +	}
> 
> ...
> 
>> +	src_dev = netdev_get_by_index_lock(genl_info_net(info), src_ifidx);
>> +	if (!src_dev) {
>> +		err = -ENODEV;
>> +		goto err_unlock_dst_dev;
>> +	}
> 
> But isn't the above susceptible to ABBA exploitation from the userspace?
> I can try to concurrently do two requests, the second one being with
> dst_dev and src_dev swapped. Or do we assume that we exit earlier for
> the swapped case based on some other condition?

Hm, in all of the locking that was reworked, we only ever let the case
succeed and lock both devices when the dst_dev is a virtual device and the
src_dev is a phys device. If that is not given and the dst_dev is a phys
device, we error out via err_unlock_dst_dev, unlocking dst_dev again, and
never proceed further to attempt to lock src_dev (as mentioned in the
comment). So basically, in the swapped case you mention, both devs can
never be locked.

Thanks,
Daniel

* Re: [PATCH net-next v3 01/15] net: Add bind-queue operation
  2025-10-24 18:11       ` Stanislav Fomichev
@ 2025-10-24 19:17         ` Daniel Borkmann
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-24 19:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jakub Kicinski, netdev, bpf, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6

On 10/24/25 8:11 PM, Stanislav Fomichev wrote:
> On 10/24, Daniel Borkmann wrote:
>> On 10/24/25 4:12 AM, Jakub Kicinski wrote:
>>> On Mon, 20 Oct 2025 18:23:41 +0200 Daniel Borkmann wrote:
>>>> +      name: bind-queue
>>>> +      doc: |
>>>> +        Bind a physical netdevice queue to a virtual one. The binding
>>>> +        creates a queue pair, where a queue can reference its peer queue.
>>>> +        This is useful for memory providers and AF_XDP operations which
>>>> +        take an ifindex and queue id to allow such applications to bind
>>>> +        against virtual devices in containers.
>>>> +      attribute-set: queue-pair
>>>
>>>         flags: [admin-perm]
>>>
>>> right?
>> Oh, yes, good catch! I've just checked for other instances in that file:
>> don't we also need the same flag for bind-tx? bind-rx, for example, has it;
>> only the info dumps don't. I can cook a patch for net.
> 
> IIRC, TX side was non-admin-perm by design (because it only references the
> binding for tx and doesn't need any heavy device setup).

Ah perfect, thanks for clarifying!

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-24 12:59     ` Daniel Borkmann
@ 2025-10-24 23:18       ` Jakub Kicinski
  2025-10-29  2:08         ` David Wei
  0 siblings, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-24 23:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6

On Fri, 24 Oct 2025 14:59:39 +0200 Daniel Borkmann wrote:
> On 10/24/25 4:33 AM, Jakub Kicinski wrote:
> > On Mon, 20 Oct 2025 18:23:43 +0200 Daniel Borkmann wrote:  
> >> Add a nested peer field to the queue-get response that returns the peered
> >> ifindex and queue id.
> >>
> >> Example with ynl client:
> >>
> >>    # ip netns exec foo ./pyynl/cli.py \
> >>        --spec ~/netlink/specs/netdev.yaml \
> >>        --do queue-get \
> >>        --json '{"ifindex": 3, "id": 1, "type": "rx"}'
> >>    {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}  
> > 
> > I'm struggling with the roles of what is src and dst and peer :(
> > No great suggestion off the top of my head but better terms would
> > make this much easier to review.
> > 
> > The example seems to be from the container side. Do we need to show peer
> > info on the container side? Not just on the host side?  
> 
> I think it's up to us which side we want to show. My thinking was to allow
> user introspection from both, but we don't have to. Right now the above
> example was from the container side, but technically it could be either side
> depending on which netns the phys dev is located in.
> 
> The user knows which is which based on the ifindex passed to the queue-get
> query: if the ifindex is from a virtual device (e.g. netkit type), then the
> 'peer' section shows the phys dev, and vice versa, if the ifindex is from a
> phys device (say, mlx5), then the 'peer' section shows the virtual one.
> 
> Maybe I'll provide a better, more in-depth example with both sides and the
> above explanation in the commit msg for v4.

Yes, FWIW my mental model is that "leaking" host information into the
container is best avoided. Not a problem, but shouldn't be done without
a clear reason.
The typical debug scenario can be covered from the host side (container X
is having issues with queue Y, dump all the queues, find out which one
is bound to X/Y).
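
E.g. something roughly along these lines from the host netns
(illustrative output only, reusing the ids from the example upthread):

  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
      --dump queue-get --json '{"ifindex": 4}'
  [{'id': 15, 'ifindex': 4, 'peer': {'id': 1, 'ifindex': 3}, 'type': 'rx'},
   ...]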

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-24  2:08       ` Jakub Kicinski
@ 2025-10-28 21:59         ` David Wei
  2025-10-28 23:44           ` Jakub Kicinski
  0 siblings, 1 reply; 54+ messages in thread
From: David Wei @ 2025-10-28 21:59 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: Paolo Abeni, netdev, bpf, davem, razor, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6

On 2025-10-23 19:08, Jakub Kicinski wrote:
> On Thu, 23 Oct 2025 14:48:15 +0200 Daniel Borkmann wrote:
>> On 10/23/25 12:27 PM, Paolo Abeni wrote:
>>> On 10/20/25 6:23 PM, Daniel Borkmann wrote:
>>>> +	if (!src_dev->dev.parent) {
>>>> +		err = -EOPNOTSUPP;
>>>> +		NL_SET_ERR_MSG(info->extack,
>>>> +			       "Source device is a virtual device");
>>>> +		goto err_unlock_src_dev;
>>>> +	}
>>>
>>> Is this check strictly needed? I think that if we relax it, it could be
>>> simpler to create all-virtual selftests.
>> It is needed given we need to always ensure lock ordering for the two devices,
>> that is, the order is always from the virtual to the physical device.
> 
> You do seem to be taking the lock before you check if the device was
> the type you expected tho.

I believe this is okay. Let's say we have two netdevs, A that is real
and B that is virtual. The user calls netdev_nl_bind_queue_doit() twice
in two different contexts: 1 with the correct order (A as src, B as dst)
and 2 with the incorrect order (B as src, A as dst). We always try to
lock dst first, then src.

         1                 2
lock(dst == B)
                   lock(dst == A)
                   is not virtual...
                   unlock(A)
lock(src == A)


         1                 2
                   lock(dst == A)
lock(dst == B)
                   is not virtual...
                   unlock(A)
lock(src == A)

The check will prevent ABBA by never taking that final lock to complete
the cycle. Please check and lmk if I'm off, stuff like this makes my
brain hurt.
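
To make that concrete, the shape of the handler is roughly the below
(simplified sketch pieced together from the fragments quoted upthread;
error labels and the remaining checks elided):

	/* Lock dst first; it must be the virtual device. A request
	 * with the roles swapped bails out right after the first
	 * lock, so the cross ordering needed for ABBA never forms.
	 */
	dst_dev = netdev_get_by_index_lock(net, dst_ifidx);
	if (!dst_dev)
		return -ENODEV;
	if (dst_dev->dev.parent) {
		/* dst is physical -> bail out before touching src */
		err = -EOPNOTSUPP;
		goto err_unlock_dst_dev;
	}
	src_dev = netdev_get_by_index_lock(net, src_ifidx);
	if (!src_dev) {
		err = -ENODEV;
		goto err_unlock_dst_dev;
	}
	if (!src_dev->dev.parent) {
		/* src is virtual -> bail out as well */
		err = -EOPNOTSUPP;
		goto err_unlock_src_dev;
	}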

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-24  2:28   ` Jakub Kicinski
@ 2025-10-28 22:41     ` David Wei
  2025-10-29 16:46       ` Daniel Borkmann
  0 siblings, 1 reply; 54+ messages in thread
From: David Wei @ 2025-10-28 22:41 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, toke,
	yangzhenze, wangdongdong.6

On 2025-10-23 19:28, Jakub Kicinski wrote:
> On Mon, 20 Oct 2025 18:23:42 +0200 Daniel Borkmann wrote:
>> +void netdev_rx_queue_peer(struct net_device *src_dev,
>> +			  struct netdev_rx_queue *src_rxq,
>> +			  struct netdev_rx_queue *dst_rxq)
>> +{
>> +	netdev_assert_locked(src_dev);
>> +	netdev_assert_locked(dst_rxq->dev);
>> +
>> +	netdev_hold(src_dev, &src_rxq->dev_tracker, GFP_KERNEL);
> 
> Isn't ->dev_tracker already used by sysfs?

You're right, it is. Can netdevice_tracker not be shared?

> 
> Are you handling the underlying device going away?

Ah, good point, no, we're not handling that right now. Reading the code,
and intuitively, it doesn't look like holding the netdev refcount will
prevent something like unplugging the device...

I take it an unregistration notifier e.g. xsk_notifier() is the way to
handle it?
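
Something roughly like this, I imagine (untested sketch;
netdev_rx_queue_unpeer_all is a made-up helper):

	static int queue_peer_netdev_event(struct notifier_block *nb,
					   unsigned long event, void *ptr)
	{
		struct net_device *dev = netdev_notifier_info_to_dev(ptr);

		if (event == NETDEV_UNREGISTER) {
			/* Sever any peering involving dev's rxqs and
			 * drop the reference taken via netdev_hold().
			 */
			netdev_rx_queue_unpeer_all(dev);
		}
		return NOTIFY_DONE;
	}

	static struct notifier_block queue_peer_notifier = {
		.notifier_call = queue_peer_netdev_event,
	};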

> 
>> +	__netdev_rx_queue_peer(src_rxq, dst_rxq);
>> +}

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-28 21:59         ` David Wei
@ 2025-10-28 23:44           ` Jakub Kicinski
  2025-10-29  0:38             ` David Wei
  0 siblings, 1 reply; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-28 23:44 UTC (permalink / raw)
  To: David Wei
  Cc: Daniel Borkmann, Paolo Abeni, netdev, bpf, davem, razor, willemb,
	sdf, john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6

On Tue, 28 Oct 2025 14:59:05 -0700 David Wei wrote:
> On 2025-10-23 19:08, Jakub Kicinski wrote:
> > On Thu, 23 Oct 2025 14:48:15 +0200 Daniel Borkmann wrote:  
> >> It is needed given we need to always ensure lock ordering for the two devices,
> >> that is, the order is always from the virtual to the physical device.  
> > 
> > You do seem to be taking the lock before you check if the device was
> > the type you expected tho.  
> 
> I believe this is okay. Let's say we have two netdevs, A that is real
> and B that is virtual. 

Now imagine they are both virtual.

> User calls netdev_nl_bind_queue_doit() twice in
> two different contexts, 1 with the correct order (A as src, B as dst)
> and 2 with the incorrect order (B as src, A as dst). We always try to
> lock dst first, then src.
> 
>          1                 2
> lock(dst == B)
>                    lock(dst == A)
>                    is not virtual...
>                    unlock(A)
> lock(src == A)
> 
> 
>          1                 2
>                    lock(dst == A)
> lock(dst == B)
>                    is not virtual...
>                    unlock(A)
> lock(src == A)
> 
> The check will prevent ABBA by never taking that final lock to complete
> the cycle. Please check and lmk if I'm off, stuff like this makes my
> brain hurt.


* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-28 23:44           ` Jakub Kicinski
@ 2025-10-29  0:38             ` David Wei
  0 siblings, 0 replies; 54+ messages in thread
From: David Wei @ 2025-10-29  0:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Paolo Abeni, netdev, bpf, davem, razor, willemb,
	sdf, john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6

On 2025-10-28 16:44, Jakub Kicinski wrote:
> On Tue, 28 Oct 2025 14:59:05 -0700 David Wei wrote:
>> On 2025-10-23 19:08, Jakub Kicinski wrote:
>>> On Thu, 23 Oct 2025 14:48:15 +0200 Daniel Borkmann wrote:
>>>> It is needed given we need to always ensure lock ordering for the two devices,
>>>> that is, the order is always from the virtual to the physical device.
>>>
>>> You do seem to be taking the lock before you check if the device was
>>> the type you expected tho.
>>
>> I believe this is okay. Let's say we have two netdevs, A that is real
>> and B that is virtual.
> 
> Now imagine they are both virtual.

:facepalm: Yes, you're right, I hadn't considered this case. I'll check
if it's safe to access netdev->dev without holding the instance lock,
and if not, go back to locking both netdevs in a deterministic order.
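
If the latter, probably via the usual trick of taking both locks in a
stable order (untested sketch, helper name made up, lockdep nesting
annotations omitted):

	static void netdev_lock_two(struct net_device *a,
				    struct net_device *b)
	{
		/* Order on something stable like ifindex so that two
		 * concurrent requests with src/dst swapped can never
		 * deadlock; equal ifindexes are rejected earlier.
		 */
		if (a->ifindex > b->ifindex)
			swap(a, b);
		netdev_lock(a);
		netdev_lock(b);
	}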

* Re: [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues
  2025-10-24 18:36   ` Stanislav Fomichev
@ 2025-10-29  2:07     ` David Wei
  0 siblings, 0 replies; 54+ messages in thread
From: David Wei @ 2025-10-29  2:07 UTC (permalink / raw)
  To: Stanislav Fomichev, Daniel Borkmann
  Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6

On 2025-10-24 11:36, Stanislav Fomichev wrote:
> On 10/20, Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> When a process in a container wants to set up a memory provider, it will
>> use the virtual netdev and a mapped rxq, and call net_mp_{open,close}_rxq
>> to try and restart the queue. At this point, proxy the queue restart on
>> the real rxq in the physical netdev.
>>
>> For memory providers (io_uring zero-copy rx and devmem), it causes the
>> real rxq in the physical netdev to be filled from a memory provider that
>> has DMA mapped memory from a process within a container.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>>   include/net/page_pool/memory_provider.h |  4 +-
>>   net/core/netdev_rx_queue.c              | 57 +++++++++++++++++--------
>>   2 files changed, 41 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
>> index ada4f968960a..b6f811c3416b 100644
>> --- a/include/net/page_pool/memory_provider.h
>> +++ b/include/net/page_pool/memory_provider.h
>> @@ -23,12 +23,12 @@ bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
>>   void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
>>   void net_mp_niov_clear_page_pool(struct net_iov *niov);
>>   
>> -int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
>> +int net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>>   		    struct pp_memory_provider_params *p);
>>   int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>>   		      const struct pp_memory_provider_params *p,
>>   		      struct netlink_ext_ack *extack);
>> -void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
>> +void net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
>>   		      struct pp_memory_provider_params *old_p);
>>   void __net_mp_close_rxq(struct net_device *dev, unsigned int rxq_idx,
>>   			const struct pp_memory_provider_params *old_p);
>> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
>> index 8ee289316c06..b4ff3497e086 100644
>> --- a/net/core/netdev_rx_queue.c
>> +++ b/net/core/netdev_rx_queue.c
>> @@ -170,48 +170,63 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>>   		      struct netlink_ext_ack *extack)
>>   {
>>   	struct netdev_rx_queue *rxq;
>> +	bool needs_unlock = false;
>>   	int ret;
>>   
>>   	if (!netdev_need_ops_lock(dev))
>>   		return -EOPNOTSUPP;
>> -
>>   	if (rxq_idx >= dev->real_num_rx_queues) {
>>   		NL_SET_ERR_MSG(extack, "rx queue index out of range");
>>   		return -ERANGE;
>>   	}
>> -	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
>>   
>> +	rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
>> +	rxq = netif_get_rx_queue_peer_locked(&dev, &rxq_idx, &needs_unlock);
>> +	if (!rxq) {
>> +		NL_SET_ERR_MSG(extack, "rx queue peered to a virtual netdev");
>> +		return -EBUSY;
>> +	}
>> +	if (!dev->dev.parent) {
>> +		NL_SET_ERR_MSG(extack, "rx queue is mapped to a virtual netdev");
>> +		ret = -EBUSY;
>> +		goto out;
>> +	}
>>   	if (dev->cfg->hds_config != ETHTOOL_TCP_DATA_SPLIT_ENABLED) {
>>   		NL_SET_ERR_MSG(extack, "tcp-data-split is disabled");
>> -		return -EINVAL;
>> +		ret = -EINVAL;
>> +		goto out;
>>   	}
>>   	if (dev->cfg->hds_thresh) {
>>   		NL_SET_ERR_MSG(extack, "hds-thresh is not zero");
>> -		return -EINVAL;
>> +		ret = -EINVAL;
>> +		goto out;
>>   	}
>>   	if (dev_xdp_prog_count(dev)) {
>>   		NL_SET_ERR_MSG(extack, "unable to custom memory provider to device with XDP program attached");
>> -		return -EEXIST;
>> +		ret = -EEXIST;
>> +		goto out;
>>   	}
>> -
>> -	rxq = __netif_get_rx_queue(dev, rxq_idx);
>>   	if (rxq->mp_params.mp_ops) {
>>   		NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
>> -		return -EEXIST;
>> +		ret = -EEXIST;
>> +		goto out;
>>   	}
>>   #ifdef CONFIG_XDP_SOCKETS
>>   	if (rxq->pool) {
>>   		NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
>> -		return -EBUSY;
>> +		ret = -EBUSY;
>> +		goto out;
>>   	}
>>   #endif
>> -
>>   	rxq->mp_params = *p;
>>   	ret = netdev_rx_queue_restart(dev, rxq_idx);
>>   	if (ret) {
>>   		rxq->mp_params.mp_ops = NULL;
>>   		rxq->mp_params.mp_priv = NULL;
>>   	}
>> +out:
>> +	if (needs_unlock)
>> +		netdev_unlock(dev);
> 
> Can we do something better than needs_unlock flag? Maybe something like the
> following?
> 
> netif_put_rx_queue_peer_locked(orig_dev, dev)
> {
> 	if (orig_dev != dev)
> 		netdev_unlock(dev);
> }
> 
> Then we can do:
> 
> orig_dev = dev;
> rxq = netif_get_rx_queue_peer_locked(&dev, &rx_idx);
> ...
> netif_put_rx_queue_peer_locked(orig_dev, dev);

Thanks, that's a lot cleaner, changed in v4.

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-24 23:18       ` Jakub Kicinski
@ 2025-10-29  2:08         ` David Wei
  2025-10-29 22:47           ` Jakub Kicinski
  0 siblings, 1 reply; 54+ messages in thread
From: David Wei @ 2025-10-29  2:08 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, toke,
	yangzhenze, wangdongdong.6

On 2025-10-24 16:18, Jakub Kicinski wrote:
> On Fri, 24 Oct 2025 14:59:39 +0200 Daniel Borkmann wrote:
>> On 10/24/25 4:33 AM, Jakub Kicinski wrote:
>>> On Mon, 20 Oct 2025 18:23:43 +0200 Daniel Borkmann wrote:
>>>> Add a nested peer field to the queue-get response that returns the peered
>>>> ifindex and queue id.
>>>>
>>>> Example with ynl client:
>>>>
>>>>     # ip netns exec foo ./pyynl/cli.py \
>>>>         --spec ~/netlink/specs/netdev.yaml \
>>>>         --do queue-get \
>>>>         --json '{"ifindex": 3, "id": 1, "type": "rx"}'
>>>>     {'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4, 'netns-id': 21}, 'type': 'rx'}
>>>
>>> I'm struggling with the roles of what is src and dst and peer :(
>>> No great suggestion off the top of my head but better terms would
>>> make this much easier to review.
>>>
>>> The example seems to be from the container side. Do we need to show peer
>>> info on the container side? Not just on the host side?
>>
>> I think it's up to us which side we want to show. My thinking was to allow
>> user introspection from both, but we don't have to. Right now the above
>> example was from the container side, but technically it could be either side
>> depending on which netns the phys dev is located in.
>>
>> The user knows which is which based on the ifindex passed to the queue-get
>> query: if the ifindex is from a virtual device (e.g. netkit type), then the
>> 'peer' section shows the phys dev, and vice versa, if the ifindex is from a
>> phys device (say, mlx5), then the 'peer' section shows the virtual one.
>>
>> Maybe I'll provide a better, more in-depth example with both sides and the
>> above explanation in the commit msg for v4.
> 
> Yes, FWIW my mental model is that "leaking" host information into the
> container is best avoided. Not a problem, but shouldn't be done without
> a clear reason.
> Typical debug scenario can be covered from the host side (container X
> is having issues with queue Y, dump all the queues, find out which one
> is bound to X/Y).

Makes sense, I didn't consider leaking host info in a container. Happy
to remove the introspection from the container side, leaving it only on
the host side when queues are dumped.

Like Daniel mentioned, I didn't add 'src/real' or 'dst/virtual' because
I believed this information is implicit to the user when querying a
netdev based on its type. Do you find this to be confusing? Happy to add
a clarifying field in the nested struct.
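
FWIW, querying from the host side would then look roughly like this
(illustrative output, reusing the ids from the earlier example):

  # ./pyynl/cli.py --spec ~/netlink/specs/netdev.yaml \
      --do queue-get \
      --json '{"ifindex": 4, "id": 15, "type": "rx"}'
  {'id': 15, 'ifindex': 4, 'peer': {'id': 1, 'ifindex': 3}, 'type': 'rx'}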

* Re: [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit
  2025-10-28 22:41     ` David Wei
@ 2025-10-29 16:46       ` Daniel Borkmann
  0 siblings, 0 replies; 54+ messages in thread
From: Daniel Borkmann @ 2025-10-29 16:46 UTC (permalink / raw)
  To: David Wei, Jakub Kicinski
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, toke,
	yangzhenze, wangdongdong.6

On 10/28/25 11:41 PM, David Wei wrote:
> On 2025-10-23 19:28, Jakub Kicinski wrote:
>> On Mon, 20 Oct 2025 18:23:42 +0200 Daniel Borkmann wrote:
>>> +void netdev_rx_queue_peer(struct net_device *src_dev,
>>> +              struct netdev_rx_queue *src_rxq,
>>> +              struct netdev_rx_queue *dst_rxq)
>>> +{
>>> +    netdev_assert_locked(src_dev);
>>> +    netdev_assert_locked(dst_rxq->dev);
>>> +
>>> +    netdev_hold(src_dev, &src_rxq->dev_tracker, GFP_KERNEL);
>>
>> Isn't ->dev_tracker already used by sysfs?
> 
> You're right, it is. Can netdevice_tracker not be shared?

Given this is not common practice, I've added a peer_tracker (which is
also only enabled / takes space on debug kernels anyway).
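
I.e. roughly (modulo exact placement):

	struct netdev_rx_queue {
		...
		netdevice_tracker	peer_tracker;
	};

	netdev_hold(src_dev, &src_rxq->peer_tracker, GFP_KERNEL);
	...
	netdev_put(src_dev, &src_rxq->peer_tracker);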

Thanks,
Daniel

* Re: [PATCH net-next v3 03/15] net: Add peer info to queue-get response
  2025-10-29  2:08         ` David Wei
@ 2025-10-29 22:47           ` Jakub Kicinski
  0 siblings, 0 replies; 54+ messages in thread
From: Jakub Kicinski @ 2025-10-29 22:47 UTC (permalink / raw)
  To: David Wei
  Cc: Daniel Borkmann, netdev, bpf, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6

On Tue, 28 Oct 2025 19:08:10 -0700 David Wei wrote:
> >> I think it's up to us which side we want to show. My thinking was to allow
> >> user introspection from both, but we don't have to. Right now the above
> >> example was from the container side, but technically it could be either
> >> side depending on which netns the phys dev is located in.
> >>
> >> The user knows which is which based on the ifindex passed to the queue-get
> >> query: if the ifindex is from a virtual device (e.g. netkit type), then the
> >> 'peer' section shows the phys dev, and vice versa, if the ifindex is from a
> >> phys device (say, mlx5), then the 'peer' section shows the virtual one.
> >>
> >> Maybe I'll provide a better, more in-depth example with both sides and the
> >> above explanation in the commit msg for v4.
> > 
> > Yes, FWIW my mental model is that "leaking" host information into the
> > container is best avoided. Not a problem, but shouldn't be done without
> > a clear reason.
> > Typical debug scenario can be covered from the host side (container X
> > is having issues with queue Y, dump all the queues, find out which one
> > is bound to X/Y).  
> 
> Makes sense, I didn't consider leaking host info in a container. Happy
> to remove the introspection from the container side, leaving it only on
> the host side when queues are dumped.
> 
> Like Daniel mentioned, I didn't add 'src/real' or 'dst/virtual' because
> I believed this information is implicit to the user when querying a
> netdev based on its type. Do you find this to be confusing? Happy to add
> a clarifying field in the nested struct.

In veth/netkit we call "peer" the other side of an equal pipe. Same for
ndo_get_peer_dev. A queue is not a peering situation, but rather an
attachment / delegation of a sub-object from one netdev to another.

I'd use a term like delegation or grant when talking about the HW
queue, and assignment in the context of the virtual one.

Thread overview: 54+ messages
2025-10-20 16:23 [PATCH net-next v3 00/15] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
2025-10-20 16:23 ` [PATCH net-next v3 01/15] net: Add bind-queue operation Daniel Borkmann
2025-10-22 11:19   ` Nikolay Aleksandrov
2025-10-24  2:12   ` Jakub Kicinski
2025-10-24 10:15     ` Daniel Borkmann
2025-10-24 18:11       ` Stanislav Fomichev
2025-10-24 19:17         ` Daniel Borkmann
2025-10-20 16:23 ` [PATCH net-next v3 02/15] net: Implement netdev_nl_bind_queue_doit Daniel Borkmann
2025-10-22 11:17   ` Nikolay Aleksandrov
2025-10-22 11:26     ` Daniel Borkmann
2025-10-23 10:17   ` Paolo Abeni
2025-10-23 12:46     ` Daniel Borkmann
2025-10-23 10:27   ` Paolo Abeni
2025-10-23 12:48     ` Daniel Borkmann
2025-10-24  2:08       ` Jakub Kicinski
2025-10-28 21:59         ` David Wei
2025-10-28 23:44           ` Jakub Kicinski
2025-10-29  0:38             ` David Wei
2025-10-24  2:28   ` Jakub Kicinski
2025-10-28 22:41     ` David Wei
2025-10-29 16:46       ` Daniel Borkmann
2025-10-24 18:20   ` Stanislav Fomichev
2025-10-24 19:15     ` Daniel Borkmann
2025-10-20 16:23 ` [PATCH net-next v3 03/15] net: Add peer info to queue-get response Daniel Borkmann
2025-10-22 11:23   ` Nikolay Aleksandrov
2025-10-24  2:33   ` Jakub Kicinski
2025-10-24 12:59     ` Daniel Borkmann
2025-10-24 23:18       ` Jakub Kicinski
2025-10-29  2:08         ` David Wei
2025-10-29 22:47           ` Jakub Kicinski
2025-10-20 16:23 ` [PATCH net-next v3 04/15] net, ethtool: Disallow peered real rxqs to be resized Daniel Borkmann
2025-10-22 11:25   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 05/15] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
2025-10-22 12:50   ` Nikolay Aleksandrov
2025-10-24 18:36   ` Stanislav Fomichev
2025-10-29  2:07     ` David Wei
2025-10-20 16:23 ` [PATCH net-next v3 06/15] xsk: Move NETDEV_XDP_ACT_ZC into generic header Daniel Borkmann
2025-10-22 12:51   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 07/15] xsk: Move pool registration into single function Daniel Borkmann
2025-10-22 12:52   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 08/15] xsk: Add small helper xp_pool_bindable Daniel Borkmann
2025-10-22 12:52   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 09/15] xsk: Change xsk_rcv_check to check netdev/queue_id from pool Daniel Borkmann
2025-10-20 16:23 ` [PATCH net-next v3 10/15] xsk: Proxy pool management for mapped queues Daniel Borkmann
2025-10-20 16:23 ` [PATCH net-next v3 11/15] netkit: Add single device mode for netkit Daniel Borkmann
2025-10-22 13:13   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 12/15] netkit: Document fast vs slowpath members via macros Daniel Borkmann
2025-10-22 13:02   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 13/15] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create Daniel Borkmann
2025-10-22 13:00   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 14/15] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
2025-10-22 13:12   ` Nikolay Aleksandrov
2025-10-20 16:23 ` [PATCH net-next v3 15/15] netkit: Add xsk support for af_xdp applications Daniel Borkmann
2025-10-22 14:27   ` Nikolay Aleksandrov
