* [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
Containers use virtual netdevs to route traffic from a physical netdev
in the host namespace. They do not have access to the physical netdev
in the host and thus can't use memory providers or AF_XDP that require
reconfiguring/restarting queues in the physical netdev.
This patchset adds the concept of queue peering to virtual netdevs, which
allows containers to use memory providers and AF_XDP at _native speed_!
These mapped queues are bound to a real queue in a physical netdev and
act as a proxy.
Memory providers and AF_XDP operations take an ifindex and a queue id,
so containers would pass in an ifindex for a virtual netdev and a queue
id of a mapped queue, which then gets proxied to the underlying real
queue. Peered queues are created and bound to a real queue atomically
through a generic ynl netdev operation.
We have implemented support for this concept in netkit and tested it
against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
(bnxt_en) 100G NICs. For more details see the individual patches.
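
For illustration, the host-side control plane peers a real rxq with a
newly created rxq on the virtual device through ynl. The ifindex and
queue id values below are hypothetical (4 being the physical netdev,
3 the netkit device):

  # ./pyynl/cli.py \
        --spec ~/netlink/specs/netdev.yaml \
        --do bind-queue \
        --json '{"src-ifindex": 4, "src-queue-id": 15, "dst-ifindex": 3}'
  {'dst-queue-id': 1}

A process in the container then opens its memory provider or AF_XDP
socket against ifindex 3 and queue id 1, and the kernel proxies the
request to queue 15 of the physical netdev.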
Daniel Borkmann (10):
net: Add ndo_{peer,unpeer}_queues callback
net, ethtool: Disallow mapped real rxqs to be resized
xsk: Move NETDEV_XDP_ACT_ZC into generic header
xsk: Move pool registration into single function
xsk: Add small helper xp_pool_bindable
xsk: Change xsk_rcv_check to check netdev/queue_id from pool
xsk: Proxy pool management for mapped queues
netkit: Add single device mode for netkit
netkit: Document fast vs slowpath members via macros
netkit: Add xsk support for af_xdp applications
David Wei (10):
net, ynl: Add bind-queue operation
net: Add peer to netdev_rx_queue
net: Add ndo_queue_create callback
net, ynl: Implement netdev_nl_bind_queue_doit
net, ynl: Add peer info to queue-get response
net: Proxy net_mp_{open,close}_rxq for mapped queues
netkit: Implement rtnl_link_ops->alloc
netkit: Implement ndo_queue_create
netkit: Add io_uring zero-copy support for TCP
tools, ynl: Add queue binding ynl sample application
Documentation/netlink/specs/netdev.yaml | 54 ++++
drivers/net/netkit.c | 362 ++++++++++++++++++++----
include/linux/netdevice.h | 15 +-
include/net/netdev_queues.h | 1 +
include/net/netdev_rx_queue.h | 55 ++++
include/net/xdp_sock_drv.h | 8 +-
include/uapi/linux/if_link.h | 6 +
include/uapi/linux/netdev.h | 20 ++
net/core/netdev-genl-gen.c | 14 +
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 144 +++++++++-
net/core/netdev_rx_queue.c | 15 +-
net/ethtool/channels.c | 10 +-
net/xdp/xsk.c | 27 +-
net/xdp/xsk.h | 5 +-
net/xdp/xsk_buff_pool.c | 29 +-
tools/include/uapi/linux/netdev.h | 20 ++
tools/net/ynl/samples/bind.c | 56 ++++
18 files changed, 750 insertions(+), 92 deletions(-)
create mode 100644 tools/net/ynl/samples/bind.c
--
2.43.0
* [PATCH net-next 01/20] net, ynl: Add bind-queue operation
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Add a ynl netdev family operation called bind-queue that _binds_ an
rxq from a real netdev to a virtual netdev, i.e. netkit or veth. This
bound or _mapped_ rxq in the virtual netdev acts as a proxy for the
parent real rxq, and allows processes running in a container to use
memory providers (io_uring zero-copy rx or devmem) or AF_XDP.
An early implementation had only driver-specific integration [0], but
in order for other virtual devices to be able to reuse it, it makes
sense to have this as a generic API.
src-ifindex and src-queue-id are the real netdev and rxq, respectively.
dst-ifindex is the virtual netdev. Note that this op doesn't take
dst-queue-id, because the expectation is that the op will _create_ a
new rxq in the virtual netdev. The virtual netdev must have
real_num_rx_queues less than num_rx_queues at the time of calling
bind-queue.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
---
Documentation/netlink/specs/netdev.yaml | 37 +++++++++++++++++++++++++
include/uapi/linux/netdev.h | 11 ++++++++
net/core/netdev-genl-gen.c | 14 ++++++++++
net/core/netdev-genl-gen.h | 1 +
net/core/netdev-genl.c | 4 +++
tools/include/uapi/linux/netdev.h | 11 ++++++++
6 files changed, 78 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e00d3fa1c152..99a430ea8a9a 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -561,6 +561,29 @@ attribute-sets:
type: u32
checks:
min: 1
+ -
+ name: queue-pair
+ attributes:
+ -
+ name: src-ifindex
+ doc: netdev ifindex of the physical device
+ type: u32
+ checks:
+ min: 1
+ -
+ name: src-queue-id
+ doc: netdev queue id of the physical device
+ type: u32
+ -
+ name: dst-ifindex
+ doc: netdev ifindex of the virtual device
+ type: u32
+ checks:
+ min: 1
+ -
+ name: dst-queue-id
+ doc: netdev queue id of the virtual device
+ type: u32
operations:
list:
@@ -772,6 +795,20 @@ operations:
attributes:
- id
+ -
+ name: bind-queue
+ doc: Bind a physical netdev queue to a virtual one
+ attribute-set: queue-pair
+ do:
+ request:
+ attributes:
+ - src-ifindex
+ - src-queue-id
+ - dst-ifindex
+ reply:
+ attributes:
+ - dst-queue-id
+
kernel-family:
headers: ["net/netdev_netlink.h"]
sock-priv: struct netdev_nl_sock
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 48eb49aa03d4..05e17765a39d 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -210,6 +210,16 @@ enum {
NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
};
+enum {
+ NETDEV_A_QUEUE_PAIR_SRC_IFINDEX = 1,
+ NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
+ NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+ NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
+
+ __NETDEV_A_QUEUE_PAIR_MAX,
+ NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
+};
+
enum {
NETDEV_CMD_DEV_GET = 1,
NETDEV_CMD_DEV_ADD_NTF,
@@ -226,6 +236,7 @@ enum {
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
NETDEV_CMD_BIND_TX,
+ NETDEV_CMD_BIND_QUEUE,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index e9a2a6f26cb7..10b2ab4dd500 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -106,6 +106,13 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
[NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
};
+/* NETDEV_CMD_BIND_QUEUE - do */
+static const struct nla_policy netdev_bind_queue_nl_policy[NETDEV_A_QUEUE_PAIR_DST_IFINDEX + 1] = {
+ [NETDEV_A_QUEUE_PAIR_SRC_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+ [NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID] = { .type = NLA_U32, },
+ [NETDEV_A_QUEUE_PAIR_DST_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
+};
+
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@@ -204,6 +211,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_DMABUF_FD,
.flags = GENL_CMD_CAP_DO,
},
+ {
+ .cmd = NETDEV_CMD_BIND_QUEUE,
+ .doit = netdev_nl_bind_queue_doit,
+ .policy = netdev_bind_queue_nl_policy,
+ .maxattr = NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+ .flags = GENL_CMD_CAP_DO,
+ },
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index cf3fad74511f..309248fe2b9e 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
+int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info);
enum {
NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 470fabbeacd9..b0aea27bf84e 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1120,6 +1120,10 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
return err;
}
+int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
+{
+}
+
void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
{
INIT_LIST_HEAD(&priv->bindings);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 48eb49aa03d4..05e17765a39d 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -210,6 +210,16 @@ enum {
NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
};
+enum {
+ NETDEV_A_QUEUE_PAIR_SRC_IFINDEX = 1,
+ NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
+ NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
+ NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
+
+ __NETDEV_A_QUEUE_PAIR_MAX,
+ NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
+};
+
enum {
NETDEV_CMD_DEV_GET = 1,
NETDEV_CMD_DEV_ADD_NTF,
@@ -226,6 +236,7 @@ enum {
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
NETDEV_CMD_BIND_TX,
+ NETDEV_CMD_BIND_QUEUE,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
--
2.43.0
* [PATCH net-next 02/20] net: Add peer to netdev_rx_queue
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Add a peer pointer to netdev_rx_queue that points from the real rxq to the
mapped rxq in a virtual netdev, and vice versa.
Add related helpers that set, unset, get and check the peer pointer.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
include/net/netdev_rx_queue.h | 51 +++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..47126ccaf854 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -28,6 +28,7 @@ struct netdev_rx_queue {
#endif
struct napi_struct *napi;
struct pp_memory_provider_params mp_params;
+ struct netdev_rx_queue *peer;
} ____cacheline_aligned_in_smp;
/*
@@ -58,4 +59,54 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue)
int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq);
+static inline void __netdev_rx_queue_peer(struct netdev_rx_queue *src_rxq,
+ struct netdev_rx_queue *dst_rxq)
+{
+ src_rxq->peer = dst_rxq;
+ dst_rxq->peer = src_rxq;
+}
+
+static inline void netdev_rx_queue_peer(struct net_device *src_dev,
+ struct netdev_rx_queue *src_rxq,
+ struct netdev_rx_queue *dst_rxq)
+{
+ dev_hold(src_dev);
+ __netdev_rx_queue_peer(src_rxq, dst_rxq);
+}
+
+static inline void __netdev_rx_queue_unpeer(struct netdev_rx_queue *src_rxq,
+ struct netdev_rx_queue *dst_rxq)
+{
+ src_rxq->peer = NULL;
+ dst_rxq->peer = NULL;
+}
+
+static inline void netdev_rx_queue_unpeer(struct net_device *src_dev,
+ struct netdev_rx_queue *src_rxq,
+ struct netdev_rx_queue *dst_rxq)
+{
+ __netdev_rx_queue_unpeer(src_rxq, dst_rxq);
+ dev_put(src_dev);
+}
+
+static inline bool netdev_rx_queue_peered(struct net_device *dev,
+ u16 queue_id)
+{
+ if (queue_id < dev->real_num_rx_queues)
+ return dev->_rx[queue_id].peer;
+ return false;
+}
+
+static inline struct netdev_rx_queue *
+__netif_get_rx_queue_peer(struct net_device **dev, unsigned int *rxq_idx)
+{
+ struct netdev_rx_queue *rxq = __netif_get_rx_queue(*dev, *rxq_idx);
+
+ if (rxq->peer) {
+ rxq = rxq->peer;
+ *rxq_idx = get_netdev_rx_queue_index(rxq);
+ *dev = rxq->dev;
+ }
+ return rxq;
+}
#endif
--
2.43.0
* [PATCH net-next 03/20] net: Add ndo_queue_create callback
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Add ndo_queue_create() to netdev_queue_mgmt_ops, which creates a new
rxq specifically for mapping to a real rxq. The intent is for only
virtual netdevs, i.e. netkit and veth, to implement this ndo. It is
called from the ynl netdev family bind-queue op to atomically create
a mapped rxq and bind it to a real rxq.
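
A minimal sketch of a driver-side implementation (hypothetical foo
driver, not part of this series; locking and rxq initialization are
elided): per the bind-queue caller, a positive return value is the new
real rxq count, from which the new rxq index is derived as the return
value minus one:

  static int foo_queue_create(struct net_device *dev)
  {
          unsigned int rxqs = dev->real_num_rx_queues + 1;
          int err;

          if (rxqs > dev->num_rx_queues)
                  return -ENOSPC;
          /* assumes the core's queue locking rules are satisfied */
          err = netif_set_real_num_rx_queues(dev, rxqs);
          if (err)
                  return err;
          /* bind-queue derives the new rxq index as rxqs - 1 */
          return rxqs;
  }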
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
include/net/netdev_queues.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index cd00e0406cf4..6b0d2416728d 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
int idx);
struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
int idx);
+ int (*ndo_queue_create)(struct net_device *dev);
};
bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
--
2.43.0
* [PATCH net-next 04/20] net: Add ndo_{peer,unpeer}_queues callback
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Add ndo_{peer,unpeer}_queues() callbacks which can be used by virtual
drivers that implement rxq mapping to a real rxq in order to update
their internal state or exposed capability flags when the set of rxq
mappings changes.
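
As a sketch of the intent (hypothetical foo driver, not part of this
patch), a virtual driver could advertise AF_XDP support once a real
rxq backs one of its queues:

  static void foo_peer_queues(struct net_device *dev,
                              struct netdev_rx_queue *rxq)
  {
          /* Hypothetical: flip on XDP feature flags now that the
           * device has at least one peered real rxq.
           */
          xdp_set_features_flag(dev, NETDEV_XDP_ACT_BASIC |
                                     NETDEV_XDP_ACT_REDIRECT |
                                     NETDEV_XDP_ACT_XSK_ZEROCOPY);
  }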
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
include/linux/netdevice.h | 15 ++++++++++++++-
include/net/netdev_rx_queue.h | 4 ++++
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1c54d44805fa..43b3c4e3593e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -65,6 +65,7 @@ struct macsec_context;
struct macsec_ops;
struct netdev_config;
struct netdev_name_node;
+struct netdev_rx_queue;
struct sd_flow_limit;
struct sfp_bus;
/* 802.11 specific */
@@ -1404,6 +1405,15 @@ struct netdev_net_notifier {
* struct kernel_hwtstamp_config *kernel_config,
* struct netlink_ext_ack *extack);
* Change the hardware timestamping parameters for NIC device.
+ *
+ * void (*ndo_peer_queues)(struct net_device *dev, struct netdev_rx_queue *rxq);
+ * Custom callback for drivers when a physical queue gets peered with
+ * a virtual one, so that device drivers can update exposed device flags.
+ *
+ * void (*ndo_unpeer_queues)(struct net_device *dev, struct netdev_rx_queue *rxq);
+ * Custom callback for drivers when a physical queue gets unpeered with
+ * a virtual one, so that device drivers can update exposed device flags.
+ * Reverse operation of ndo_peer_queues.
*/
struct net_device_ops {
int (*ndo_init)(struct net_device *dev);
@@ -1651,7 +1661,10 @@ struct net_device_ops {
int (*ndo_hwtstamp_set)(struct net_device *dev,
struct kernel_hwtstamp_config *kernel_config,
struct netlink_ext_ack *extack);
-
+ void (*ndo_peer_queues)(struct net_device *dev,
+ struct netdev_rx_queue *rxq);
+ void (*ndo_unpeer_queues)(struct net_device *dev,
+ struct netdev_rx_queue *rxq);
#if IS_ENABLED(CONFIG_NET_SHAPER)
/**
* @net_shaper_ops: Device shaping offload operations
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 47126ccaf854..fdfacd28c2ae 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -72,6 +72,8 @@ static inline void netdev_rx_queue_peer(struct net_device *src_dev,
{
dev_hold(src_dev);
__netdev_rx_queue_peer(src_rxq, dst_rxq);
+ if (dst_rxq->dev->netdev_ops->ndo_peer_queues)
+ dst_rxq->dev->netdev_ops->ndo_peer_queues(dst_rxq->dev, dst_rxq);
}
static inline void __netdev_rx_queue_unpeer(struct netdev_rx_queue *src_rxq,
@@ -85,6 +87,8 @@ static inline void netdev_rx_queue_unpeer(struct net_device *src_dev,
struct netdev_rx_queue *src_rxq,
struct netdev_rx_queue *dst_rxq)
{
+ if (dst_rxq->dev->netdev_ops->ndo_unpeer_queues)
+ dst_rxq->dev->netdev_ops->ndo_unpeer_queues(dst_rxq->dev, dst_rxq);
__netdev_rx_queue_unpeer(src_rxq, dst_rxq);
dev_put(src_dev);
}
--
2.43.0
* [PATCH net-next 05/20] net, ynl: Implement netdev_nl_bind_queue_doit
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Implement netdev_nl_bind_queue_doit() that creates a mapped rxq in a
virtual netdev and then binds it to a real rxq in a physical netdev
by setting the peer pointer in netdev_rx_queue.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
net/core/netdev-genl.c | 117 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 117 insertions(+)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index b0aea27bf84e..ed0ce3dbfc6f 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -1122,6 +1122,123 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
{
+ u32 src_ifidx, src_qid, dst_ifidx, dst_qid;
+ struct netdev_rx_queue *src_rxq, *dst_rxq;
+ struct net_device *src_dev, *dst_dev;
+ struct netdev_nl_sock *priv;
+ struct sk_buff *rsp;
+ int err = 0;
+ void *hdr;
+
+ if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_IFINDEX) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID) ||
+ GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_DST_IFINDEX))
+ return -EINVAL;
+
+ src_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX]);
+ src_qid = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID]);
+ dst_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_DST_IFINDEX]);
+ if (dst_ifidx == src_ifidx) {
+ NL_SET_ERR_MSG(info->extack,
+ "Destination driver cannot be same as source driver");
+ return -EOPNOTSUPP;
+ }
+
+ priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
+ if (IS_ERR(priv))
+ return PTR_ERR(priv);
+
+ rsp = genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!rsp)
+ return -ENOMEM;
+
+ hdr = genlmsg_iput(rsp, info);
+ if (!hdr) {
+ err = -EMSGSIZE;
+ goto err_genlmsg_free;
+ }
+
+ mutex_lock(&priv->lock);
+
+ src_dev = netdev_get_by_index_lock(genl_info_net(info), src_ifidx);
+ if (!src_dev) {
+ err = -ENODEV;
+ goto err_unlock_sock;
+ }
+ if (!netif_device_present(src_dev)) {
+ err = -ENODEV;
+ goto err_unlock_src_dev;
+ }
+ if (!src_dev->dev.parent) {
+ err = -EOPNOTSUPP;
+ NL_SET_ERR_MSG(info->extack,
+ "Source driver is a virtual device");
+ goto err_unlock_src_dev;
+ }
+ if (!src_dev->queue_mgmt_ops) {
+ err = -EOPNOTSUPP;
+ NL_SET_ERR_MSG(info->extack,
+ "Source driver does not support queue management operations");
+ goto err_unlock_src_dev;
+ }
+ if (src_qid >= src_dev->num_rx_queues) {
+ err = -ERANGE;
+ NL_SET_ERR_MSG(info->extack,
+ "Source driver queue out of range");
+ goto err_unlock_src_dev;
+ }
+
+ src_rxq = __netif_get_rx_queue(src_dev, src_qid);
+ if (src_rxq->peer) {
+ err = -EBUSY;
+ NL_SET_ERR_MSG(info->extack,
+ "Source driver queue already bound");
+ goto err_unlock_src_dev;
+ }
+
+ dst_dev = netdev_get_by_index_lock(genl_info_net(info), dst_ifidx);
+ if (!dst_dev) {
+ err = -ENODEV;
+ goto err_unlock_src_dev;
+ }
+ if (!dst_dev->queue_mgmt_ops ||
+ !dst_dev->queue_mgmt_ops->ndo_queue_create) {
+ err = -EOPNOTSUPP;
+ NL_SET_ERR_MSG(info->extack,
+ "Destination driver does not support queue management operations");
+ goto err_unlock_dst_dev;
+ }
+
+ err = dst_dev->queue_mgmt_ops->ndo_queue_create(dst_dev);
+ if (err <= 0) {
+ NL_SET_ERR_MSG(info->extack,
+ "Destination driver unable to create a new queue");
+ goto err_unlock_dst_dev;
+ }
+
+ dst_qid = err - 1;
+ dst_rxq = __netif_get_rx_queue(dst_dev, dst_qid);
+
+ netdev_rx_queue_peer(src_dev, src_rxq, dst_rxq);
+
+ nla_put_u32(rsp, NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID, dst_qid);
+ genlmsg_end(rsp, hdr);
+
+ netdev_unlock(dst_dev);
+ netdev_unlock(src_dev);
+ mutex_unlock(&priv->lock);
+
+ return genlmsg_reply(rsp, info);
+
+err_unlock_dst_dev:
+ netdev_unlock(dst_dev);
+err_unlock_src_dev:
+ netdev_unlock(src_dev);
+err_unlock_sock:
+ mutex_unlock(&priv->lock);
+err_genlmsg_free:
+ nlmsg_free(rsp);
+ return err;
}
void netdev_nl_sock_priv_init(struct netdev_nl_sock *priv)
--
2.43.0
* [PATCH net-next 06/20] net, ynl: Add peer info to queue-get response
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Add a nested peer field to the queue-get response that returns the
peered ifindex and queue id. If the queried queue is a mapped queue
in a virtual netdev, the nested fields for dmabuf/io-uring/xsk will
be filled in, too.
Example:
# ip netns exec foo ./pyynl/cli.py \
--spec ~/netlink/specs/netdev.yaml \
--do queue-get \
--json '{"ifindex": 3, "id": 1, "type": "rx"}'
{'id': 1, 'ifindex': 3, 'peer': {'id': 15, 'ifindex': 4}, 'io-uring': {}, 'type': 'rx'}
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
Documentation/netlink/specs/netdev.yaml | 17 +++++++++++++++++
include/uapi/linux/netdev.h | 9 +++++++++
net/core/netdev-genl.c | 23 ++++++++++++++++++++++-
tools/include/uapi/linux/netdev.h | 9 +++++++++
4 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 99a430ea8a9a..1467c36f6b5f 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -297,6 +297,17 @@ attribute-sets:
-
name: xsk-info
attributes: []
+ -
+ name: peer-info
+ attributes:
+ -
+ name: id
+ doc: Queue index of the netdevice to which the peer queue belongs.
+ type: u32
+ -
+ name: ifindex
+ doc: ifindex of the netdevice to which the peer queue belongs.
+ type: u32
-
name: queue
attributes:
@@ -338,6 +349,11 @@ attribute-sets:
doc: XSK information for this queue, if any.
type: nest
nested-attributes: xsk-info
+ -
+ name: peer
+ doc: Whether this queue was bound to another peer queue.
+ type: nest
+ nested-attributes: peer-info
-
name: qstats
doc: |
@@ -706,6 +722,7 @@ operations:
- dmabuf
- io-uring
- xsk
+ - peer
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 05e17765a39d..73d1590e4696 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -150,6 +150,14 @@ enum {
NETDEV_A_XSK_INFO_MAX = (__NETDEV_A_XSK_INFO_MAX - 1)
};
+enum {
+ NETDEV_A_PEER_INFO_ID = 1,
+ NETDEV_A_PEER_INFO_IFINDEX,
+
+ __NETDEV_A_PEER_INFO_MAX,
+ NETDEV_A_PEER_INFO_MAX = (__NETDEV_A_PEER_INFO_MAX - 1)
+};
+
enum {
NETDEV_A_QUEUE_ID = 1,
NETDEV_A_QUEUE_IFINDEX,
@@ -158,6 +166,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_PEER,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index ed0ce3dbfc6f..c20922539216 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -393,6 +393,7 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
struct pp_memory_provider_params *params;
struct netdev_rx_queue *rxq;
struct netdev_queue *txq;
+ struct nlattr *nest;
void *hdr;
hdr = genlmsg_iput(rsp, info);
@@ -410,6 +411,27 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
if (nla_put_napi_id(rsp, rxq->napi))
goto nla_put_failure;
+ if (netdev_rx_queue_peered(netdev, q_idx)) {
+ struct netdev_rx_queue *p_rxq;
+ struct net_device *p_netdev = netdev;
+ u32 p_q_idx = q_idx;
+
+ nest = nla_nest_start(rsp, NETDEV_A_QUEUE_PEER);
+ if (!nest)
+ goto nla_put_failure;
+ p_rxq = __netif_get_rx_queue_peer(&p_netdev, &p_q_idx);
+ if (nla_put_u32(rsp, NETDEV_A_PEER_INFO_ID, p_q_idx) ||
+ nla_put_u32(rsp, NETDEV_A_PEER_INFO_IFINDEX, p_netdev->ifindex))
+ goto nla_put_failure;
+ nla_nest_end(rsp, nest);
+
+ if (!netdev->dev.parent) {
+ netdev = p_netdev;
+ q_idx = p_q_idx;
+ rxq = p_rxq;
+ }
+ }
+
params = &rxq->mp_params;
if (params->mp_ops &&
params->mp_ops->nl_fill(params->mp_priv, rsp, rxq))
@@ -419,7 +441,6 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
if (nla_put_empty_nest(rsp, NETDEV_A_QUEUE_XSK))
goto nla_put_failure;
#endif
-
break;
case NETDEV_QUEUE_TYPE_TX:
txq = netdev_get_tx_queue(netdev, q_idx);
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 05e17765a39d..73d1590e4696 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -150,6 +150,14 @@ enum {
NETDEV_A_XSK_INFO_MAX = (__NETDEV_A_XSK_INFO_MAX - 1)
};
+enum {
+ NETDEV_A_PEER_INFO_ID = 1,
+ NETDEV_A_PEER_INFO_IFINDEX,
+
+ __NETDEV_A_PEER_INFO_MAX,
+ NETDEV_A_PEER_INFO_MAX = (__NETDEV_A_PEER_INFO_MAX - 1)
+};
+
enum {
NETDEV_A_QUEUE_ID = 1,
NETDEV_A_QUEUE_IFINDEX,
@@ -158,6 +166,7 @@ enum {
NETDEV_A_QUEUE_DMABUF,
NETDEV_A_QUEUE_IO_URING,
NETDEV_A_QUEUE_XSK,
+ NETDEV_A_QUEUE_PEER,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
--
2.43.0
* [PATCH net-next 07/20] net, ethtool: Disallow mapped real rxqs to be resized
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Similar to AF_XDP, do not allow queues in a physical netdev to be
resized by ethtool -L when they are peered.
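
Hypothetical example of the resulting failure mode (device name and
channel counts are made up), where eth0 has a peered rxq above the
requested range:

  # ethtool -L eth0 combined 2
  netlink error: requested channel counts are too low due to existing queue peering
  netlink error: Invalid argument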
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
net/ethtool/channels.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/ethtool/channels.c b/net/ethtool/channels.c
index ca4f80282448..0ede1075e016 100644
--- a/net/ethtool/channels.c
+++ b/net/ethtool/channels.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
+#include <net/netdev_rx_queue.h>
#include <net/xdp_sock_drv.h>
#include "netlink.h"
@@ -169,14 +170,19 @@ ethnl_set_channels(struct ethnl_req_info *req_info, struct genl_info *info)
if (ret)
return ret;
- /* Disabling channels, query zero-copy AF_XDP sockets */
+ /* ensure channels are not busy at the moment */
from_channel = channels.combined_count +
min(channels.rx_count, channels.tx_count);
- for (i = from_channel; i < old_total; i++)
+ for (i = from_channel; i < old_total; i++) {
+ if (netdev_rx_queue_peered(dev, i)) {
+ GENL_SET_ERR_MSG(info, "requested channel counts are too low due to existing queue peering");
+ return -EINVAL;
+ }
if (xsk_get_pool_from_qid(dev, i)) {
GENL_SET_ERR_MSG(info, "requested channel counts are too low for existing zerocopy AF_XDP sockets");
return -EINVAL;
}
+ }
ret = dev->ethtool_ops->set_channels(dev, &channels);
return ret < 0 ? ret : 1;
--
2.43.0
* [PATCH net-next 08/20] net: Proxy net_mp_{open,close}_rxq for mapped queues
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
When a process in a container wants to set up a memory provider, it
will use the virtual netdev and a mapped rxq, and call
net_mp_{open,close}_rxq to try and restart the queue. At this point,
proxy the queue restart onto the real rxq in the physical netdev.
For memory providers (io_uring zero-copy rx and devmem), this causes
the real rxq in the physical netdev to be filled from a memory provider
that has DMA-mapped memory from a process within the container.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
net/core/netdev_rx_queue.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index c7d9341b7630..238d3cd9677e 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -105,13 +105,21 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
if (!netdev_need_ops_lock(dev))
return -EOPNOTSUPP;
-
if (rxq_idx >= dev->real_num_rx_queues) {
NL_SET_ERR_MSG(extack, "rx queue index out of range");
return -ERANGE;
}
+
rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
+ rxq = __netif_get_rx_queue_peer(&dev, &rxq_idx);
+ /* Check again since dev might have changed */
+ if (!netdev_need_ops_lock(dev))
+ return -EOPNOTSUPP;
+ if (!dev->dev.parent) {
+ NL_SET_ERR_MSG(extack, "rx queue is mapped to a virtual netdev");
+ return -EBUSY;
+ }
if (dev->cfg->hds_config != ETHTOOL_TCP_DATA_SPLIT_ENABLED) {
NL_SET_ERR_MSG(extack, "tcp-data-split is disabled");
return -EINVAL;
@@ -124,8 +132,6 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
NL_SET_ERR_MSG(extack, "unable to custom memory provider to device with XDP program attached");
return -EEXIST;
}
-
- rxq = __netif_get_rx_queue(dev, rxq_idx);
if (rxq->mp_params.mp_ops) {
NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
return -EEXIST;
@@ -136,7 +142,6 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
return -EBUSY;
}
#endif
-
rxq->mp_params = *p;
ret = netdev_rx_queue_restart(dev, rxq_idx);
if (ret) {
@@ -166,7 +171,7 @@ void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
return;
- rxq = __netif_get_rx_queue(dev, ifq_idx);
+ rxq = __netif_get_rx_queue_peer(&dev, &ifq_idx);
/* Callers holding a netdev ref may get here after we already
* went thru shutdown via dev_memory_provider_uninstall().
--
2.43.0
* [PATCH net-next 09/20] xsk: Move NETDEV_XDP_ACT_ZC into generic header
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Move NETDEV_XDP_ACT_ZC into the xdp_sock_drv.h header such that external
code can reuse it, and rename it to the more generic NETDEV_XDP_ACT_XSK.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
include/net/xdp_sock_drv.h | 4 ++++
net/xdp/xsk_buff_pool.c | 6 +-----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 513c8e9704f6..47120666d8d6 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -12,6 +12,10 @@
#define XDP_UMEM_MIN_CHUNK_SHIFT 11
#define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
+#define NETDEV_XDP_ACT_XSK (NETDEV_XDP_ACT_BASIC | \
+ NETDEV_XDP_ACT_REDIRECT | \
+ NETDEV_XDP_ACT_XSK_ZEROCOPY)
+
struct xsk_cb_desc {
void *src;
u8 off;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index aa9788f20d0d..26165baf99f4 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -158,10 +158,6 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
}
}
-#define NETDEV_XDP_ACT_ZC (NETDEV_XDP_ACT_BASIC | \
- NETDEV_XDP_ACT_REDIRECT | \
- NETDEV_XDP_ACT_XSK_ZEROCOPY)
-
int xp_assign_dev(struct xsk_buff_pool *pool,
struct net_device *netdev, u16 queue_id, u16 flags)
{
@@ -203,7 +199,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
/* For copy-mode, we are done. */
return 0;
- if ((netdev->xdp_features & NETDEV_XDP_ACT_ZC) != NETDEV_XDP_ACT_ZC) {
+ if ((netdev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK) {
err = -EOPNOTSUPP;
goto err_unreg_pool;
}
--
2.43.0
* [PATCH net-next 10/20] xsk: Move pool registration into single function
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Small refactor to move the pool registration into xsk_reg_pool_at_qid(),
such that the pool's netdev and queue_id are recorded there. No change
in functionality.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
net/xdp/xsk.c | 5 +++++
net/xdp/xsk_buff_pool.c | 16 +++-------------
2 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 72e34bd2d925..82ad89f6ba35 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -141,6 +141,11 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
dev->real_num_rx_queues,
dev->real_num_tx_queues))
return -EINVAL;
+ if (xsk_get_pool_from_qid(dev, queue_id))
+ return -EBUSY;
+
+ pool->netdev = dev;
+ pool->queue_id = queue_id;
if (queue_id < dev->real_num_rx_queues)
dev->_rx[queue_id].pool = pool;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 26165baf99f4..375696f895d4 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -169,32 +169,24 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
force_zc = flags & XDP_ZEROCOPY;
force_copy = flags & XDP_COPY;
-
if (force_zc && force_copy)
return -EINVAL;
- if (xsk_get_pool_from_qid(netdev, queue_id))
- return -EBUSY;
-
- pool->netdev = netdev;
- pool->queue_id = queue_id;
err = xsk_reg_pool_at_qid(netdev, pool, queue_id);
if (err)
return err;
if (flags & XDP_USE_SG)
pool->umem->flags |= XDP_UMEM_SG_FLAG;
-
if (flags & XDP_USE_NEED_WAKEUP)
pool->uses_need_wakeup = true;
- /* Tx needs to be explicitly woken up the first time. Also
- * for supporting drivers that do not implement this
- * feature. They will always have to call sendto() or poll().
+ /* Tx needs to be explicitly woken up the first time. Also
+ * for supporting drivers that do not implement this feature.
+ * They will always have to call sendto() or poll().
*/
pool->cached_need_wakeup = XDP_WAKEUP_TX;
dev_hold(netdev);
-
if (force_copy)
/* For copy-mode, we are done. */
return 0;
@@ -203,12 +195,10 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
err = -EOPNOTSUPP;
goto err_unreg_pool;
}
-
if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
err = -EOPNOTSUPP;
goto err_unreg_pool;
}
-
if (dev_get_min_mp_channel_count(netdev)) {
err = -EBUSY;
goto err_unreg_pool;
--
2.43.0
* [PATCH net-next 11/20] xsk: Add small helper xp_pool_bindable
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Add another small helper called xp_pool_bindable() and move the current
dev_get_min_mp_channel_count() test into this helper. Pass in the pool
object, such that the netdev is derived from the previously registered
pool.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
net/xdp/xsk_buff_pool.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 375696f895d4..d2109d683fe5 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -54,6 +54,11 @@ int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
return 0;
}
+static bool xp_pool_bindable(struct xsk_buff_pool *pool)
+{
+ return dev_get_min_mp_channel_count(pool->netdev) == 0;
+}
+
struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
struct xdp_umem *umem)
{
@@ -199,7 +204,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
err = -EOPNOTSUPP;
goto err_unreg_pool;
}
- if (dev_get_min_mp_channel_count(netdev)) {
+ if (!xp_pool_bindable(pool)) {
err = -EBUSY;
goto err_unreg_pool;
}
--
2.43.0
* [PATCH net-next 12/20] xsk: Change xsk_rcv_check to check netdev/queue_id from pool
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Change the xsk_rcv_check() test for inbound packets to use the
xs->pool->netdev and xs->pool->queue_id of the bound socket rather than
xs->dev and xs->queue_id, since the latter could point to a virtual
device with a mapped rxq rather than the physical backing device of
the pool.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
net/xdp/xsk.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 82ad89f6ba35..cf40c70ee59f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -340,15 +340,13 @@ static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
{
if (!xsk_is_bound(xs))
return -ENXIO;
-
- if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+ if (xs->pool->netdev != xdp->rxq->dev ||
+ xs->pool->queue_id != xdp->rxq->queue_index)
return -EINVAL;
-
if (len > xsk_pool_get_rx_frame_size(xs->pool) && !xs->sg) {
xs->rx_dropped++;
return -ENOSPC;
}
-
return 0;
}
--
2.43.0
* [PATCH net-next 13/20] xsk: Proxy pool management for mapped queues
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Similarly to what we do in net_mp_{open,close}_rxq for mapped queues,
proxy also xsk_{reg,clear}_pool_at_qid via __netif_get_rx_queue_peer,
such that when a virtual netdev picks a mapped rxq, the request gets
through to the real rxq in the physical netdev.
Change the function signatures for queue_id to unsigned int in order
to pass the queue_id parameter into __netif_get_rx_queue_peer. The
proxying is only relevant for queue_id < dev->real_num_rx_queues since
right now it is only supported for rxqs.
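
From the application's point of view nothing changes: the AF_XDP socket
is bound to the virtual netdev's ifindex and the mapped queue id. Below
is a minimal sketch using libxdp, where the device name "nk0", queue
id 1 and buffer sizing are hypothetical; the queue id corresponds to
the dst-queue-id returned by bind-queue:

  #include <sys/mman.h>
  #include <linux/if_xdp.h>
  #include <xdp/xsk.h>

  int main(void)
  {
          struct xsk_socket_config cfg = {
                  .rx_size      = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                  .tx_size      = XSK_RING_PROD__DEFAULT_NUM_DESCS,
                  /* assumption: no default XDP program on netkit */
                  .libxdp_flags = XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD,
                  .bind_flags   = XDP_ZEROCOPY,
          };
          struct xsk_ring_prod fq, tx;
          struct xsk_ring_cons cq, rx;
          struct xsk_socket *xsk;
          struct xsk_umem *umem;
          size_t size = 1 << 22;
          void *bufs;

          bufs = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (bufs == MAP_FAILED)
                  return 1;
          if (xsk_umem__create(&umem, bufs, size, &fq, &cq, NULL))
                  return 1;
          /* ifname is the virtual netdev, queue id the mapped rxq */
          if (xsk_socket__create(&xsk, "nk0", 1, umem, &rx, &tx, &cfg))
                  return 1;
          /* ... populate fq, then consume rx as usual ... */
          xsk_socket__delete(xsk);
          xsk_umem__delete(umem);
          return 0;
  }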
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
include/net/xdp_sock_drv.h | 4 ++--
net/xdp/xsk.c | 16 +++++++++++-----
net/xdp/xsk.h | 5 ++---
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 47120666d8d6..709af292cba7 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -29,7 +29,7 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
void xsk_tx_release(struct xsk_buff_pool *pool);
struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
- u16 queue_id);
+ unsigned int queue_id);
void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool);
void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool);
void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool);
@@ -286,7 +286,7 @@ static inline void xsk_tx_release(struct xsk_buff_pool *pool)
}
static inline struct xsk_buff_pool *
-xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id)
+xsk_get_pool_from_qid(struct net_device *dev, unsigned int queue_id)
{
return NULL;
}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cf40c70ee59f..b9efa6d8a112 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -23,6 +23,8 @@
#include <linux/netdevice.h>
#include <linux/rculist.h>
#include <linux/vmalloc.h>
+
+#include <net/netdev_queues.h>
#include <net/xdp_sock_drv.h>
#include <net/busy_poll.h>
#include <net/netdev_lock.h>
@@ -111,19 +113,20 @@ bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool)
EXPORT_SYMBOL(xsk_uses_need_wakeup);
struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
- u16 queue_id)
+ unsigned int queue_id)
{
if (queue_id < dev->real_num_rx_queues)
return dev->_rx[queue_id].pool;
if (queue_id < dev->real_num_tx_queues)
return dev->_tx[queue_id].pool;
-
return NULL;
}
EXPORT_SYMBOL(xsk_get_pool_from_qid);
-void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
+void xsk_clear_pool_at_qid(struct net_device *dev, unsigned int queue_id)
{
+ if (queue_id < dev->real_num_rx_queues)
+ __netif_get_rx_queue_peer(&dev, &queue_id);
if (queue_id < dev->num_rx_queues)
dev->_rx[queue_id].pool = NULL;
if (queue_id < dev->num_tx_queues)
@@ -135,7 +138,7 @@ void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
* This might also change during run time.
*/
int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
- u16 queue_id)
+ unsigned int queue_id)
{
if (queue_id >= max_t(unsigned int,
dev->real_num_rx_queues,
@@ -143,6 +146,10 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
return -EINVAL;
if (xsk_get_pool_from_qid(dev, queue_id))
return -EBUSY;
+ if (queue_id < dev->real_num_rx_queues)
+ __netif_get_rx_queue_peer(&dev, &queue_id);
+ if (xsk_get_pool_from_qid(dev, queue_id))
+ return -EBUSY;
pool->netdev = dev;
pool->queue_id = queue_id;
@@ -151,7 +158,6 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
dev->_rx[queue_id].pool = pool;
if (queue_id < dev->real_num_tx_queues)
dev->_tx[queue_id].pool = pool;
-
return 0;
}
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index a4bc4749faac..54d9a7736fd2 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,8 +41,7 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
struct xdp_sock __rcu **map_entry);
-void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
+void xsk_clear_pool_at_qid(struct net_device *dev, unsigned int queue_id);
int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
- u16 queue_id);
-
+ unsigned int queue_id);
#endif /* XSK_H_ */
--
2.43.0
* [PATCH net-next 14/20] netkit: Add single device mode for netkit
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Add a single device mode for netkit in addition to the default netkit
pairs. The primary target for the paired devices is, of course, to
connect network namespaces, and support has been implemented in projects
like Cilium [0]. For the rxq binding, the plan is to support two main
scenarios related to single device mode:
* For the use-case of io_uring zero-copy, the control plane can either
  set up a netkit pair where the peer device can perform rxq binding which
  is then tied to the lifetime of the peer device, or the control plane
  can use a regular netkit pair to connect the hostns to a Pod/container
  and dynamically add/remove rxq bindings through a single device without
  having to interrupt the device pair. In the case of io_uring, the memory
  pool is used as skb non-linear pages, and thus the skb makes its way
  through the regular stack into netkit. Things like the netkit policy
  when no BPF is attached, skb scrubbing, etc. apply as-is when the paired
  devices are used, or when the backend memory is tied to the single
  device and traffic goes through a paired device.
* For the use-case of AF_XDP, the control plane needs to use netkit in
  the single device mode. The single device mode currently enforces only
  a pass policy when no BPF is attached, and does not yet support BPF
  link attachments for AF_XDP. skbs sent to that device get dropped at
  the moment. Given AF_XDP operates at a lower layer of the stack, tying
  this to the netkit pair did not make sense. In future, the plan is to
  allow BPF at the XDP layer which can: i) process traffic coming from
  the AF_XDP application (e.g. QEMU with AF_XDP backend) to filter egress
  traffic or to push selected egress traffic up through the single netkit
  device to the local stack (e.g. DHCP requests), and ii) vice-versa,
  redirect skbs sent to the single netkit device into the AF_XDP
  application (e.g. DHCP replies). Also, the control plane can dynamically
  add/remove rxq bindings for the single netkit device without having to
  interrupt (e.g. down/up cycle) the main netkit pair for the Pod which
  has traffic going in and out. A minimal sketch of creating a
  single-mode device via rtnetlink follows below.
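
The sketch below creates such a device over RTM_NEWLINK using libmnl.
The device name "nk0" is hypothetical; IFLA_NETKIT_PAIRING is the
attribute added in this patch, and iproute2 support is left to a
separate change:

  #include <time.h>
  #include <sys/socket.h>
  #include <libmnl/libmnl.h>
  #include <linux/if_link.h>
  #include <linux/rtnetlink.h>

  int main(void)
  {
          char buf[MNL_SOCKET_BUFFER_SIZE];
          struct nlattr *linkinfo, *data;
          struct mnl_socket *nl;
          struct nlmsghdr *nlh;
          struct ifinfomsg *ifm;

          nlh = mnl_nlmsg_put_header(buf);
          nlh->nlmsg_type = RTM_NEWLINK;
          nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
          nlh->nlmsg_seq = time(NULL);
          ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
          ifm->ifi_family = AF_UNSPEC;

          mnl_attr_put_strz(nlh, IFLA_IFNAME, "nk0");
          linkinfo = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
          mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "netkit");
          data = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
          /* single device mode instead of the default pair */
          mnl_attr_put_u32(nlh, IFLA_NETKIT_PAIRING, NETKIT_DEVICE_SINGLE);
          mnl_attr_nest_end(nlh, data);
          mnl_attr_nest_end(nlh, linkinfo);

          nl = mnl_socket_open(NETLINK_ROUTE);
          if (!nl || mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID) < 0)
                  return 1;
          if (mnl_socket_sendto(nl, nlh, nlh->nlmsg_len) < 0)
                  return 1;
          mnl_socket_close(nl);
          return 0;
  }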
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
---
drivers/net/netkit.c | 108 ++++++++++++++++++++++-------------
include/uapi/linux/if_link.h | 6 ++
2 files changed, 74 insertions(+), 40 deletions(-)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 492be60f2e70..ceb1393ee599 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -25,6 +25,7 @@ struct netkit {
/* Needed in slow-path */
enum netkit_mode mode;
+ enum netkit_pairing pair;
bool primary;
u32 headroom;
};
@@ -133,6 +134,10 @@ static int netkit_open(struct net_device *dev)
struct netkit *nk = netkit_priv(dev);
struct net_device *peer = rtnl_dereference(nk->peer);
+ if (nk->pair == NETKIT_DEVICE_SINGLE) {
+ netif_carrier_on(dev);
+ return 0;
+ }
if (!peer)
return -ENOTCONN;
if (peer->flags & IFF_UP) {
@@ -333,6 +338,7 @@ static int netkit_new_link(struct net_device *dev,
enum netkit_scrub scrub_prim = NETKIT_SCRUB_DEFAULT;
enum netkit_scrub scrub_peer = NETKIT_SCRUB_DEFAULT;
struct nlattr *peer_tb[IFLA_MAX + 1], **tbp, *attr;
+ enum netkit_pairing pair = NETKIT_DEVICE_PAIR;
enum netkit_action policy_prim = NETKIT_PASS;
enum netkit_action policy_peer = NETKIT_PASS;
struct nlattr **data = params->data;
@@ -341,7 +347,7 @@ static int netkit_new_link(struct net_device *dev,
struct nlattr **tb = params->tb;
u16 headroom = 0, tailroom = 0;
struct ifinfomsg *ifmp = NULL;
- struct net_device *peer;
+ struct net_device *peer = NULL;
char ifname[IFNAMSIZ];
struct netkit *nk;
int err;
@@ -378,6 +384,8 @@ static int netkit_new_link(struct net_device *dev,
headroom = nla_get_u16(data[IFLA_NETKIT_HEADROOM]);
if (data[IFLA_NETKIT_TAILROOM])
tailroom = nla_get_u16(data[IFLA_NETKIT_TAILROOM]);
+ if (data[IFLA_NETKIT_PAIRING])
+ pair = nla_get_u32(data[IFLA_NETKIT_PAIRING]);
}
if (ifmp && tbp[IFLA_IFNAME]) {
@@ -390,45 +398,49 @@ static int netkit_new_link(struct net_device *dev,
if (mode != NETKIT_L2 &&
(tb[IFLA_ADDRESS] || tbp[IFLA_ADDRESS]))
return -EOPNOTSUPP;
+ if (pair != NETKIT_DEVICE_PAIR &&
+ (tb != tbp ||
+ tb[IFLA_NETKIT_PEER_POLICY] ||
+ tb[IFLA_NETKIT_PEER_SCRUB] ||
+ policy_prim != NETKIT_PASS))
+ return -EOPNOTSUPP;
- peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
- &netkit_link_ops, tbp, extack);
- if (IS_ERR(peer))
- return PTR_ERR(peer);
-
- netif_inherit_tso_max(peer, dev);
- if (headroom) {
- peer->needed_headroom = headroom;
- dev->needed_headroom = headroom;
- }
- if (tailroom) {
- peer->needed_tailroom = tailroom;
- dev->needed_tailroom = tailroom;
- }
-
- if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
- eth_hw_addr_random(peer);
- if (ifmp && dev->ifindex)
- peer->ifindex = ifmp->ifi_index;
-
- nk = netkit_priv(peer);
- nk->primary = false;
- nk->policy = policy_peer;
- nk->scrub = scrub_peer;
- nk->mode = mode;
- nk->headroom = headroom;
- bpf_mprog_bundle_init(&nk->bundle);
+ if (pair == NETKIT_DEVICE_PAIR) {
+ peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
+ &netkit_link_ops, tbp, extack);
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ netif_inherit_tso_max(peer, dev);
+ if (headroom)
+ peer->needed_headroom = headroom;
+ if (tailroom)
+ peer->needed_tailroom = tailroom;
+ if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
+ eth_hw_addr_random(peer);
+ if (ifmp && dev->ifindex)
+ peer->ifindex = ifmp->ifi_index;
- err = register_netdevice(peer);
- if (err < 0)
- goto err_register_peer;
- netif_carrier_off(peer);
- if (mode == NETKIT_L2)
- dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
+ nk = netkit_priv(peer);
+ nk->primary = false;
+ nk->policy = policy_peer;
+ nk->scrub = scrub_peer;
+ nk->mode = mode;
+ nk->pair = pair;
+ nk->headroom = headroom;
+ bpf_mprog_bundle_init(&nk->bundle);
+
+ err = register_netdevice(peer);
+ if (err < 0)
+ goto err_register_peer;
+ netif_carrier_off(peer);
+ if (mode == NETKIT_L2)
+ dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
- err = rtnl_configure_link(peer, NULL, 0, NULL);
- if (err < 0)
- goto err_configure_peer;
+ err = rtnl_configure_link(peer, NULL, 0, NULL);
+ if (err < 0)
+ goto err_configure_peer;
+ }
if (mode == NETKIT_L2 && !tb[IFLA_ADDRESS])
eth_hw_addr_random(dev);
@@ -436,12 +448,17 @@ static int netkit_new_link(struct net_device *dev,
nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
else
strscpy(dev->name, "nk%d", IFNAMSIZ);
+ if (headroom)
+ dev->needed_headroom = headroom;
+ if (tailroom)
+ dev->needed_tailroom = tailroom;
nk = netkit_priv(dev);
nk->primary = true;
nk->policy = policy_prim;
nk->scrub = scrub_prim;
nk->mode = mode;
+ nk->pair = pair;
nk->headroom = headroom;
bpf_mprog_bundle_init(&nk->bundle);
@@ -453,10 +470,12 @@ static int netkit_new_link(struct net_device *dev,
dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
rcu_assign_pointer(netkit_priv(dev)->peer, peer);
- rcu_assign_pointer(netkit_priv(peer)->peer, dev);
+ if (peer)
+ rcu_assign_pointer(netkit_priv(peer)->peer, dev);
return 0;
err_configure_peer:
- unregister_netdevice(peer);
+ if (peer)
+ unregister_netdevice(peer);
return err;
err_register_peer:
free_netdev(peer);
@@ -516,6 +535,8 @@ static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 whi
nk = netkit_priv(dev);
if (!nk->primary)
return ERR_PTR(-EACCES);
+ if (nk->pair == NETKIT_DEVICE_SINGLE)
+ return ERR_PTR(-EOPNOTSUPP);
if (which == BPF_NETKIT_PEER) {
dev = rcu_dereference_rtnl(nk->peer);
if (!dev)
@@ -877,6 +898,7 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
{ IFLA_NETKIT_PEER_INFO, "peer info" },
{ IFLA_NETKIT_HEADROOM, "headroom" },
{ IFLA_NETKIT_TAILROOM, "tailroom" },
+ { IFLA_NETKIT_PAIRING, "pairing" },
};
if (!nk->primary) {
@@ -896,9 +918,11 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
}
if (data[IFLA_NETKIT_POLICY]) {
+ err = -EOPNOTSUPP;
attr = data[IFLA_NETKIT_POLICY];
policy = nla_get_u32(attr);
- err = netkit_check_policy(policy, attr, extack);
+ if (nk->pair == NETKIT_DEVICE_PAIR)
+ err = netkit_check_policy(policy, attr, extack);
if (err)
return err;
WRITE_ONCE(nk->policy, policy);
@@ -929,6 +953,7 @@ static size_t netkit_get_size(const struct net_device *dev)
nla_total_size(sizeof(u8)) + /* IFLA_NETKIT_PRIMARY */
nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_HEADROOM */
nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_TAILROOM */
+ nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PAIRING */
0;
}
@@ -949,6 +974,8 @@ static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
return -EMSGSIZE;
if (nla_put_u16(skb, IFLA_NETKIT_TAILROOM, dev->needed_tailroom))
return -EMSGSIZE;
+ if (nla_put_u32(skb, IFLA_NETKIT_PAIRING, nk->pair))
+ return -EMSGSIZE;
if (peer) {
nk = netkit_priv(peer);
@@ -970,6 +997,7 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
[IFLA_NETKIT_TAILROOM] = { .type = NLA_U16 },
[IFLA_NETKIT_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
[IFLA_NETKIT_PEER_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
+ [IFLA_NETKIT_PAIRING] = NLA_POLICY_MAX(NLA_U32, NETKIT_DEVICE_SINGLE),
[IFLA_NETKIT_PRIMARY] = { .type = NLA_REJECT,
.reject_message = "Primary attribute is read-only" },
};
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 45f56c9f95d9..4a2f781f3cca 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1294,6 +1294,11 @@ enum netkit_mode {
NETKIT_L3,
};
+enum netkit_pairing {
+ NETKIT_DEVICE_PAIR,
+ NETKIT_DEVICE_SINGLE,
+};
+
/* NETKIT_SCRUB_NONE leaves clearing skb->{mark,priority} up to
* the BPF program if attached. This also means the latter can
* consume the two fields if they were populated earlier.
@@ -1318,6 +1323,7 @@ enum {
IFLA_NETKIT_PEER_SCRUB,
IFLA_NETKIT_HEADROOM,
IFLA_NETKIT_TAILROOM,
+ IFLA_NETKIT_PAIRING,
__IFLA_NETKIT_MAX,
};
#define IFLA_NETKIT_MAX (__IFLA_NETKIT_MAX - 1)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 15/20] netkit: Document fast vs slowpath members via macros
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (13 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 14/20] netkit: Add single device mode for netkit Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-19 21:31 ` [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc Daniel Borkmann
` (6 subsequent siblings)
21 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Instead of a comment, just use two cacheline groups to document the intent
for members often accessed in the fast or slow path.
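As a side note, the cacheline group markers would also allow compile-time
layout checks via static asserts should we ever want to enforce placement.
Purely illustrative sketch, not part of this patch:

  CACHELINE_ASSERT_GROUP_MEMBER(struct netkit, netkit_fastpath, peer);
  CACHELINE_ASSERT_GROUP_MEMBER(struct netkit, netkit_fastpath, active);
  CACHELINE_ASSERT_GROUP_MEMBER(struct netkit, netkit_fastpath, bundle);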
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
---
drivers/net/netkit.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index ceb1393ee599..8f1285513d82 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -16,18 +16,20 @@
#define DRV_NAME "netkit"
struct netkit {
- /* Needed in fast-path */
+ __cacheline_group_begin(netkit_fastpath);
struct net_device __rcu *peer;
struct bpf_mprog_entry __rcu *active;
enum netkit_action policy;
enum netkit_scrub scrub;
struct bpf_mprog_bundle bundle;
+ __cacheline_group_end(netkit_fastpath);
- /* Needed in slow-path */
+ __cacheline_group_begin(netkit_slowpath);
enum netkit_mode mode;
enum netkit_pairing pair;
bool primary;
u32 headroom;
+ __cacheline_group_end(netkit_slowpath);
};
struct netkit_link {
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (14 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 15/20] netkit: Document fast vs slowpath members via macros Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-27 1:17 ` Jordan Rife
2025-09-19 21:31 ` [PATCH net-next 17/20] netkit: Implement ndo_queue_create Daniel Borkmann
` (5 subsequent siblings)
21 siblings, 1 reply; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Implement rtnl_link_ops->alloc that allows the number of rx queues to be
set when netkit is created. By default, netkit has only a single rxq (and
single txq). The number of queues is deliberately not allowed to be changed
via ethtool -L and is fixed for the lifetime of a netkit instance.
For netkit device creation, a numrxqueues value larger than one can be
specified. These rxqs are then mappable to real rxqs in physical netdevs:
ip link add numrxqueues 2 type netkit
As a starting point, the numrxqueues limit for netkit is currently set
to 2, but future work will allow mapping multiple real rxqs from physical
netdevs, potentially at some point even from different physical netdevs.
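For illustration, with the current maximum of two rxqs enforced in
netkit_alloc() below, the expected semantics are roughly (sketch, not
verbatim tooling output):

  # ip link add numrxqueues 3 type netkit   (rejected, -EOPNOTSUPP)
  # ip link add numrxqueues 2 type netkit   (ok, the second rxq stays
                                             unused until a bind-queue
                                             operation maps it)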
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
drivers/net/netkit.c | 78 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 72 insertions(+), 6 deletions(-)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 8f1285513d82..e5dfbf7ea351 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -9,11 +9,19 @@
#include <linux/bpf_mprog.h>
#include <linux/indirect_call_wrapper.h>
+#include <net/netdev_queues.h>
+#include <net/netdev_rx_queue.h>
#include <net/netkit.h>
#include <net/dst.h>
#include <net/tcx.h>
-#define DRV_NAME "netkit"
+#define NETKIT_DRV_NAME "netkit"
+
+#define NETKIT_NUM_TX_QUEUES_MAX 1
+#define NETKIT_NUM_RX_QUEUES_MAX 2
+
+#define NETKIT_NUM_TX_QUEUES_REAL 1
+#define NETKIT_NUM_RX_QUEUES_REAL 1
struct netkit {
__cacheline_group_begin(netkit_fastpath);
@@ -37,6 +45,8 @@ struct netkit_link {
struct net_device *dev;
};
+static struct rtnl_link_ops netkit_link_ops;
+
static __always_inline int
netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
enum netkit_action ret)
@@ -243,13 +253,69 @@ static const struct net_device_ops netkit_netdev_ops = {
static void netkit_get_drvinfo(struct net_device *dev,
struct ethtool_drvinfo *info)
{
- strscpy(info->driver, DRV_NAME, sizeof(info->driver));
+ strscpy(info->driver, NETKIT_DRV_NAME, sizeof(info->driver));
+}
+
+static void netkit_get_channels(struct net_device *dev,
+ struct ethtool_channels *channels)
+{
+ channels->max_rx = dev->num_rx_queues;
+ channels->max_tx = dev->num_tx_queues;
+ channels->max_other = 0;
+ channels->max_combined = 1;
+ channels->rx_count = dev->real_num_rx_queues;
+ channels->tx_count = dev->real_num_tx_queues;
+ channels->other_count = 0;
+ channels->combined_count = 0;
}
static const struct ethtool_ops netkit_ethtool_ops = {
.get_drvinfo = netkit_get_drvinfo,
+ .get_channels = netkit_get_channels,
};
+static struct net_device *netkit_alloc(struct nlattr *tb[],
+ const char *ifname,
+ unsigned char name_assign_type,
+ unsigned int num_tx_queues,
+ unsigned int num_rx_queues)
+{
+ const struct rtnl_link_ops *ops = &netkit_link_ops;
+ struct net_device *dev;
+
+ if (num_tx_queues > NETKIT_NUM_TX_QUEUES_MAX ||
+ num_rx_queues > NETKIT_NUM_RX_QUEUES_MAX)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ dev = alloc_netdev_mqs(ops->priv_size, ifname,
+ name_assign_type, ops->setup,
+ num_tx_queues, num_rx_queues);
+ if (dev) {
+ dev->real_num_tx_queues = NETKIT_NUM_TX_QUEUES_REAL;
+ dev->real_num_rx_queues = NETKIT_NUM_RX_QUEUES_REAL;
+ }
+ return dev;
+}
+
+static void netkit_queue_unpeer(struct net_device *dev)
+{
+ struct netdev_rx_queue *src_rxq, *dst_rxq;
+ struct net_device *src_dev;
+ int i;
+
+ if (dev->real_num_rx_queues == 1)
+ return;
+ for (i = 1; i < dev->real_num_rx_queues; i++) {
+ dst_rxq = __netif_get_rx_queue(dev, i);
+ src_rxq = dst_rxq->peer;
+ src_dev = src_rxq->dev;
+
+ netdev_lock(src_dev);
+ netdev_rx_queue_unpeer(src_dev, src_rxq, dst_rxq);
+ netdev_unlock(src_dev);
+ }
+}
+
static void netkit_setup(struct net_device *dev)
{
static const netdev_features_t netkit_features_hw_vlan =
@@ -330,8 +396,6 @@ static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
return 0;
}
-static struct rtnl_link_ops netkit_link_ops;
-
static int netkit_new_link(struct net_device *dev,
struct rtnl_newlink_params *params,
struct netlink_ext_ack *extack)
@@ -865,6 +929,7 @@ static void netkit_release_all(struct net_device *dev)
static void netkit_uninit(struct net_device *dev)
{
netkit_release_all(dev);
+ netkit_queue_unpeer(dev);
}
static void netkit_del_link(struct net_device *dev, struct list_head *head)
@@ -1005,8 +1070,9 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
};
static struct rtnl_link_ops netkit_link_ops = {
- .kind = DRV_NAME,
+ .kind = NETKIT_DRV_NAME,
.priv_size = sizeof(struct netkit),
+ .alloc = netkit_alloc,
.setup = netkit_setup,
.newlink = netkit_new_link,
.dellink = netkit_del_link,
@@ -1042,4 +1108,4 @@ MODULE_DESCRIPTION("BPF-programmable network device");
MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
MODULE_LICENSE("GPL");
-MODULE_ALIAS_RTNL_LINK(DRV_NAME);
+MODULE_ALIAS_RTNL_LINK(NETKIT_DRV_NAME);
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 17/20] netkit: Implement ndo_queue_create
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (15 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-19 21:31 ` [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
` (4 subsequent siblings)
21 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Implement ndo_queue_create() which adds a new rxq during the bind-queue
ynl netdev operation. Queues can be created either in single device mode
or, in dual device mode, for the netkit peer device which gets placed into
the target network namespace. In dual device mode, binding against the
primary device does not make sense for the targeted use cases and is
therefore rejected.
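For context, a rough sketch of the expected call flow on the core side of
the bind-queue operation (cf. netdev_nl_bind_queue_doit()); the peering
helper name is an assumption based on the netdev_rx_queue_unpeer()
counterpart from the previous patch:

  /* Sketch only: error handling and locking omitted. */
  dst_qid = dst_dev->queue_mgmt_ops->ndo_queue_create(dst_dev);
  if (dst_qid < 0)
          return dst_qid;
  dst_rxq = __netif_get_rx_queue(dst_dev, dst_qid);
  src_rxq = __netif_get_rx_queue(src_dev, src_qid);
  /* Peer the freshly created mapped rxq with the real one. */
  netdev_rx_queue_peer(src_dev, src_rxq, dst_rxq);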
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
drivers/net/netkit.c | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index e5dfbf7ea351..27ff84833f28 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -274,6 +274,34 @@ static const struct ethtool_ops netkit_ethtool_ops = {
.get_channels = netkit_get_channels,
};
+static int netkit_queue_create(struct net_device *dev)
+{
+ struct netkit *nk = netkit_priv(dev);
+ u32 rxq_count_old, rxq_count_new;
+ int err;
+
+ rxq_count_old = dev->real_num_rx_queues;
+ rxq_count_new = rxq_count_old + 1;
+
+ /* Only allow to bind in single device mode or to bind against
+ * the peer device which then ends up in the target netns.
+ */
+ if (nk->pair == NETKIT_DEVICE_PAIR && nk->primary)
+ return -EOPNOTSUPP;
+
+ if (netif_running(dev))
+ netif_carrier_off(dev);
+ err = netif_set_real_num_rx_queues(dev, rxq_count_new);
+ if (netif_running(dev))
+ netif_carrier_on(dev);
+
+ return err ? err : rxq_count_new;
+}
+
+static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
+ .ndo_queue_create = netkit_queue_create,
+};
+
static struct net_device *netkit_alloc(struct nlattr *tb[],
const char *ifname,
unsigned char name_assign_type,
@@ -346,8 +374,9 @@ static void netkit_setup(struct net_device *dev)
dev->priv_flags |= IFF_DISABLE_NETPOLL;
dev->lltx = true;
- dev->ethtool_ops = &netkit_ethtool_ops;
- dev->netdev_ops = &netkit_netdev_ops;
+ dev->netdev_ops = &netkit_netdev_ops;
+ dev->ethtool_ops = &netkit_ethtool_ops;
+ dev->queue_mgmt_ops = &netkit_queue_mgmt_ops;
dev->features |= netkit_features;
dev->hw_features = netkit_features;
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (16 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 17/20] netkit: Implement ndo_queue_create Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-22 3:17 ` zf
2025-09-19 21:31 ` [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications Daniel Borkmann
` (3 subsequent siblings)
21 siblings, 1 reply; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
This adds the last missing bit to netkit for supporting io_uring with
zero-copy mode [0]. Up until this point it was not possible to consume
io_uring zero-copy receive out of containers or Kubernetes Pods where
applications run in their own network namespace.
Thus, implement ndo_queue_get_dma_dev() in netkit to return the DMA device
of the underlying physical device for the real rxq. This allows memory
providers like io_uring zero-copy or devmem to bind to the physically
mapped rxq in netkit.
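Conceptually, consumers then resolve the DMA device through the mapped
rxq; a minimal sketch of the consumer side, using the generic
netdev_queue_get_dma_dev() helper which this patch hooks into:

  /* E.g. a memory provider binding to (dev, rxq_idx): */
  struct device *dma_dev = netdev_queue_get_dma_dev(dev, rxq_idx);

  if (!dma_dev)
          return -EOPNOTSUPP;   /* no peered real rxq, no DMA device */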
io_uring example with eth0 being a physical device with 16 queues, where
netkit is bound to the last queue; iou-zcrx is the binary built from the
kernel selftests.
Flow steering to that queue is based on the service VIP:port of the
server utilizing io_uring:
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
# ip netns add foo
# ip link add numrxqueues 2 type netkit
# ynl-bind eth0 15 nk0
# ip link set nk0 netns foo
# ip link set nk1 up
# ip netns exec foo ip link set lo up
# ip netns exec foo ip link set nk0 up
# ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
[ ... setup routing etc to get external traffic into the netns ... ]
# ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1
Remote io_uring client:
# ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536
We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC as well as Broadcom BCM957504 (bnxt_en) 100G NIC, both
supporting TCP header/data split. For Cilium, the plan is to open
up support for io_uring in zero-copy mode for regular Kubernetes Pods
when Cilium is configured with netkit datapath mode.
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring [0]
---
drivers/net/netkit.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 27ff84833f28..5129b27a7c3c 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -274,6 +274,21 @@ static const struct ethtool_ops netkit_ethtool_ops = {
.get_channels = netkit_get_channels,
};
+static struct device *netkit_queue_get_dma_dev(struct net_device *dev, int idx)
+{
+ struct netdev_rx_queue *rxq, *peer_rxq;
+ unsigned int peer_idx;
+
+ rxq = __netif_get_rx_queue(dev, idx);
+ if (!rxq->peer)
+ return NULL;
+
+ peer_rxq = rxq->peer;
+ peer_idx = get_netdev_rx_queue_index(peer_rxq);
+
+ return netdev_queue_get_dma_dev(peer_rxq->dev, peer_idx);
+}
+
static int netkit_queue_create(struct net_device *dev)
{
struct netkit *nk = netkit_priv(dev);
@@ -299,7 +314,8 @@ static int netkit_queue_create(struct net_device *dev)
}
static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
- .ndo_queue_create = netkit_queue_create,
+ .ndo_queue_get_dma_dev = netkit_queue_get_dma_dev,
+ .ndo_queue_create = netkit_queue_create,
};
static struct net_device *netkit_alloc(struct nlattr *tb[],
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (17 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-23 11:42 ` Toke Høiland-Jørgensen
2025-09-19 21:31 ` [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application Daniel Borkmann
` (2 subsequent siblings)
21 siblings, 1 reply; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Enable support for AF_XDP applications to operate on a netkit device.
The goal is that AF_XDP applications can natively consume AF_XDP from
within network namespaces. The use case from Cilium's side is to support
Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
virtual machine management add-on for Kubernetes which aims to provide
a common ground for virtualization. KubeVirt spawns the VMs inside
Kubernetes Pods which reside in their own network namespace just like
regular Pods.
Raw QEMU AF_XDP backend example with eth0 being a physical device with
16 queues, where netkit is bound to the last queue (for multi-queue, an
RSS context can be used if supported by the driver):
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type ether \
src 00:00:00:00:00:00 \
src-mask ff:ff:ff:ff:ff:ff \
dst $mac dst-mask 00:00:00:00:00:00 \
proto 0 proto-mask 0xffff action 15
# ip netns add foo
# ip link add numrxqueues 2 nk type netkit single
# ynl-bind eth0 15 nk
# ip link set nk netns foo
# ip netns exec foo ip link set lo up
# ip netns exec foo ip link set nk up
# ip netns exec foo qemu-system-x86_64 \
-kernel $kernel \
-drive file=${image_name},index=0,media=disk,format=raw \
-append "root=/dev/sda rw console=ttyS0" \
-cpu host \
-m $memory \
-enable-kvm \
-device virtio-net-pci,netdev=net0,mac=$mac \
-netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
-nographic
We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
100G NIC with successful network connectivity out of QEMU. An earlier
iteration of this work was presented at LSF/MM/BPF [0].
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
---
drivers/net/netkit.c | 121 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 121 insertions(+)
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 5129b27a7c3c..a1d8a78bab0b 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -11,6 +11,7 @@
#include <net/netdev_queues.h>
#include <net/netdev_rx_queue.h>
+#include <net/xdp_sock_drv.h>
#include <net/netkit.h>
#include <net/dst.h>
#include <net/tcx.h>
@@ -234,6 +235,122 @@ static void netkit_get_stats(struct net_device *dev,
stats->tx_dropped = DEV_STATS_READ(dev, tx_dropped);
}
+static int netkit_xsk(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ struct netkit *nk = netkit_priv(dev);
+ struct netdev_bpf xdp_lower;
+ struct netdev_rx_queue *rxq;
+ struct net_device *phys;
+
+ switch (xdp->command) {
+ case XDP_SETUP_XSK_POOL:
+ if (nk->pair == NETKIT_DEVICE_PAIR)
+ return -EOPNOTSUPP;
+ if (xdp->xsk.queue_id >= dev->real_num_rx_queues)
+ return -EINVAL;
+
+ rxq = __netif_get_rx_queue(dev, xdp->xsk.queue_id);
+ if (!rxq->peer)
+ return -EOPNOTSUPP;
+
+ phys = rxq->peer->dev;
+ if (!phys->netdev_ops->ndo_bpf ||
+ !phys->netdev_ops->ndo_xdp_xmit ||
+ !phys->netdev_ops->ndo_xsk_wakeup)
+ return -EOPNOTSUPP;
+
+ memcpy(&xdp_lower, xdp, sizeof(xdp_lower));
+ xdp_lower.xsk.queue_id = get_netdev_rx_queue_index(rxq->peer);
+ break;
+ case XDP_SETUP_PROG:
+ return -EPERM;
+ default:
+ return -EINVAL;
+ }
+
+ return phys->netdev_ops->ndo_bpf(phys, &xdp_lower);
+}
+
+static int netkit_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
+{
+ struct netdev_rx_queue *rxq;
+ struct net_device *phys;
+
+ if (queue_id >= dev->real_num_rx_queues)
+ return -EINVAL;
+
+ rxq = __netif_get_rx_queue(dev, queue_id);
+ if (!rxq->peer)
+ return -EOPNOTSUPP;
+
+ phys = rxq->peer->dev;
+ if (!phys->netdev_ops->ndo_xsk_wakeup)
+ return -EOPNOTSUPP;
+
+ return phys->netdev_ops->ndo_xsk_wakeup(phys,
+ get_netdev_rx_queue_index(rxq->peer), flags);
+}
+
+static bool netkit_xdp_supported(const struct net_device *dev)
+{
+ bool xdp_ok = IS_ENABLED(CONFIG_XDP_SOCKETS);
+
+ if (!dev->netdev_ops->ndo_bpf ||
+ !dev->netdev_ops->ndo_xdp_xmit ||
+ !dev->netdev_ops->ndo_xsk_wakeup)
+ xdp_ok = false;
+ if ((dev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK)
+ xdp_ok = false;
+ return xdp_ok;
+}
+
+static void netkit_expose_xdp(struct net_device *dev, bool xdp_ok,
+ u32 xdp_zc_max_segs)
+{
+ if (xdp_ok) {
+ dev->xdp_zc_max_segs = xdp_zc_max_segs;
+ xdp_set_features_flag_locked(dev, NETDEV_XDP_ACT_XSK);
+ } else {
+ dev->xdp_zc_max_segs = 1;
+ xdp_set_features_flag_locked(dev, 0);
+ }
+}
+
+static void netkit_calculate_xdp(struct net_device *dev,
+ struct netdev_rx_queue *rxq, bool skip_rxq)
+{
+ struct netdev_rx_queue *src_rxq, *dst_rxq;
+ struct net_device *src_dev;
+ u32 xdp_zc_max_segs = ~0;
+ bool xdp_ok = false;
+ int i;
+
+ for (i = 1; i < dev->real_num_rx_queues; i++) {
+ dst_rxq = __netif_get_rx_queue(dev, i);
+ if (dst_rxq == rxq && skip_rxq)
+ continue;
+ src_rxq = dst_rxq->peer;
+ src_dev = src_rxq->dev;
+ xdp_zc_max_segs = min(xdp_zc_max_segs, src_dev->xdp_zc_max_segs);
+ xdp_ok = netkit_xdp_supported(src_dev) &&
+ (i == 1 ? true : xdp_ok);
+ }
+
+ netkit_expose_xdp(dev, xdp_ok, xdp_zc_max_segs);
+}
+
+static void netkit_peer_queues(struct net_device *dev,
+ struct netdev_rx_queue *rxq)
+{
+ netkit_calculate_xdp(dev, rxq, false);
+}
+
+static void netkit_unpeer_queues(struct net_device *dev,
+ struct netdev_rx_queue *rxq)
+{
+ netkit_calculate_xdp(dev, rxq, true);
+}
+
static void netkit_uninit(struct net_device *dev);
static const struct net_device_ops netkit_netdev_ops = {
@@ -247,6 +364,10 @@ static const struct net_device_ops netkit_netdev_ops = {
.ndo_get_peer_dev = netkit_peer_dev,
.ndo_get_stats64 = netkit_get_stats,
.ndo_uninit = netkit_uninit,
+ .ndo_peer_queues = netkit_peer_queues,
+ .ndo_unpeer_queues = netkit_unpeer_queues,
+ .ndo_bpf = netkit_xsk,
+ .ndo_xsk_wakeup = netkit_xsk_wakeup,
.ndo_features_check = passthru_features_check,
};
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (18 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications Daniel Borkmann
@ 2025-09-19 21:31 ` Daniel Borkmann
2025-09-22 17:09 ` Stanislav Fomichev
2025-09-22 12:05 ` [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Nikolay Aleksandrov
2025-09-23 1:59 ` Jakub Kicinski
21 siblings, 1 reply; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-19 21:31 UTC (permalink / raw)
To: netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
From: David Wei <dw@davidwei.uk>
Add a ynl sample application that calls bind-queue to bind a real rxq
to a mapped rxq in a virtual netdev.
# ethtool -X eth0 start 0 equal 15
# ethtool -X eth0 start 15 equal 1 context new
# ethtool --config-ntuple eth0 flow-type [...] action 15
# ip link add numrxqueues 2 nk type netkit single
# ethtool -l nk
Channel parameters for nk:
Pre-set maximums:
RX: 2
TX: 1
Other: n/a
Combined: 1
Current hardware settings:
RX: 1
TX: 1
Other: n/a
Combined: 0
# ip a
4: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e8:eb:d3:a3:43:f6 brd ff:ff:ff:ff:ff:ff
[...]
8: nk@NONE: <BROADCAST,MULTICAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
# ynl-bind eth0 15 nk
bound eth0, queue 15 to nk, queue 1
# ethtool -l nk
[...]
Current hardware settings:
RX: 2
TX: 1
Other: n/a
Combined: 0
Signed-off-by: David Wei <dw@davidwei.uk>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
tools/net/ynl/samples/bind.c | 56 ++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
create mode 100644 tools/net/ynl/samples/bind.c
diff --git a/tools/net/ynl/samples/bind.c b/tools/net/ynl/samples/bind.c
new file mode 100644
index 000000000000..a6426121cbd4
--- /dev/null
+++ b/tools/net/ynl/samples/bind.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <string.h>
+#include <assert.h>
+#include <ynl.h>
+#include <net/if.h>
+
+#include "netdev-user.h"
+
+int main(int argc, char **argv)
+{
+ struct netdev_bind_queue_req *req;
+ struct netdev_bind_queue_rsp *rsp;
+ char if_src[IF_NAMESIZE] = {};
+ char if_dst[IF_NAMESIZE] = {};
+ struct ynl_sock *ys;
+ struct ynl_error yerr;
+ int src_ifindex = 0, dst_ifindex = 0;
+ int src_queue_id = 0;
+
+ if (argc > 1)
+ src_ifindex = if_nametoindex(argv[1]);
+ if (argc > 2)
+ src_queue_id = strtol(argv[2], NULL, 0);
+ if (argc > 3)
+ dst_ifindex = if_nametoindex(argv[3]);
+
+ ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+ if (!ys) {
+ fprintf(stderr, "YNL: %s\n", yerr.msg);
+ return 1;
+ }
+
+ req = netdev_bind_queue_req_alloc();
+ netdev_bind_queue_req_set_src_ifindex(req, src_ifindex);
+ netdev_bind_queue_req_set_src_queue_id(req, src_queue_id);
+ netdev_bind_queue_req_set_dst_ifindex(req, dst_ifindex);
+
+ rsp = netdev_bind_queue(ys, req);
+ netdev_bind_queue_req_free(req);
+ if (!rsp)
+ goto err;
+
+ assert(rsp->_present.dst_queue_id);
+ printf("bound %s, queue %d to %s, queue %d\n",
+ if_indextoname(src_ifindex, if_src), src_queue_id,
+ if_indextoname(dst_ifindex, if_dst), rsp->dst_queue_id);
+
+ netdev_bind_queue_rsp_free(rsp);
+ ynl_sock_destroy(ys);
+ return 0;
+err:
+ fprintf(stderr, "YNL: %s\n", ys->err.msg);
+ ynl_sock_destroy(ys);
+ return 2;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP
2025-09-19 21:31 ` [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP Daniel Borkmann
@ 2025-09-22 3:17 ` zf
2025-09-22 16:23 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: zf @ 2025-09-22 3:17 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei, yangzhenze, Dongdong Wang
On 2025/9/20 05:31, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> This adds the last missing bit to netkit for supporting io_uring with
> zero-copy mode [0]. Up until this point it was not possible to consume
> io_uring zero-copy receive out of containers or Kubernetes Pods where
> applications run in their own network namespace.
>
> Thus, implement ndo_queue_get_dma_dev() in netkit to return the DMA device
> of the underlying physical device for the real rxq. This allows memory
> providers like io_uring zero-copy or devmem to bind to the physically
> mapped rxq in netkit.
>
> io_uring example with eth0 being a physical device with 16 queues, where
> netkit is bound to the last queue; iou-zcrx is the binary built from the
> kernel selftests.
> Flow steering to that queue is based on the service VIP:port of the
> server utilizing io_uring:
>
> # ethtool -X eth0 start 0 equal 15
> # ethtool -X eth0 start 15 equal 1 context new
> # ethtool --config-ntuple eth0 flow-type tcp4 dst-ip 1.2.3.4 dst-port 5000 action 15
> # ip netns add foo
> # ip link add numrxqueues 2 type netkit
> # ynl-bind eth0 15 nk0
> # ip link set nk0 netns foo
> # ip link set nk1 up
> # ip netns exec foo ip link set lo up
> # ip netns exec foo ip link set nk0 up
> # ip netns exec foo ip addr add 1.2.3.4/32 dev nk0
> [ ... setup routing etc to get external traffic into the netns ... ]
> # ip netns exec foo ./iou-zcrx -s -p 5000 -i nk0 -q 1
>
> Remote io_uring client:
>
> # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536
>
> We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
> 100G NIC as well as Broadcom BCM957504 (bnxt_en) 100G NIC, both
> supporting TCP header/data split. For Cilium, the plan is to open
> up support for io_uring in zero-copy mode for regular Kubernetes Pods
> when Cilium is configured with netkit datapath mode.
>
From what we have learned, mlx5 supports TCP header/data split starting
from ConnectX-7, relying on hardware rx GRO. I would like to ask: can
ConnectX-6 use TCP header/data split as well? Could you share your CX6
mlx5 driver and firmware information? I will test it. If CX6 can support
it, that would be even better for me. Thanks.
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://kernel-recipes.org/en/2024/schedule/efficient-zero-copy-networking-using-io_uring [0]
> ---
> drivers/net/netkit.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 27ff84833f28..5129b27a7c3c 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -274,6 +274,21 @@ static const struct ethtool_ops netkit_ethtool_ops = {
> .get_channels = netkit_get_channels,
> };
>
> +static struct device *netkit_queue_get_dma_dev(struct net_device *dev, int idx)
> +{
> + struct netdev_rx_queue *rxq, *peer_rxq;
> + unsigned int peer_idx;
> +
> + rxq = __netif_get_rx_queue(dev, idx);
> + if (!rxq->peer)
> + return NULL;
> +
> + peer_rxq = rxq->peer;
> + peer_idx = get_netdev_rx_queue_index(peer_rxq);
> +
> + return netdev_queue_get_dma_dev(peer_rxq->dev, peer_idx);
> +}
> +
> static int netkit_queue_create(struct net_device *dev)
> {
> struct netkit *nk = netkit_priv(dev);
> @@ -299,7 +314,8 @@ static int netkit_queue_create(struct net_device *dev)
> }
>
> static const struct netdev_queue_mgmt_ops netkit_queue_mgmt_ops = {
> - .ndo_queue_create = netkit_queue_create,
> + .ndo_queue_get_dma_dev = netkit_queue_get_dma_dev,
> + .ndo_queue_create = netkit_queue_create,
> };
>
> static struct net_device *netkit_alloc(struct nlattr *tb[],
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (19 preceding siblings ...)
2025-09-19 21:31 ` [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application Daniel Borkmann
@ 2025-09-22 12:05 ` Nikolay Aleksandrov
2025-09-23 1:59 ` Jakub Kicinski
21 siblings, 0 replies; 64+ messages in thread
From: Nikolay Aleksandrov @ 2025-09-22 12:05 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: bpf, kuba, davem, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 9/20/25 00:31, Daniel Borkmann wrote:
> Containers use virtual netdevs to route traffic from a physical netdev
> in the host namespace. They do not have access to the physical netdev
> in the host and thus can't use memory providers or AF_XDP that require
> reconfiguring/restarting queues in the physical netdev.
>
> This patchset adds the concept of queue peering to virtual netdevs that
> allow containers to use memory providers and AF_XDP at _native speed_!
> These mapped queues are bound to a real queue in a physical netdev and
> act as a proxy.
>
> Memory providers and AF_XDP operations takes an ifindex and queue id,
> so containers would pass in an ifindex for a virtual netdev and a queue
> id of a mapped queue, which then gets proxied to the underlying real
> queue. Peered queues are created and bound to a real queue atomically
> through a generic ynl netdev operation.
>
> We have implemented support for this concept in netkit and tested the
> latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
> (bnxt_en) 100G NICs. For more details see the individual patches.
>
> Daniel Borkmann (10):
> net: Add ndo_{peer,unpeer}_queues callback
> net, ethtool: Disallow mapped real rxqs to be resized
> xsk: Move NETDEV_XDP_ACT_ZC into generic header
> xsk: Move pool registration into single function
> xsk: Add small helper xp_pool_bindable
> xsk: Change xsk_rcv_check to check netdev/queue_id from pool
> xsk: Proxy pool management for mapped queues
> netkit: Add single device mode for netkit
> netkit: Document fast vs slowpath members via macros
> netkit: Add xsk support for af_xdp applications
>
> David Wei (10):
> net, ynl: Add bind-queue operation
> net: Add peer to netdev_rx_queue
> net: Add ndo_queue_create callback
> net, ynl: Implement netdev_nl_bind_queue_doit
> net, ynl: Add peer info to queue-get response
> net: Proxy net_mp_{open,close}_rxq for mapped queues
> netkit: Implement rtnl_link_ops->alloc
> netkit: Implement ndo_queue_create
> netkit: Add io_uring zero-copy support for TCP
> tools, ynl: Add queue binding ynl sample application
>
> Documentation/netlink/specs/netdev.yaml | 54 ++++
> drivers/net/netkit.c | 362 ++++++++++++++++++++----
> include/linux/netdevice.h | 15 +-
> include/net/netdev_queues.h | 1 +
> include/net/netdev_rx_queue.h | 55 ++++
> include/net/xdp_sock_drv.h | 8 +-
> include/uapi/linux/if_link.h | 6 +
> include/uapi/linux/netdev.h | 20 ++
> net/core/netdev-genl-gen.c | 14 +
> net/core/netdev-genl-gen.h | 1 +
> net/core/netdev-genl.c | 144 +++++++++-
> net/core/netdev_rx_queue.c | 15 +-
> net/ethtool/channels.c | 10 +-
> net/xdp/xsk.c | 27 +-
> net/xdp/xsk.h | 5 +-
> net/xdp/xsk_buff_pool.c | 29 +-
> tools/include/uapi/linux/netdev.h | 20 ++
> tools/net/ynl/samples/bind.c | 56 ++++
> 18 files changed, 750 insertions(+), 92 deletions(-)
> create mode 100644 tools/net/ynl/samples/bind.c
>
I have reviewed the set and it looks good to me. To be fair, I have reviewed
it privately before as well. I really like the changes; we have discussed some
of the ideas implemented here before. Personally, I especially like the io_uring
support and think that some new interesting use cases will come out of it.
Nice work, for the set:
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cheers,
Nik
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 09/20] xsk: Move NETDEV_XDP_ACT_ZC into generic header
2025-09-19 21:31 ` [PATCH net-next 09/20] xsk: Move NETDEV_XDP_ACT_ZC into generic header Daniel Borkmann
@ 2025-09-22 15:59 ` Maciej Fijalkowski
0 siblings, 0 replies; 64+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 15:59 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, magnus.karlsson, David Wei
On Fri, Sep 19, 2025 at 11:31:42PM +0200, Daniel Borkmann wrote:
> Move NETDEV_XDP_ACT_ZC into xdp_sock_drv.h header such that external code
> can reuse it, and rename it into more generic NETDEV_XDP_ACT_XSK.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> ---
> include/net/xdp_sock_drv.h | 4 ++++
> net/xdp/xsk_buff_pool.c | 6 +-----
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 513c8e9704f6..47120666d8d6 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -12,6 +12,10 @@
> #define XDP_UMEM_MIN_CHUNK_SHIFT 11
> #define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)
>
> +#define NETDEV_XDP_ACT_XSK (NETDEV_XDP_ACT_BASIC | \
> + NETDEV_XDP_ACT_REDIRECT | \
> + NETDEV_XDP_ACT_XSK_ZEROCOPY)
> +
> struct xsk_cb_desc {
> void *src;
> u8 off;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index aa9788f20d0d..26165baf99f4 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -158,10 +158,6 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
> }
> }
>
> -#define NETDEV_XDP_ACT_ZC (NETDEV_XDP_ACT_BASIC | \
> - NETDEV_XDP_ACT_REDIRECT | \
> - NETDEV_XDP_ACT_XSK_ZEROCOPY)
> -
> int xp_assign_dev(struct xsk_buff_pool *pool,
> struct net_device *netdev, u16 queue_id, u16 flags)
> {
> @@ -203,7 +199,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> /* For copy-mode, we are done. */
> return 0;
>
> - if ((netdev->xdp_features & NETDEV_XDP_ACT_ZC) != NETDEV_XDP_ACT_ZC) {
> + if ((netdev->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK) {
> err = -EOPNOTSUPP;
> goto err_unreg_pool;
> }
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 10/20] xsk: Move pool registration into single function
2025-09-19 21:31 ` [PATCH net-next 10/20] xsk: Move pool registration into single function Daniel Borkmann
@ 2025-09-22 16:01 ` Maciej Fijalkowski
2025-09-22 16:15 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 16:01 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, magnus.karlsson, David Wei
On Fri, Sep 19, 2025 at 11:31:43PM +0200, Daniel Borkmann wrote:
> Small refactor to move the pool registration into xsk_reg_pool_at_qid,
> such that the netdev and queue_id can be registered there. No change
> in functionality.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
> net/xdp/xsk.c | 5 +++++
> net/xdp/xsk_buff_pool.c | 16 +++-------------
> 2 files changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72e34bd2d925..82ad89f6ba35 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -141,6 +141,11 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
> dev->real_num_rx_queues,
> dev->real_num_tx_queues))
> return -EINVAL;
> + if (xsk_get_pool_from_qid(dev, queue_id))
> + return -EBUSY;
> +
> + pool->netdev = dev;
> + pool->queue_id = queue_id;
>
> if (queue_id < dev->real_num_rx_queues)
> dev->_rx[queue_id].pool = pool;
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 26165baf99f4..375696f895d4 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -169,32 +169,24 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>
> force_zc = flags & XDP_ZEROCOPY;
> force_copy = flags & XDP_COPY;
> -
> if (force_zc && force_copy)
> return -EINVAL;
>
> - if (xsk_get_pool_from_qid(netdev, queue_id))
> - return -EBUSY;
> -
> - pool->netdev = netdev;
> - pool->queue_id = queue_id;
> err = xsk_reg_pool_at_qid(netdev, pool, queue_id);
> if (err)
> return err;
>
> if (flags & XDP_USE_SG)
> pool->umem->flags |= XDP_UMEM_SG_FLAG;
> -
IMHO all of the stuff below looks like unnecessary code churn.
> if (flags & XDP_USE_NEED_WAKEUP)
> pool->uses_need_wakeup = true;
> - /* Tx needs to be explicitly woken up the first time. Also
> - * for supporting drivers that do not implement this
> - * feature. They will always have to call sendto() or poll().
> + /* Tx needs to be explicitly woken up the first time. Also
> + * for supporting drivers that do not implement this feature.
> + * They will always have to call sendto() or poll().
> */
> pool->cached_need_wakeup = XDP_WAKEUP_TX;
>
> dev_hold(netdev);
> -
> if (force_copy)
> /* For copy-mode, we are done. */
> return 0;
> @@ -203,12 +195,10 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> err = -EOPNOTSUPP;
> goto err_unreg_pool;
> }
> -
> if (netdev->xdp_zc_max_segs == 1 && (flags & XDP_USE_SG)) {
> err = -EOPNOTSUPP;
> goto err_unreg_pool;
> }
> -
> if (dev_get_min_mp_channel_count(netdev)) {
> err = -EBUSY;
> goto err_unreg_pool;
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 11/20] xsk: Add small helper xp_pool_bindable
2025-09-19 21:31 ` [PATCH net-next 11/20] xsk: Add small helper xp_pool_bindable Daniel Borkmann
@ 2025-09-22 16:03 ` Maciej Fijalkowski
2025-09-22 16:17 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Maciej Fijalkowski @ 2025-09-22 16:03 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, magnus.karlsson, David Wei
On Fri, Sep 19, 2025 at 11:31:44PM +0200, Daniel Borkmann wrote:
> Add another small helper called xp_pool_bindable and move the current
> dev_get_min_mp_channel_count test into this helper. Pass in the pool
> object, such that we derive the netdev from the prior registered pool.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
> net/xdp/xsk_buff_pool.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 375696f895d4..d2109d683fe5 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -54,6 +54,11 @@ int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
> return 0;
> }
>
> +static bool xp_pool_bindable(struct xsk_buff_pool *pool)
> +{
> + return dev_get_min_mp_channel_count(pool->netdev) == 0;
> +}
Is this really a must-have in this patchset? You don't seem to make use of
it anywhere.
> +
> struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> struct xdp_umem *umem)
> {
> @@ -199,7 +204,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> err = -EOPNOTSUPP;
> goto err_unreg_pool;
> }
> - if (dev_get_min_mp_channel_count(netdev)) {
> + if (!xp_pool_bindable(pool)) {
> err = -EBUSY;
> goto err_unreg_pool;
> }
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
2025-09-19 21:31 ` [PATCH net-next 01/20] net, ynl: Add bind-queue operation Daniel Borkmann
@ 2025-09-22 16:04 ` Stanislav Fomichev
2025-09-22 16:13 ` Daniel Borkmann
2025-09-23 1:17 ` Jakub Kicinski
1 sibling, 1 reply; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 16:04 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Add a ynl netdev family operation called bind-queue that _binds_ an
> rxq from a real netdev to a virtual netdev i.e. netkit or veth. This
> bound or _mapped_ rxq in the virtual netdev acts as a proxy for the
> parent real rxq, and can be used by processes running in a container
> to use memory providers (io_uring zero-copy rx or devmem) or AF_XDP.
> An early implementation had only driver-specific integration [0],
> but in order for other virtual devices to reuse, it makes sense to
> have this as a generic API.
>
> src-ifindex and src-queue-id is the real netdev and rxq respectively.
> dst-ifindex is the virtual netdev. Note that this op doesn't take
> dst-queue-id, because the expectation is that the op will _create_ a
> new rxq in the virtual netdev. The virtual netdev must have
> real_num_rx_queues less than num_rx_queues at the time of calling
> bind-queue.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
> ---
> Documentation/netlink/specs/netdev.yaml | 37 +++++++++++++++++++++++++
> include/uapi/linux/netdev.h | 11 ++++++++
> net/core/netdev-genl-gen.c | 14 ++++++++++
> net/core/netdev-genl-gen.h | 1 +
> net/core/netdev-genl.c | 4 +++
> tools/include/uapi/linux/netdev.h | 11 ++++++++
> 6 files changed, 78 insertions(+)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index e00d3fa1c152..99a430ea8a9a 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -561,6 +561,29 @@ attribute-sets:
> type: u32
> checks:
> min: 1
> + -
> + name: queue-pair
> + attributes:
> + -
> + name: src-ifindex
> + doc: netdev ifindex of the physical device
> + type: u32
> + checks:
> + min: 1
> + -
> + name: src-queue-id
> + doc: netdev queue id of the physical device
> + type: u32
> + -
> + name: dst-ifindex
> + doc: netdev ifindex of the virtual device
> + type: u32
> + checks:
> + min: 1
> + -
> + name: dst-queue-id
> + doc: netdev queue id of the virtual device
> + type: u32
>
> operations:
> list:
> @@ -772,6 +795,20 @@ operations:
> attributes:
> - id
>
> + -
> + name: bind-queue
> + doc: Bind a physical netdev queue to a virtual one
> + attribute-set: queue-pair
> + do:
> + request:
> + attributes:
> + - src-ifindex
> + - src-queue-id
> + - dst-ifindex
> + reply:
> + attributes:
> + - dst-queue-id
> +
> kernel-family:
> headers: ["net/netdev_netlink.h"]
> sock-priv: struct netdev_nl_sock
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index 48eb49aa03d4..05e17765a39d 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -210,6 +210,16 @@ enum {
> NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1)
> };
>
> +enum {
> + NETDEV_A_QUEUE_PAIR_SRC_IFINDEX = 1,
> + NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID,
> + NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
> + NETDEV_A_QUEUE_PAIR_DST_QUEUE_ID,
> +
> + __NETDEV_A_QUEUE_PAIR_MAX,
> + NETDEV_A_QUEUE_PAIR_MAX = (__NETDEV_A_QUEUE_PAIR_MAX - 1)
> +};
> +
> enum {
> NETDEV_CMD_DEV_GET = 1,
> NETDEV_CMD_DEV_ADD_NTF,
> @@ -226,6 +236,7 @@ enum {
> NETDEV_CMD_BIND_RX,
> NETDEV_CMD_NAPI_SET,
> NETDEV_CMD_BIND_TX,
> + NETDEV_CMD_BIND_QUEUE,
>
> __NETDEV_CMD_MAX,
> NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
> diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
> index e9a2a6f26cb7..10b2ab4dd500 100644
> --- a/net/core/netdev-genl-gen.c
> +++ b/net/core/netdev-genl-gen.c
> @@ -106,6 +106,13 @@ static const struct nla_policy netdev_bind_tx_nl_policy[NETDEV_A_DMABUF_FD + 1]
> [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, },
> };
>
> +/* NETDEV_CMD_BIND_QUEUE - do */
> +static const struct nla_policy netdev_bind_queue_nl_policy[NETDEV_A_QUEUE_PAIR_DST_IFINDEX + 1] = {
> + [NETDEV_A_QUEUE_PAIR_SRC_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> + [NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID] = { .type = NLA_U32, },
> + [NETDEV_A_QUEUE_PAIR_DST_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
> +};
> +
> /* Ops table for netdev */
> static const struct genl_split_ops netdev_nl_ops[] = {
> {
> @@ -204,6 +211,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
> .maxattr = NETDEV_A_DMABUF_FD,
> .flags = GENL_CMD_CAP_DO,
> },
> + {
> + .cmd = NETDEV_CMD_BIND_QUEUE,
> + .doit = netdev_nl_bind_queue_doit,
> + .policy = netdev_bind_queue_nl_policy,
> + .maxattr = NETDEV_A_QUEUE_PAIR_DST_IFINDEX,
> + .flags = GENL_CMD_CAP_DO,
> + },
> };
>
> static const struct genl_multicast_group netdev_nl_mcgrps[] = {
> diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
> index cf3fad74511f..309248fe2b9e 100644
> --- a/net/core/netdev-genl-gen.h
> +++ b/net/core/netdev-genl-gen.h
> @@ -35,6 +35,7 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
> int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
> int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
> int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info);
> +int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info);
>
> enum {
> NETDEV_NLGRP_MGMT,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 470fabbeacd9..b0aea27bf84e 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -1120,6 +1120,10 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
> return err;
> }
>
> +int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
> +{
nit: return 'not supported' for now or something similar?
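i.e. something like the below stub until the actual implementation lands
in the later patch (untested):

	int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
	{
		return -EOPNOTSUPP;
	}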
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 03/20] net: Add ndo_queue_create callback
2025-09-19 21:31 ` [PATCH net-next 03/20] net: Add ndo_queue_create callback Daniel Borkmann
@ 2025-09-22 16:04 ` Stanislav Fomichev
2025-09-22 16:14 ` Daniel Borkmann
2025-09-23 15:58 ` David Wei
2025-09-23 1:22 ` Jakub Kicinski
1 sibling, 2 replies; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 16:04 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Add ndo_queue_create() to netdev_queue_mgmt_ops that will create a new
> rxq specifically for mapping to a real rxq. The intent is for only
> virtual netdevs i.e. netkit and veth to implement this ndo. This will
> be called from ynl netdev fam bind-queue op to atomically create a
> mapped rxq and bind it to a real rxq.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> include/net/netdev_queues.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index cd00e0406cf4..6b0d2416728d 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
> int idx);
> struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
> int idx);
> + int (*ndo_queue_create)(struct net_device *dev);
kdoc is missing
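e.g. something along these lines in the comment block above the struct
(untested sketch):

 * @ndo_queue_create:	Create a new rx queue which can be peered with a
 *			real rx queue of a physical netdev. Returns the
 *			new queue id on success or a negative errno.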
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 05/20] net, ynl: Implement netdev_nl_bind_queue_doit
2025-09-19 21:31 ` [PATCH net-next 05/20] net, ynl: Implement netdev_nl_bind_queue_doit Daniel Borkmann
@ 2025-09-22 16:06 ` Stanislav Fomichev
2025-09-23 1:26 ` Jakub Kicinski
0 siblings, 1 reply; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 16:06 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Implement netdev_nl_bind_queue_doit() that creates a mapped rxq in a
> virtual netdev and then binds it to a real rxq in a physical netdev
> by setting the peer pointer in netdev_rx_queue.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> net/core/netdev-genl.c | 117 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 117 insertions(+)
>
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index b0aea27bf84e..ed0ce3dbfc6f 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -1122,6 +1122,123 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>
> int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
> {
> + u32 src_ifidx, src_qid, dst_ifidx, dst_qid;
> + struct netdev_rx_queue *src_rxq, *dst_rxq;
> + struct net_device *src_dev, *dst_dev;
> + struct netdev_nl_sock *priv;
> + struct sk_buff *rsp;
> + int err = 0;
> + void *hdr;
> +
> + if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_IFINDEX) ||
> + GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID) ||
> + GENL_REQ_ATTR_CHECK(info, NETDEV_A_QUEUE_PAIR_DST_IFINDEX))
> + return -EINVAL;
> +
> + src_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_IFINDEX]);
> + src_qid = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_SRC_QUEUE_ID]);
> + dst_ifidx = nla_get_u32(info->attrs[NETDEV_A_QUEUE_PAIR_DST_IFINDEX]);
> + if (dst_ifidx == src_ifidx) {
> + NL_SET_ERR_MSG(info->extack,
> + "Destination driver cannot be same as source driver");
> + return -EOPNOTSUPP;
> + }
> +
[..]
> + priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> + if (IS_ERR(priv))
> + return PTR_ERR(priv);
Why do you need genl_sk_priv_get and mutex_lock?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
2025-09-22 16:04 ` Stanislav Fomichev
@ 2025-09-22 16:13 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 16:13 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 9/22/25 6:04 PM, Stanislav Fomichev wrote:
[...]
>> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>> index 470fabbeacd9..b0aea27bf84e 100644
>> --- a/net/core/netdev-genl.c
>> +++ b/net/core/netdev-genl.c
>> @@ -1120,6 +1120,10 @@ int netdev_nl_bind_tx_doit(struct sk_buff *skb, struct genl_info *info)
>> return err;
>> }
>>
>> +int netdev_nl_bind_queue_doit(struct sk_buff *skb, struct genl_info *info)
>> +{
>
> nit: return 'not supported' for now or something similar?
yeap, will fix in v2, thx!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 03/20] net: Add ndo_queue_create callback
2025-09-22 16:04 ` Stanislav Fomichev
@ 2025-09-22 16:14 ` Daniel Borkmann
2025-09-23 15:58 ` David Wei
1 sibling, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 16:14 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 9/22/25 6:04 PM, Stanislav Fomichev wrote:
> On 09/19, Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> Add ndo_queue_create() to netdev_queue_mgmt_ops that will create a new
>> rxq specifically for mapping to a real rxq. The intent is for only
>> virtual netdevs i.e. netkit and veth to implement this ndo. This will
>> be called from ynl netdev fam bind-queue op to atomically create a
>> mapped rxq and bind it to a real rxq.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>> include/net/netdev_queues.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
>> index cd00e0406cf4..6b0d2416728d 100644
>> --- a/include/net/netdev_queues.h
>> +++ b/include/net/netdev_queues.h
>> @@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
>> int idx);
>> struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
>> int idx);
>> + int (*ndo_queue_create)(struct net_device *dev);
>
> kdoc is missing
same, will address in v2, thanks for spotting!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 10/20] xsk: Move pool registration into single function
2025-09-22 16:01 ` Maciej Fijalkowski
@ 2025-09-22 16:15 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 16:15 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, magnus.karlsson, David Wei
On 9/22/25 6:01 PM, Maciej Fijalkowski wrote:
> On Fri, Sep 19, 2025 at 11:31:43PM +0200, Daniel Borkmann wrote:
>> Small refactor to move the pool registration into xsk_reg_pool_at_qid,
>> such that the netdev and queue_id can be registered there. No change
>> in functionality.
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Co-developed-by: David Wei <dw@davidwei.uk>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> ---
>> net/xdp/xsk.c | 5 +++++
>> net/xdp/xsk_buff_pool.c | 16 +++-------------
>> 2 files changed, 8 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 72e34bd2d925..82ad89f6ba35 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -141,6 +141,11 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
>> dev->real_num_rx_queues,
>> dev->real_num_tx_queues))
>> return -EINVAL;
>> + if (xsk_get_pool_from_qid(dev, queue_id))
>> + return -EBUSY;
>> +
>> + pool->netdev = dev;
>> + pool->queue_id = queue_id;
>>
>> if (queue_id < dev->real_num_rx_queues)
>> dev->_rx[queue_id].pool = pool;
>> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
>> index 26165baf99f4..375696f895d4 100644
>> --- a/net/xdp/xsk_buff_pool.c
>> +++ b/net/xdp/xsk_buff_pool.c
>> @@ -169,32 +169,24 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>>
>> force_zc = flags & XDP_ZEROCOPY;
>> force_copy = flags & XDP_COPY;
>> -
>> if (force_zc && force_copy)
>> return -EINVAL;
>>
>> - if (xsk_get_pool_from_qid(netdev, queue_id))
>> - return -EBUSY;
>> -
>> - pool->netdev = netdev;
>> - pool->queue_id = queue_id;
>> err = xsk_reg_pool_at_qid(netdev, pool, queue_id);
>> if (err)
>> return err;
>>
>> if (flags & XDP_USE_SG)
>> pool->umem->flags |= XDP_UMEM_SG_FLAG;
>> -
>
> IMHO all of the stuff below looks like unnecessary code churn.
Ack, will drop it in v2, thx!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 11/20] xsk: Add small helper xp_pool_bindable
2025-09-22 16:03 ` Maciej Fijalkowski
@ 2025-09-22 16:17 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 16:17 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, magnus.karlsson, David Wei
On 9/22/25 6:03 PM, Maciej Fijalkowski wrote:
> On Fri, Sep 19, 2025 at 11:31:44PM +0200, Daniel Borkmann wrote:
>> Add another small helper called xp_pool_bindable and move the current
>> dev_get_min_mp_channel_count test into this helper. Pass in the pool
>> object, such that we derive the netdev from the previously registered pool.
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> Co-developed-by: David Wei <dw@davidwei.uk>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> ---
>> net/xdp/xsk_buff_pool.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
>> index 375696f895d4..d2109d683fe5 100644
>> --- a/net/xdp/xsk_buff_pool.c
>> +++ b/net/xdp/xsk_buff_pool.c
>> @@ -54,6 +54,11 @@ int xp_alloc_tx_descs(struct xsk_buff_pool *pool, struct xdp_sock *xs)
>> return 0;
>> }
>>
>> +static bool xp_pool_bindable(struct xsk_buff_pool *pool)
>> +{
>> + return dev_get_min_mp_channel_count(pool->netdev) == 0;
>> +}
>
> Is this really a must have in this patchset? You don't seem to make use of
> it anywhere.
That is needed given we need to look at the pool's netdev, which at that
point is the one of the physical device (rather than the virtual netdev
the request came in on).
>> struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>> struct xdp_umem *umem)
>> {
>> @@ -199,7 +204,7 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>> err = -EOPNOTSUPP;
>> goto err_unreg_pool;
>> }
>> - if (dev_get_min_mp_channel_count(netdev)) {
>> + if (!xp_pool_bindable(pool)) {
>> err = -EBUSY;
>> goto err_unreg_pool;
>> }
>> --
>> 2.43.0
>>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 18/20] netkit: Add io_uring zero-copy support for TCP
2025-09-22 3:17 ` zf
@ 2025-09-22 16:23 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 16:23 UTC (permalink / raw)
To: zf, netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei, yangzhenze, Dongdong Wang
On 9/22/25 5:17 AM, zf wrote:
> On 2025/9/20 05:31, Daniel Borkmann wrote:
[...]
>> Remote io_uring client:
>>
>> # ./iou-zcrx -c -h 1.2.3.4 -p 5000 -l 12840 -z 65536
>>
>> We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
>> 100G NIC as well as Broadcom BCM957504 (bnxt_en) 100G NIC, both
>> supporting TCP header/data split. For Cilium, the plan is to open
>> up support for io_uring in zero-copy mode for regular Kubernetes Pods
>> when Cilium is configured with netkit datapath mode.
>
> From what we have learned, mlx supports TCP header/data split starting from CX7, relying on the hw rx GRO. Can CX6 use TCP header/data split? Can you share your CX6's mlx driver and FW information? I will test it. If CX6 can support it, that would be even better for me. Thanks.
I'll double check with David, but this is a typo here and needs to say CX7;
the af-xdp work was done on CX6. We'll correct it in v2, thanks (& sorry for
the confusion)!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 08/20] net: Proxy net_mp_{open,close}_rxq for mapped queues
2025-09-19 21:31 ` [PATCH net-next 08/20] net: Proxy net_mp_{open,close}_rxq for mapped queues Daniel Borkmann
@ 2025-09-22 16:35 ` Stanislav Fomichev
0 siblings, 0 replies; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 16:35 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> When a process in a container wants to set up a memory provider, it will
> use the virtual netdev and a mapped rxq, and call net_mp_{open,close}_rxq
> to try and restart the queue. At this point, proxy the queue restart on
> the real rxq in the physical netdev.
>
> For memory providers (io_uring zero-copy rx and devmem), it causes the
> real rxq in the physical netdev to be filled from a memory provider that
> has DMA mapped memory from a process within a container.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> net/core/netdev_rx_queue.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index c7d9341b7630..238d3cd9677e 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -105,13 +105,21 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
>
> if (!netdev_need_ops_lock(dev))
> return -EOPNOTSUPP;
> -
> if (rxq_idx >= dev->real_num_rx_queues) {
> NL_SET_ERR_MSG(extack, "rx queue index out of range");
> return -ERANGE;
> }
> +
> rxq_idx = array_index_nospec(rxq_idx, dev->real_num_rx_queues);
> + rxq = __netif_get_rx_queue_peer(&dev, &rxq_idx);
>
> + /* Check again since dev might have changed */
> + if (!netdev_need_ops_lock(dev))
> + return -EOPNOTSUPP;
But if old dev != new dev, the new dev is not gonna be locked, right?
Are you not triggering netdev_assert_locked from
netdev_rx_queue_restart?
You might need to resolve the new dev+queue in the callers in order
to do proper locking.
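Completely untested sketch of what I mean (the helper name is made up),
i.e. resolve first, then take the instance lock on the netdev we actually
operate on:

    static int net_mp_open_rxq_resolved(struct net_device *dev,
                                        unsigned int rxq_idx,
                                        const struct pp_memory_provider_params *p,
                                        struct netlink_ext_ack *extack)
    {
            int err;

            /* may swap dev/rxq_idx to the physical netdev's real rxq;
             * __net_mp_open_rxq() would then not resolve internally
             */
            __netif_get_rx_queue_peer(&dev, &rxq_idx);

            netdev_lock(dev);
            err = __net_mp_open_rxq(dev, rxq_idx, p, extack);
            netdev_unlock(dev);
            return err;
    }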
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 13/20] xsk: Proxy pool management for mapped queues
2025-09-19 21:31 ` [PATCH net-next 13/20] xsk: Proxy pool management for mapped queues Daniel Borkmann
@ 2025-09-22 16:48 ` Stanislav Fomichev
2025-09-22 17:01 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 16:48 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> Similarly to what we do for net_mp_{open,close}_rxq for mapped queues,
> proxy also the xsk_{reg,clear}_pool_at_qid via __netif_get_rx_queue_peer
> such that when a virtual netdev picks a mapped rxq, the request gets
> through to the real rxq in the physical netdev.
>
> Change the function signatures for queue_id to unsigned int in order
> to pass the queue_id parameter into __netif_get_rx_queue_peer. The
> proxying is only relevant for queue_id < dev->real_num_rx_queues since
> right now it's only supported for rxqs.
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> ---
> include/net/xdp_sock_drv.h | 4 ++--
> net/xdp/xsk.c | 16 +++++++++++-----
> net/xdp/xsk.h | 5 ++---
> 3 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 47120666d8d6..709af292cba7 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -29,7 +29,7 @@ bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
> u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
> void xsk_tx_release(struct xsk_buff_pool *pool);
> struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
> - u16 queue_id);
> + unsigned int queue_id);
> void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool);
> void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool);
> void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool);
> @@ -286,7 +286,7 @@ static inline void xsk_tx_release(struct xsk_buff_pool *pool)
> }
>
> static inline struct xsk_buff_pool *
> -xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id)
> +xsk_get_pool_from_qid(struct net_device *dev, unsigned int queue_id)
> {
> return NULL;
> }
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cf40c70ee59f..b9efa6d8a112 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -23,6 +23,8 @@
> #include <linux/netdevice.h>
> #include <linux/rculist.h>
> #include <linux/vmalloc.h>
> +
> +#include <net/netdev_queues.h>
> #include <net/xdp_sock_drv.h>
> #include <net/busy_poll.h>
> #include <net/netdev_lock.h>
> @@ -111,19 +113,20 @@ bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool)
> EXPORT_SYMBOL(xsk_uses_need_wakeup);
>
> struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev,
> - u16 queue_id)
> + unsigned int queue_id)
> {
> if (queue_id < dev->real_num_rx_queues)
> return dev->_rx[queue_id].pool;
> if (queue_id < dev->real_num_tx_queues)
> return dev->_tx[queue_id].pool;
> -
> return NULL;
> }
> EXPORT_SYMBOL(xsk_get_pool_from_qid);
>
> -void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
> +void xsk_clear_pool_at_qid(struct net_device *dev, unsigned int queue_id)
> {
> + if (queue_id < dev->real_num_rx_queues)
> + __netif_get_rx_queue_peer(&dev, &queue_id);
> if (queue_id < dev->num_rx_queues)
> dev->_rx[queue_id].pool = NULL;
> if (queue_id < dev->num_tx_queues)
> @@ -135,7 +138,7 @@ void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id)
> * This might also change during run time.
> */
> int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
> - u16 queue_id)
> + unsigned int queue_id)
> {
> if (queue_id >= max_t(unsigned int,
> dev->real_num_rx_queues,
> @@ -143,6 +146,10 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
> return -EINVAL;
> if (xsk_get_pool_from_qid(dev, queue_id))
> return -EBUSY;
> + if (queue_id < dev->real_num_rx_queues)
> + __netif_get_rx_queue_peer(&dev, &queue_id);
> + if (xsk_get_pool_from_qid(dev, queue_id))
> + return -EBUSY;
>
> pool->netdev = dev;
> pool->queue_id = queue_id;
I feel like both of the above are also gonna be problematic wrt netdev
lock. The callers lock the netdev, the callers will also have
to resolve the virtual->real queue mapping. Hacking up the
queue/netdev deep in the call stack in a few places is not gonna work.
Maybe also add assert for the (new) netdev lock to __netif_get_rx_queue_peer
to trigger these.
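E.g. (guessing at the helper's shape from how it is called; untested):

    struct netdev_rx_queue *
    __netif_get_rx_queue_peer(struct net_device **dev, unsigned int *rxq_idx)
    {
            struct netdev_rx_queue *rxq, *peer;

            rxq = __netif_get_rx_queue(*dev, *rxq_idx);
            peer = rxq->peer;
            if (peer) {
                    *dev = peer->dev;
                    *rxq_idx = get_netdev_rx_queue_index(peer);
                    rxq = peer;
            }
            /* catches callers that only locked the virtual netdev */
            netdev_assert_locked(*dev);
            return rxq;
    }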
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 13/20] xsk: Proxy pool management for mapped queues
2025-09-22 16:48 ` Stanislav Fomichev
@ 2025-09-22 17:01 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-22 17:01 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 9/22/25 6:48 PM, Stanislav Fomichev wrote:
> On 09/19, Daniel Borkmann wrote:
[...]
>> @@ -143,6 +146,10 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
>> return -EINVAL;
>> if (xsk_get_pool_from_qid(dev, queue_id))
>> return -EBUSY;
>> + if (queue_id < dev->real_num_rx_queues)
>> + __netif_get_rx_queue_peer(&dev, &queue_id);
>> + if (xsk_get_pool_from_qid(dev, queue_id))
>> + return -EBUSY;
>>
>> pool->netdev = dev;
>> pool->queue_id = queue_id;
>
> I feel like both of the above are also gonna be problematic wrt netdev
> lock. The callers lock the netdev, the callers will also have
> to resolve the virtual->real queue mapping. Hacking up the
> queue/netdev deep in the call stack in a few places is not gonna work.
>
> Maybe also add assert for the (new) netdev lock to __netif_get_rx_queue_peer
> to trigger these.
Good idea, and I'll look into this, thx!
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application
2025-09-19 21:31 ` [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application Daniel Borkmann
@ 2025-09-22 17:09 ` Stanislav Fomichev
2025-09-23 16:12 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Stanislav Fomichev @ 2025-09-22 17:09 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On 09/19, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Add a ynl sample application that calls bind-queue to bind a real rxq
> to a mapped rxq in a virtual netdev.
Any reason ynl python cli is not enough? Can we use it instead and update the
respective instructions (example) in patch 19?
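E.g. something along these lines should do (untested, attribute names
taken from the spec in patch 1):

    $ ./tools/net/ynl/pyynl/cli.py --family netdev --do bind-queue \
          --json '{"src-ifindex": 4, "src-queue-id": 15, "dst-ifindex": 10}'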
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
2025-09-19 21:31 ` [PATCH net-next 01/20] net, ynl: Add bind-queue operation Daniel Borkmann
2025-09-22 16:04 ` Stanislav Fomichev
@ 2025-09-23 1:17 ` Jakub Kicinski
2025-09-23 16:13 ` David Wei
1 sibling, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:17 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:34 +0200 Daniel Borkmann wrote:
> Subject: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
We use "ynl" for changes to ynl itself. If you're just adding to
the YAML specs or using them, there's no need to mention YNL.
Please remove it from all the subjects.
> + -
> + name: queue-pair
> + attributes:
> + -
> + name: src-ifindex
> + doc: netdev ifindex of the physical device
> + type: u32
> + checks:
> + min: 1
max: s32-max ?
> + -
> + name: src-queue-id
> + doc: netdev queue id of the physical device
> + type: u32
> @@ -772,6 +795,20 @@ operations:
> attributes:
> - id
>
> + -
> + name: bind-queue
> + doc: Bind a physical netdev queue to a virtual one
Would be good to have a few sentences of documentation here.
All netdev APIs currently carry the queue id together with a queue type.
I'm guessing the next few patches explain, but whether you're attaching
rx, tx, or both should really be spelled out here :)
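Something along these lines perhaps (wording just a suggestion, assuming
rx-only as the later patches imply):

      -
        name: bind-queue
        doc: |
          Bind an rx queue of a physical netdev (src-ifindex, src-queue-id)
          to a mapped rx queue of a virtual netdev such as netkit
          (dst-ifindex). Only rx queues can be bound for now.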
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 02/20] net: Add peer to netdev_rx_queue
2025-09-19 21:31 ` [PATCH net-next 02/20] net: Add peer to netdev_rx_queue Daniel Borkmann
@ 2025-09-23 1:22 ` Jakub Kicinski
2025-09-23 15:56 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:22 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:35 +0200 Daniel Borkmann wrote:
> +static inline void netdev_rx_queue_peer(struct net_device *src_dev,
> + struct netdev_rx_queue *src_rxq,
> + struct netdev_rx_queue *dst_rxq)
> +{
> + dev_hold(src_dev);
netdev_hold() is required for all new code
> + __netdev_rx_queue_peer(src_rxq, dst_rxq);
Also please avoid static inlines if you need to call a func from
another header. It complicates header dependencies.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 03/20] net: Add ndo_queue_create callback
2025-09-19 21:31 ` [PATCH net-next 03/20] net: Add ndo_queue_create callback Daniel Borkmann
2025-09-22 16:04 ` Stanislav Fomichev
@ 2025-09-23 1:22 ` Jakub Kicinski
2025-09-23 15:58 ` David Wei
1 sibling, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:22 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:36 +0200 Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Add ndo_queue_create() to netdev_queue_mgmt_ops that will create a new
> rxq specifically for mapping to a real rxq. The intent is for only
> virtual netdevs, i.e. netkit and veth, to implement this ndo. This will
> be called from the ynl netdev family bind-queue op to atomically create a
> mapped rxq and bind it to a real rxq.
>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> include/net/netdev_queues.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index cd00e0406cf4..6b0d2416728d 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
> int idx);
> struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
> int idx);
> + int (*ndo_queue_create)(struct net_device *dev);
> };
>
> bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
This patch is meaningless, please squash it into something that matters.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 04/20] net: Add ndo_{peer,unpeer}_queues callback
2025-09-19 21:31 ` [PATCH net-next 04/20] net: Add ndo_{peer,unpeer}_queues callback Daniel Borkmann
@ 2025-09-23 1:23 ` Jakub Kicinski
2025-09-23 16:06 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:23 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:37 +0200 Daniel Borkmann wrote:
> Add ndo_{peer,unpeer}_queues() callback which can be used by virtual drivers
> that implement rxq mapping to a real rxq to update their internal state or
> exposed capability flags from the set of rxq mappings.
Why is this something that virtual drivers implement?
I'd think that queue forwarding can be almost entirely implemented
in the core.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 05/20] net, ynl: Implement netdev_nl_bind_queue_doit
2025-09-22 16:06 ` Stanislav Fomichev
@ 2025-09-23 1:26 ` Jakub Kicinski
2025-09-23 16:06 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:26 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Daniel Borkmann, netdev, bpf, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson, David Wei
On Mon, 22 Sep 2025 09:06:51 -0700 Stanislav Fomichev wrote:
> > + priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
> > + if (IS_ERR(priv))
> > + return PTR_ERR(priv);
>
> Why do you need genl_sk_priv_get and mutex_lock?
+1
Also you're taking the instance lock on two netdev instances,
how will this not deadlock? :$
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 06/20] net, ynl: Add peer info to queue-get response
2025-09-19 21:31 ` [PATCH net-next 06/20] net, ynl: Add peer info to queue-get response Daniel Borkmann
@ 2025-09-23 1:32 ` Jakub Kicinski
2025-09-23 16:08 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:32 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:39 +0200 Daniel Borkmann wrote:
> + name: peer-info
> + attributes:
> + -
> + name: id
> + doc: Queue index of the netdevice to which the peer queue belongs.
> + type: u32
> + -
> + name: ifindex
> + doc: ifindex of the netdevice to which the peer queue belongs.
> + type: u32
Oh, we have an ifindex in the local netns. So the API is to bind a
queue to one side of a netkit and then the other side of the netkit
actually gets to use it? Should we not be "binding" to the device that
is of interest rather than its peer?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 07/20] net, ethtool: Disallow mapped real rxqs to be resized
2025-09-19 21:31 ` [PATCH net-next 07/20] net, ethtool: Disallow mapped real rxqs to be resized Daniel Borkmann
@ 2025-09-23 1:34 ` Jakub Kicinski
2025-09-23 1:38 ` Jakub Kicinski
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:34 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, 19 Sep 2025 23:31:40 +0200 Daniel Borkmann wrote:
> Similar to AF_XDP, do not allow queues in a physical netdev to be
> resized by ethtool -L when they are peered.
I think we need the same thing for the ioctl path.
Let's factor the checks out to a helper in net/ethtool/common.c ?
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 07/20] net, ethtool: Disallow mapped real rxqs to be resized
2025-09-23 1:34 ` Jakub Kicinski
@ 2025-09-23 1:38 ` Jakub Kicinski
2025-09-23 16:08 ` David Wei
0 siblings, 1 reply; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:38 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On Mon, 22 Sep 2025 18:34:49 -0700 Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:40 +0200 Daniel Borkmann wrote:
> > Similar to AF_XDP, do not allow queues in a physical netdev to be
> > resized by ethtool -L when they are peered.
>
> I think we need the same thing for the ioctl path.
> Let's factor the checks out to a helper in net/ethtool/common.c ?
And/or add a helper to check if an Rx Queue is "busy" (af_xdp || mp ||
peer'ed) cause we seem to be checking those three things in multiple
places.
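Untested, and the name is just a placeholder, but roughly:

    static bool netif_rxq_busy(struct net_device *dev, unsigned int idx)
    {
            struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, idx);

            return xsk_get_pool_from_qid(dev, idx) ||  /* af_xdp */
                   rxq->mp_params.mp_ops ||            /* memory provider */
                   rxq->peer;                          /* peer'ed */
    }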
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP
2025-09-19 21:31 [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Daniel Borkmann
` (20 preceding siblings ...)
2025-09-22 12:05 ` [PATCH net-next 00/20] netkit: Support for io_uring zero-copy and AF_XDP Nikolay Aleksandrov
@ 2025-09-23 1:59 ` Jakub Kicinski
21 siblings, 0 replies; 64+ messages in thread
From: Jakub Kicinski @ 2025-09-23 1:59 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On Fri, 19 Sep 2025 23:31:33 +0200 Daniel Borkmann wrote:
> We have implemented support for this concept in netkit and tested the
> latter against Nvidia ConnectX-6 (mlx5) as well as Broadcom BCM957504
> (bnxt_en) 100G NICs. For more details see the individual patches.
at high level
- not sure how instance locking is going to work here
- integration with other queue related APIs is missing (stats and
upcoming config API)
- the model of "allocating a queue" needs careful thought, the model
of bumping the real num rx on the remote is fine here but it will
not work for real HW queue alloc
- I'd have expected more of the code to live in the core vs so much
handling in netkit
- we need selftests (while the sample is unnecessary)
- last but not least - I recommend
https://lore.kernel.org/all/20250912095730.1efaac16@kernel.org/
;)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications
2025-09-19 21:31 ` [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications Daniel Borkmann
@ 2025-09-23 11:42 ` Toke Høiland-Jørgensen
2025-09-24 10:41 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-09-23 11:42 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Daniel Borkmann <daniel@iogearbox.net> writes:
> Enable support for AF_XDP applications to operate on a netkit device.
> The goal is that AF_XDP applications can natively consume AF_XDP
> from network namespaces. The use-case from Cilium side is to support
> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
> virtual machine management add-on for Kubernetes which aims to provide
> a common ground for virtualization. KubeVirt spawns the VMs inside
> Kubernetes Pods which reside in their own network namespace just like
> regular Pods.
>
> Raw QEMU AF_XDP backend example with eth0 being a physical device with
> 16 queues where netkit is bound to the last queue (for multi-queue RSS
> context can be used if supported by the driver):
>
> # ethtool -X eth0 start 0 equal 15
> # ethtool -X eth0 start 15 equal 1 context new
> # ethtool --config-ntuple eth0 flow-type ether \
> src 00:00:00:00:00:00 \
> src-mask ff:ff:ff:ff:ff:ff \
> dst $mac dst-mask 00:00:00:00:00:00 \
> proto 0 proto-mask 0xffff action 15
> # ip netns add foo
> # ip link add numrxqueues 2 nk type netkit single
> # ynl-bind eth0 15 nk
> # ip link set nk netns foo
> # ip netns exec foo ip link set lo up
> # ip netns exec foo ip link set nk up
> # ip netns exec foo qemu-system-x86_64 \
> -kernel $kernel \
> -drive file=${image_name},index=0,media=disk,format=raw \
> -append "root=/dev/sda rw console=ttyS0" \
> -cpu host \
> -m $memory \
> -enable-kvm \
> -device virtio-net-pci,netdev=net0,mac=$mac \
> -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
> -nographic
So AFAICT, this example relies on the control plane installing an XDP
program on the physical NIC which will redirect into the right socket;
and since in this example, qemu will install the XSK socket at index 1
in the xsk map, that XDP program will also need to be aware of the queue
index mapping. I can see from your qemu commit[0] that there's support
on the qemu side for specifying an offset into the map to avoid having
to do this translation in the XDP program, but at the very least that
makes this example incomplete, no?
However, even with a complete example, this breaks isolation in the
sense that the entire XSK map is visible inside the pod, so a
misbehaving qemu could interfere with traffic on other queues (by
clearing the map, say). Which seems less than ideal?
Taking a step back, for AF_XDP we already support decoupling the
application-side access to the redirected packets from the interface,
through the use of sockets. Meaning that your use case here could just
as well be served by the control plane setting up AF_XDP socket(s) on
the physical NIC and passing those into qemu, in which case we don't
need this whole queue proxying dance at all.
So, erm, what am I missing that makes this worth it (for AF_XDP; I can
see how it is useful for other things)? :)
-Toke
[0] https://gitlab.com/qemu-project/qemu/-/commit/e53d9ec7ccc2dbb9378353fe2a89ebdca5cd7015
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 02/20] net: Add peer to netdev_rx_queue
2025-09-23 1:22 ` Jakub Kicinski
@ 2025-09-23 15:56 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 15:56 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:22, Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:35 +0200 Daniel Borkmann wrote:
>> +static inline void netdev_rx_queue_peer(struct net_device *src_dev,
>> + struct netdev_rx_queue *src_rxq,
>> + struct netdev_rx_queue *dst_rxq)
>> +{
>> + dev_hold(src_dev);
>
> netdev_hold() is required for all new code
Got it, will update.
>
>> + __netdev_rx_queue_peer(src_rxq, dst_rxq);
>
> Also please avoid static inlines if you need to call a func from
> another header. It complicates header dependencies.
Didn't know this, thanks, will fix.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 03/20] net: Add ndo_queue_create callback
2025-09-22 16:04 ` Stanislav Fomichev
2025-09-22 16:14 ` Daniel Borkmann
@ 2025-09-23 15:58 ` David Wei
1 sibling, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 15:58 UTC (permalink / raw)
To: Stanislav Fomichev, Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson
On 2025-09-22 09:04, Stanislav Fomichev wrote:
> On 09/19, Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> Add ndo_queue_create() to netdev_queue_mgmt_ops that will create a new
>> rxq specifically for mapping to a real rxq. The intent is for only
>> virtual netdevs, i.e. netkit and veth, to implement this ndo. This will
>> be called from the ynl netdev family bind-queue op to atomically create a
>> mapped rxq and bind it to a real rxq.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>> include/net/netdev_queues.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
>> index cd00e0406cf4..6b0d2416728d 100644
>> --- a/include/net/netdev_queues.h
>> +++ b/include/net/netdev_queues.h
>> @@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
>> int idx);
>> struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
>> int idx);
>> + int (*ndo_queue_create)(struct net_device *dev);
>
> kdoc is missing
Will add. This was meant to be an RFC so I didn't write one - then it
became a proper patchset.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 03/20] net: Add ndo_queue_create callback
2025-09-23 1:22 ` Jakub Kicinski
@ 2025-09-23 15:58 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 15:58 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:22, Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:36 +0200 Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> Add ndo_queue_create() to netdev_queue_mgmt_ops that will create a new
>> rxq specifically for mapping to a real rxq. The intent is for only
>> virtual netdevs, i.e. netkit and veth, to implement this ndo. This will
>> be called from the ynl netdev family bind-queue op to atomically create a
>> mapped rxq and bind it to a real rxq.
>>
>> Signed-off-by: David Wei <dw@davidwei.uk>
>> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>> include/net/netdev_queues.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
>> index cd00e0406cf4..6b0d2416728d 100644
>> --- a/include/net/netdev_queues.h
>> +++ b/include/net/netdev_queues.h
>> @@ -149,6 +149,7 @@ struct netdev_queue_mgmt_ops {
>> int idx);
>> struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
>> int idx);
>> + int (*ndo_queue_create)(struct net_device *dev);
>> };
>>
>> bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
>
> This patch is meaningless, please squash it into something that matters.
(Y)
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 04/20] net: Add ndo_{peer,unpeer}_queues callback
2025-09-23 1:23 ` Jakub Kicinski
@ 2025-09-23 16:06 ` David Wei
2025-09-23 16:26 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: David Wei @ 2025-09-23 16:06 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:23, Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:37 +0200 Daniel Borkmann wrote:
>> Add ndo_{peer,unpeer}_queues() callback which can be used by virtual drivers
>> that implement rxq mapping to a real rxq to update their internal state or
>> exposed capability flags from the set of rxq mappings.
>
> Why is this something that virtual drivers implement?
> I'd think that queue forwarding can be almost entirely implemented
> in the core.
I believe Daniel needs it for AF_XDP.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 05/20] net, ynl: Implement netdev_nl_bind_queue_doit
2025-09-23 1:26 ` Jakub Kicinski
@ 2025-09-23 16:06 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 16:06 UTC (permalink / raw)
To: Jakub Kicinski, Stanislav Fomichev
Cc: Daniel Borkmann, netdev, bpf, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson
On 2025-09-22 18:26, Jakub Kicinski wrote:
> On Mon, 22 Sep 2025 09:06:51 -0700 Stanislav Fomichev wrote:
>>> + priv = genl_sk_priv_get(&netdev_nl_family, NETLINK_CB(skb).sk);
>>> + if (IS_ERR(priv))
>>> + return PTR_ERR(priv);
>>
>> Why do you need genl_sk_priv_get and mutex_lock?
>
> +1
>
> Also you're taking the instance lock on two netdev instances,
> how will this not deadlock? :$
Yeah... Sorry, we'll need to rethink locking in this function.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 06/20] net, ynl: Add peer info to queue-get response
2025-09-23 1:32 ` Jakub Kicinski
@ 2025-09-23 16:08 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 16:08 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:32, Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:39 +0200 Daniel Borkmann wrote:
>> + name: peer-info
>> + attributes:
>> + -
>> + name: id
>> + doc: Queue index of the netdevice to which the peer queue belongs.
>> + type: u32
>> + -
>> + name: ifindex
>> + doc: ifindex of the netdevice to which the peer queue belongs.
>> + type: u32
>
> Oh, we have an ifindex in the local netns. So the API is to bind a
> queue to one side of a netkit and then the other side of the netkit
> actually gets to use it? Should we not be "binding" to the device that
> is of interest rather than its peer?
We are binding from a netkit queue to a physical netdev queue of
interest.
Sorry, the terminology in this patchset is clearly inconsistent and
confusing. Will address in v2.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 07/20] net, ethtool: Disallow mapped real rxqs to be resized
2025-09-23 1:38 ` Jakub Kicinski
@ 2025-09-23 16:08 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 16:08 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:38, Jakub Kicinski wrote:
> On Mon, 22 Sep 2025 18:34:49 -0700 Jakub Kicinski wrote:
>> On Fri, 19 Sep 2025 23:31:40 +0200 Daniel Borkmann wrote:
>>> Similar to AF_XDP, do not allow queues in a physical netdev to be
>>> resized by ethtool -L when they are peered.
>>
>> I think we need the same thing for the ioctl path.
>> Let's factor the checks out to a helper in net/ethtool/common.c ?
>
> And/or add a helper to check if an Rx Queue is "busy" (af_xdp || mp ||
> peer'ed) cause we seem to be checking those three things in multiple
> places.
Sounds good, will add.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 20/20] tools, ynl: Add queue binding ynl sample application
2025-09-22 17:09 ` Stanislav Fomichev
@ 2025-09-23 16:12 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 16:12 UTC (permalink / raw)
To: Stanislav Fomichev, Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, jordan, maciej.fijalkowski,
magnus.karlsson
On 2025-09-22 10:09, Stanislav Fomichev wrote:
> On 09/19, Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> Add a ynl sample application that calls bind-queue to bind a real rxq
>> to a mapped rxq in a virtual netdev.
>
> Any reason ynl python cli is not enough? Can we use it instead and update the
> respective instructions (example) in patch 19?
Easier and more portable for my testing to move this binary around
for... reasons. Happy to drop and use Python in v2.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
2025-09-23 1:17 ` Jakub Kicinski
@ 2025-09-23 16:13 ` David Wei
0 siblings, 0 replies; 64+ messages in thread
From: David Wei @ 2025-09-23 16:13 UTC (permalink / raw)
To: Jakub Kicinski, Daniel Borkmann
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 2025-09-22 18:17, Jakub Kicinski wrote:
> On Fri, 19 Sep 2025 23:31:34 +0200 Daniel Borkmann wrote:
>> Subject: [PATCH net-next 01/20] net, ynl: Add bind-queue operation
>
> We use "ynl" for changes to ynl itself. If you're just adding to
> the YAML specs or using them there's no need to mention YNL.
> Please remove in all the subjects.
>
>> + -
>> + name: queue-pair
>> + attributes:
>> + -
>> + name: src-ifindex
>> + doc: netdev ifindex of the physical device
>> + type: u32
>> + checks:
>> + min: 1
>
> max: s32-max ?
>
>> + -
>> + name: src-queue-id
>> + doc: netdev queue id of the physical device
>> + type: u32
>
>
>> @@ -772,6 +795,20 @@ operations:
>> attributes:
>> - id
>>
>> + -
>> + name: bind-queue
>> + doc: Bind a physical netdev queue to a virtual one
>
> Would be good to have a few sentences of documentation here.
> All netdev APIs currently carry queue id with type.
> I'm guessing the next few patches would explain but whether
> you're attaching rx, tx, or both should really be explained here :)
Got it, will expand the docs.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 04/20] net: Add ndo_{peer,unpeer}_queues callback
2025-09-23 16:06 ` David Wei
@ 2025-09-23 16:26 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-23 16:26 UTC (permalink / raw)
To: David Wei, Jakub Kicinski
Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson
On 9/23/25 6:06 PM, David Wei wrote:
> On 2025-09-22 18:23, Jakub Kicinski wrote:
>> On Fri, 19 Sep 2025 23:31:37 +0200 Daniel Borkmann wrote:
>>> Add ndo_{peer,unpeer}_queues() callback which can be used by virtual drivers
>>> that implement rxq mapping to a real rxq to update their internal state or
>>> exposed capability flags from the set of rxq mappings.
>>
>> Why is this something that virtual drivers implement?
>> I'd think that queue forwarding can be almost entirely implemented
>> in the core.
>
> I believe Daniel needs it for AF_XDP.
Yes, in the case of af_xdp we basically need to propagate the related
capabilities of the physical netdev, so that we can expose the given xdp
flags further up to netkit, which implements ndo_bpf etc. Thinking about it,
maybe an alternative could be that netkit always exposes NETDEV_XDP_ACT_XSK
etc, and we catch it in netkit's ndo_bpf + ndo_xsk_wakeup implementations
when checking the peer queue's dev, and let it fail there instead. I'll play
a bit with this idea, perhaps it simplifies things.
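Rough shape of that alternative (untested, helper name made up):

    /* netkit would then always advertise NETDEV_XDP_ACT_XSK_ZEROCOPY and
     * only reject late, once we know whether the mapped rxq is backed by
     * a physical dev with the needed features.
     */
    static int netkit_xsk_check(struct net_device *dev, u16 qid)
    {
            struct net_device *phys = dev;
            unsigned int idx = qid;

            __netif_get_rx_queue_peer(&phys, &idx);
            if (phys == dev)        /* no real rxq bound */
                    return -EOPNOTSUPP;
            if (!(phys->xdp_features & NETDEV_XDP_ACT_XSK_ZEROCOPY))
                    return -EOPNOTSUPP;
            return 0;
    }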
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications
2025-09-23 11:42 ` Toke Høiland-Jørgensen
@ 2025-09-24 10:41 ` Daniel Borkmann
2025-09-26 8:55 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-24 10:41 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
On 9/23/25 1:42 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
>
>> Enable support for AF_XDP applications to operate on a netkit device.
>> The goal is that AF_XDP applications can natively consume AF_XDP
>> from network namespaces. The use-case from Cilium side is to support
>> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
>> virtual machine management add-on for Kubernetes which aims to provide
>> a common ground for virtualization. KubeVirt spawns the VMs inside
>> Kubernetes Pods which reside in their own network namespace just like
>> regular Pods.
>>
>> Raw QEMU AF_XDP backend example with eth0 being a physical device with
>> 16 queues where netkit is bound to the last queue (for multi-queue RSS
>> context can be used if supported by the driver):
>>
>> # ethtool -X eth0 start 0 equal 15
>> # ethtool -X eth0 start 15 equal 1 context new
>> # ethtool --config-ntuple eth0 flow-type ether \
>> src 00:00:00:00:00:00 \
>> src-mask ff:ff:ff:ff:ff:ff \
>> dst $mac dst-mask 00:00:00:00:00:00 \
>> proto 0 proto-mask 0xffff action 15
>> # ip netns add foo
>> # ip link add numrxqueues 2 nk type netkit single
>> # ynl-bind eth0 15 nk
>> # ip link set nk netns foo
>> # ip netns exec foo ip link set lo up
>> # ip netns exec foo ip link set nk up
>> # ip netns exec foo qemu-system-x86_64 \
>> -kernel $kernel \
>> -drive file=${image_name},index=0,media=disk,format=raw \
>> -append "root=/dev/sda rw console=ttyS0" \
>> -cpu host \
>> -m $memory \
>> -enable-kvm \
>> -device virtio-net-pci,netdev=net0,mac=$mac \
>> -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
>> -nographic
>
> So AFAICT, this example relies on the control plane installing an XDP
> program on the physical NIC which will redirect into the right socket;
> and since in this example, qemu will install the XSK socket at index 1
> in the xsk map, that XDP program will also need to be aware of the queue
> index mapping. I can see from your qemu commit[0] that there's support
> on the qemu side for specifying an offset into the map to avoid having
> to do this translation in the XDP program, but at the very least that
> makes this example incomplete, no?
>
> However, even with a complete example, this breaks isolation in the
> sense that the entire XSK map is visible inside the pod, so a
> misbehaving qemu could interfere with traffic on other queues (by
> clearing the map, say). Which seems less than ideal?
For getting to a first starting point to connect all things with KubeVirt,
bind mounting the xsk map from Cilium into the VM launcher Pod (which acts
as a regular K8s Pod) is not perfect, but it's not a big issue given it's
out of reach of the application sitting inside the VM (and some of the
control plane aspects are baked into the launcher Pod already), so the
isolation barrier is still the VM. Eventually my goal is to have an xdp/xsk
redirect extension where we don't need the xsk map, and can just derive the
target xsk from the rxq we received the traffic on.
> Taking a step back, for AF_XDP we already support decoupling the
> application-side access to the redirected packets from the interface,
> through the use of sockets. Meaning that your use case here could just
> as well be served by the control plane setting up AF_XDP socket(s) on
> the physical NIC and passing those into qemu, in which case we don't
> need this whole queue proxying dance at all.
Cilium should not act as a proxy handing out xsk sockets. Existing
applications expect a netdev from the kernel side and should not need to be
rewritten just to implement one CNI's protocol. Also, the memory should not
be accounted against Cilium but rather against the application Pod itself
which is consuming af_xdp. Further, on up/downgrades we expect the data
plane to be completely decoupled from the control plane; if Cilium owned
the sockets, that would be disruptive, which is a no-go.
> So, erm, what am I missing that makes this worth it (for AF_XDP; I can
> see how it is useful for other things)? :)
Yeap there are other use cases we've seen from Cilium users as well,
e.g. running dpdk applications on top of af_xdp in regular k8s Pods.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications
2025-09-24 10:41 ` Daniel Borkmann
@ 2025-09-26 8:55 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 64+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-09-26 8:55 UTC (permalink / raw)
To: Daniel Borkmann, netdev
Cc: bpf, kuba, davem, razor, pabeni, willemb, sdf, john.fastabend,
martin.lau, jordan, maciej.fijalkowski, magnus.karlsson,
David Wei
Daniel Borkmann <daniel@iogearbox.net> writes:
> On 9/23/25 1:42 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>
>>> Enable support for AF_XDP applications to operate on a netkit device.
>>> The goal is that AF_XDP applications can natively consume AF_XDP
>>> from network namespaces. The use-case from Cilium side is to support
>>> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
>>> virtual machine management add-on for Kubernetes which aims to provide
>>> a common ground for virtualization. KubeVirt spawns the VMs inside
>>> Kubernetes Pods which reside in their own network namespace just like
>>> regular Pods.
>>>
>>> Raw QEMU AF_XDP backend example with eth0 being a physical device with
>>> 16 queues where netkit is bound to the last queue (for multi-queue RSS
>>> context can be used if supported by the driver):
>>>
>>> # ethtool -X eth0 start 0 equal 15
>>> # ethtool -X eth0 start 15 equal 1 context new
>>> # ethtool --config-ntuple eth0 flow-type ether \
>>> src 00:00:00:00:00:00 \
>>> src-mask ff:ff:ff:ff:ff:ff \
>>> dst $mac dst-mask 00:00:00:00:00:00 \
>>> proto 0 proto-mask 0xffff action 15
>>> # ip netns add foo
>>> # ip link add numrxqueues 2 nk type netkit single
>>> # ynl-bind eth0 15 nk
>>> # ip link set nk netns foo
>>> # ip netns exec foo ip link set lo up
>>> # ip netns exec foo ip link set nk up
>>> # ip netns exec foo qemu-system-x86_64 \
>>> -kernel $kernel \
>>> -drive file=${image_name},index=0,media=disk,format=raw \
>>> -append "root=/dev/sda rw console=ttyS0" \
>>> -cpu host \
>>> -m $memory \
>>> -enable-kvm \
>>> -device virtio-net-pci,netdev=net0,mac=$mac \
>>> -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
>>> -nographic
>>
>> So AFAICT, this example relies on the control plane installing an XDP
>> program on the physical NIC which will redirect into the right socket;
>> and since in this example, qemu will install the XSK socket at index 1
>> in the xsk map, that XDP program will also need to be aware of the queue
>> index mapping. I can see from your qemu commit[0] that there's support
>> on the qemu side for specifying an offset into the map to avoid having
>> to do this translation in the XDP program, but at the very least that
>> makes this example incomplete, no?
>>
>> However, even with a complete example, this breaks isolation in the
>> sense that the entire XSK map is visible inside the pod, so a
>> misbehaving qemu could interfere with traffic on other queues (by
>> clearing the map, say). Which seems less than ideal?
>
> For getting to a first starting point to connect all things with KubeVirt,
> bind mounting the xsk map from Cilium into the VM launcher Pod (which acts
> as a regular K8s Pod) is not perfect, but it's not a big issue given it's
> out of reach of the application sitting inside the VM (and some of the
> control plane aspects are baked into the launcher Pod already), so the
> isolation barrier is still the VM. Eventually my goal is to have an xdp/xsk
> redirect extension where we don't need the xsk map, and can just derive the
> target xsk from the rxq we received the traffic on.
Right, okay, makes sense.
>> Taking a step back, for AF_XDP we already support decoupling the
>> application-side access to the redirected packets from the interface,
>> through the use of sockets. Meaning that your use case here could just
>> as well be served by the control plane setting up AF_XDP socket(s) on
>> the physical NIC and passing those into qemu, in which case we don't
>> need this whole queue proxying dance at all.
>
> Cilium should not act as a proxy handing out xsk sockets. Existing
> applications expect a netdev from the kernel side and should not need to be
> rewritten just to implement one CNI's protocol. Also, the memory should not
> be accounted against Cilium but rather against the application Pod itself
> which is consuming af_xdp. Further, on up/downgrades we expect the data
> plane to be completely decoupled from the control plane; if Cilium owned
> the sockets, that would be disruptive, which is a no-go.
Hmm, okay, so the kernel-side RXQ buffering is to make it transparent to
the application inside the pod? I guess that makes sense; would be good
to mention in the commit message, though (+ the bit about the map
needing to be in sync) :)
>> So, erm, what am I missing that makes this worth it (for AF_XDP; I can
>> see how it is useful for other things)? :)
> Yeap there are other use cases we've seen from Cilium users as well,
> e.g. running dpdk applications on top of af_xdp in regular k8s Pods.
Yeah, being able to do stuff like that without having to rely on SR-IOV
would be cool, certainly!
-Toke
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH net-next 14/20] netkit: Add single device mode for netkit
2025-09-19 21:31 ` [PATCH net-next 14/20] netkit: Add single device mode for netkit Daniel Borkmann
@ 2025-09-27 1:10 ` Jordan Rife
2025-09-29 7:55 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Jordan Rife @ 2025-09-27 1:10 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, Sep 19, 2025 at 11:31:47PM +0200, Daniel Borkmann wrote:
> Add a single device mode for netkit instead of netkit pairs. The primary
> target for the paired devices is to connect network namespaces, of course,
> and support has been implemented in projects like Cilium [0]. For the rxq
> binding the plan is to support two main scenarios related to single device
> mode:
>
> * For the use-case of io_uring zero-copy, the control plane can either
> set up a netkit pair where the peer device can perform rxq binding which
> is then tied to the lifetime of the peer device, or the control plane
> can use a regular netkit pair to connect the hostns to a Pod/container
> and dynamically add/remove rxq bindings through a single device without
> having to interrupt the device pair. In the case of io_uring, the memory
> pool is used as skb non-linear pages, and thus the skb will go its way
> through the regular stack into netkit. Things like the netkit policy when
> no BPF is attached or skb scrubbing etc apply as-is in case the paired
> devices are used, or if the backend memory is tied to the single device
> and traffic goes through a paired device.
>
> * For the use-case of AF_XDP, the control plane needs to use netkit in the
> single device mode. The single device mode currently enforces only a
> pass policy when no BPF is attached, and does not yet support BPF link
> attachments for AF_XDP. skbs sent to that device get dropped at the
> moment. Given AF_XDP operates at a lower layer of the stack tying this
> to the netkit pair did not make sense. In future, the plan is to allow
> BPF at the XDP layer which can: i) process traffic coming from the AF_XDP
> application (e.g. QEMU with AF_XDP backend) to filter egress traffic or
> to push selected egress traffic up to the single netkit device to the
> local stack (e.g. DHCP requests), and ii) vice-versa skbs sent to the
> single netkit into the AF_XDP application (e.g. DHCP replies). Also,
> the control-plane can dynamically add/remove rxq bindings for the single
> netkit device without having to interrupt (e.g. down/up cycle) the main
> netkit pair for the Pod which has traffic going in and out.
This seems very cool. I'm curious, in single device mode, how would
traffic originating in the host ns make its way into a pod hosting a
QEMU VM using an AF_XDP backend? How would redirection work between two
such VMs on the same host?
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
> ---
> drivers/net/netkit.c | 108 ++++++++++++++++++++++-------------
> include/uapi/linux/if_link.h | 6 ++
> 2 files changed, 74 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 492be60f2e70..ceb1393ee599 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -25,6 +25,7 @@ struct netkit {
>
> /* Needed in slow-path */
> enum netkit_mode mode;
> + enum netkit_pairing pair;
> bool primary;
> u32 headroom;
> };
> @@ -133,6 +134,10 @@ static int netkit_open(struct net_device *dev)
> struct netkit *nk = netkit_priv(dev);
> struct net_device *peer = rtnl_dereference(nk->peer);
>
> + if (nk->pair == NETKIT_DEVICE_SINGLE) {
> + netif_carrier_on(dev);
> + return 0;
> + }
> if (!peer)
> return -ENOTCONN;
> if (peer->flags & IFF_UP) {
> @@ -333,6 +338,7 @@ static int netkit_new_link(struct net_device *dev,
> enum netkit_scrub scrub_prim = NETKIT_SCRUB_DEFAULT;
> enum netkit_scrub scrub_peer = NETKIT_SCRUB_DEFAULT;
> struct nlattr *peer_tb[IFLA_MAX + 1], **tbp, *attr;
> + enum netkit_pairing pair = NETKIT_DEVICE_PAIR;
> enum netkit_action policy_prim = NETKIT_PASS;
> enum netkit_action policy_peer = NETKIT_PASS;
> struct nlattr **data = params->data;
> @@ -341,7 +347,7 @@ static int netkit_new_link(struct net_device *dev,
> struct nlattr **tb = params->tb;
> u16 headroom = 0, tailroom = 0;
> struct ifinfomsg *ifmp = NULL;
> - struct net_device *peer;
> + struct net_device *peer = NULL;
> char ifname[IFNAMSIZ];
> struct netkit *nk;
> int err;
> @@ -378,6 +384,8 @@ static int netkit_new_link(struct net_device *dev,
> headroom = nla_get_u16(data[IFLA_NETKIT_HEADROOM]);
> if (data[IFLA_NETKIT_TAILROOM])
> tailroom = nla_get_u16(data[IFLA_NETKIT_TAILROOM]);
> + if (data[IFLA_NETKIT_PAIRING])
> + pair = nla_get_u32(data[IFLA_NETKIT_PAIRING]);
> }
>
> if (ifmp && tbp[IFLA_IFNAME]) {
> @@ -390,45 +398,49 @@ static int netkit_new_link(struct net_device *dev,
> if (mode != NETKIT_L2 &&
> (tb[IFLA_ADDRESS] || tbp[IFLA_ADDRESS]))
> return -EOPNOTSUPP;
> + if (pair != NETKIT_DEVICE_PAIR &&
nit: IMO this would be a little clearer without the inverted logic:
if (pair == NETKIT_DEVICE_SINGLE &&
> + (tb != tbp ||
> + tb[IFLA_NETKIT_PEER_POLICY] ||
> + tb[IFLA_NETKIT_PEER_SCRUB] ||
> + policy_prim != NETKIT_PASS))
> + return -EOPNOTSUPP;
>
> - peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
> - &netkit_link_ops, tbp, extack);
> - if (IS_ERR(peer))
> - return PTR_ERR(peer);
> -
> - netif_inherit_tso_max(peer, dev);
> - if (headroom) {
> - peer->needed_headroom = headroom;
> - dev->needed_headroom = headroom;
> - }
> - if (tailroom) {
> - peer->needed_tailroom = tailroom;
> - dev->needed_tailroom = tailroom;
> - }
> -
> - if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
> - eth_hw_addr_random(peer);
> - if (ifmp && dev->ifindex)
> - peer->ifindex = ifmp->ifi_index;
> -
> - nk = netkit_priv(peer);
> - nk->primary = false;
> - nk->policy = policy_peer;
> - nk->scrub = scrub_peer;
> - nk->mode = mode;
> - nk->headroom = headroom;
> - bpf_mprog_bundle_init(&nk->bundle);
> + if (pair == NETKIT_DEVICE_PAIR) {
> + peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
> + &netkit_link_ops, tbp, extack);
> + if (IS_ERR(peer))
> + return PTR_ERR(peer);
> +
> + netif_inherit_tso_max(peer, dev);
> + if (headroom)
> + peer->needed_headroom = headroom;
> + if (tailroom)
> + peer->needed_tailroom = tailroom;
> + if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
> + eth_hw_addr_random(peer);
> + if (ifmp && dev->ifindex)
> + peer->ifindex = ifmp->ifi_index;
>
> - err = register_netdevice(peer);
> - if (err < 0)
> - goto err_register_peer;
> - netif_carrier_off(peer);
> - if (mode == NETKIT_L2)
> - dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
> + nk = netkit_priv(peer);
> + nk->primary = false;
> + nk->policy = policy_peer;
> + nk->scrub = scrub_peer;
> + nk->mode = mode;
> + nk->pair = pair;
> + nk->headroom = headroom;
> + bpf_mprog_bundle_init(&nk->bundle);
> +
> + err = register_netdevice(peer);
> + if (err < 0)
> + goto err_register_peer;
> + netif_carrier_off(peer);
> + if (mode == NETKIT_L2)
> + dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
>
> - err = rtnl_configure_link(peer, NULL, 0, NULL);
> - if (err < 0)
> - goto err_configure_peer;
> + err = rtnl_configure_link(peer, NULL, 0, NULL);
> + if (err < 0)
> + goto err_configure_peer;
> + }
>
> if (mode == NETKIT_L2 && !tb[IFLA_ADDRESS])
> eth_hw_addr_random(dev);
> @@ -436,12 +448,17 @@ static int netkit_new_link(struct net_device *dev,
> nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
> else
> strscpy(dev->name, "nk%d", IFNAMSIZ);
> + if (headroom)
> + dev->needed_headroom = headroom;
> + if (tailroom)
> + dev->needed_tailroom = tailroom;
>
> nk = netkit_priv(dev);
> nk->primary = true;
> nk->policy = policy_prim;
> nk->scrub = scrub_prim;
> nk->mode = mode;
> + nk->pair = pair;
> nk->headroom = headroom;
> bpf_mprog_bundle_init(&nk->bundle);
>
> @@ -453,10 +470,12 @@ static int netkit_new_link(struct net_device *dev,
> dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
>
> rcu_assign_pointer(netkit_priv(dev)->peer, peer);
> - rcu_assign_pointer(netkit_priv(peer)->peer, dev);
> + if (peer)
> + rcu_assign_pointer(netkit_priv(peer)->peer, dev);
> return 0;
> err_configure_peer:
> - unregister_netdevice(peer);
> + if (peer)
> + unregister_netdevice(peer);
> return err;
> err_register_peer:
> free_netdev(peer);
> @@ -516,6 +535,8 @@ static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 whi
> nk = netkit_priv(dev);
> if (!nk->primary)
> return ERR_PTR(-EACCES);
> + if (nk->pair == NETKIT_DEVICE_SINGLE)
> + return ERR_PTR(-EOPNOTSUPP);
> if (which == BPF_NETKIT_PEER) {
> dev = rcu_dereference_rtnl(nk->peer);
> if (!dev)
> @@ -877,6 +898,7 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
> { IFLA_NETKIT_PEER_INFO, "peer info" },
> { IFLA_NETKIT_HEADROOM, "headroom" },
> { IFLA_NETKIT_TAILROOM, "tailroom" },
> + { IFLA_NETKIT_PAIRING, "pairing" },
> };
>
> if (!nk->primary) {
> @@ -896,9 +918,11 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
> }
>
> if (data[IFLA_NETKIT_POLICY]) {
> + err = -EOPNOTSUPP;
> attr = data[IFLA_NETKIT_POLICY];
> policy = nla_get_u32(attr);
> - err = netkit_check_policy(policy, attr, extack);
> + if (nk->pair == NETKIT_DEVICE_PAIR)
> + err = netkit_check_policy(policy, attr, extack);
> if (err)
> return err;
> WRITE_ONCE(nk->policy, policy);
> @@ -929,6 +953,7 @@ static size_t netkit_get_size(const struct net_device *dev)
> nla_total_size(sizeof(u8)) + /* IFLA_NETKIT_PRIMARY */
> nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_HEADROOM */
> nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_TAILROOM */
> + nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PAIRING */
> 0;
> }
>
> @@ -949,6 +974,8 @@ static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
> return -EMSGSIZE;
> if (nla_put_u16(skb, IFLA_NETKIT_TAILROOM, dev->needed_tailroom))
> return -EMSGSIZE;
> + if (nla_put_u32(skb, IFLA_NETKIT_PAIRING, nk->pair))
> + return -EMSGSIZE;
>
> if (peer) {
> nk = netkit_priv(peer);
> @@ -970,6 +997,7 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
> [IFLA_NETKIT_TAILROOM] = { .type = NLA_U16 },
> [IFLA_NETKIT_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
> [IFLA_NETKIT_PEER_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
> + [IFLA_NETKIT_PAIRING] = NLA_POLICY_MAX(NLA_U32, NETKIT_DEVICE_SINGLE),
> [IFLA_NETKIT_PRIMARY] = { .type = NLA_REJECT,
> .reject_message = "Primary attribute is read-only" },
> };
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 45f56c9f95d9..4a2f781f3cca 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -1294,6 +1294,11 @@ enum netkit_mode {
> NETKIT_L3,
> };
>
> +enum netkit_pairing {
> + NETKIT_DEVICE_PAIR,
> + NETKIT_DEVICE_SINGLE,
> +};
> +
> /* NETKIT_SCRUB_NONE leaves clearing skb->{mark,priority} up to
> * the BPF program if attached. This also means the latter can
> * consume the two fields if they were populated earlier.
> @@ -1318,6 +1323,7 @@ enum {
> IFLA_NETKIT_PEER_SCRUB,
> IFLA_NETKIT_HEADROOM,
> IFLA_NETKIT_TAILROOM,
> + IFLA_NETKIT_PAIRING,
> __IFLA_NETKIT_MAX,
> };
> #define IFLA_NETKIT_MAX (__IFLA_NETKIT_MAX - 1)
> --
> 2.43.0
>
Reviewed-by: Jordan Rife <jordan@jrife.io>
* Re: [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc
2025-09-19 21:31 ` [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc Daniel Borkmann
@ 2025-09-27 1:17 ` Jordan Rife
2025-09-29 7:50 ` Daniel Borkmann
0 siblings, 1 reply; 64+ messages in thread
From: Jordan Rife @ 2025-09-27 1:17 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, maciej.fijalkowski, magnus.karlsson,
David Wei
On Fri, Sep 19, 2025 at 11:31:49PM +0200, Daniel Borkmann wrote:
> From: David Wei <dw@davidwei.uk>
>
> Implement rtnl_link_ops->alloc that allows the number of rx queues to be
> set when netkit is created. By default, netkit has only a single rxq (and
> single txq). The number of queues is deliberately not allowed to be changed
> via ethtool -L and is fixed for the lifetime of a netkit instance.
>
> For netkit device creation, a numrxqueues value larger than one can be
> specified. These rxqs are then mappable to real rxqs in physical netdevs:
>
> ip link add numrxqueues 2 type netkit
>
> As a starting point, the limit of numrxqueues for netkit is currently set
> to 2, but future work is going to allow mapping multiple real rxqs from
Is the reason for the limit just because QEMU can't take advantage of
more today or is there some other technical limitation?
> physical netdevs, potentially at some point even from different physical
> netdevs.
What would be the use case for having proxied queues from multiple
physical netdevs to the same netkit device? Couldn't you just create
multiple netkit devices, one per physical device?
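For reference, based on the get_channels callback in the diff below, a
freshly created netkit with `numrxqueues 2` would be expected to report
channels along these lines (device name assumed):

  $ ethtool -l nk0
  Channel parameters for nk0:
  Pre-set maximums:
  RX:             2
  TX:             1
  Other:          0
  Combined:       1
  Current hardware settings:
  RX:             1
  TX:             1
  Other:          0
  Combined:       0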
> Signed-off-by: David Wei <dw@davidwei.uk>
> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
> drivers/net/netkit.c | 78 ++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 72 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 8f1285513d82..e5dfbf7ea351 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -9,11 +9,19 @@
> #include <linux/bpf_mprog.h>
> #include <linux/indirect_call_wrapper.h>
>
> +#include <net/netdev_queues.h>
> +#include <net/netdev_rx_queue.h>
> #include <net/netkit.h>
> #include <net/dst.h>
> #include <net/tcx.h>
>
> -#define DRV_NAME "netkit"
> +#define NETKIT_DRV_NAME "netkit"
> +
> +#define NETKIT_NUM_TX_QUEUES_MAX 1
> +#define NETKIT_NUM_RX_QUEUES_MAX 2
> +
> +#define NETKIT_NUM_TX_QUEUES_REAL 1
> +#define NETKIT_NUM_RX_QUEUES_REAL 1
>
> struct netkit {
> __cacheline_group_begin(netkit_fastpath);
> @@ -37,6 +45,8 @@ struct netkit_link {
> struct net_device *dev;
> };
>
> +static struct rtnl_link_ops netkit_link_ops;
> +
> static __always_inline int
> netkit_run(const struct bpf_mprog_entry *entry, struct sk_buff *skb,
> enum netkit_action ret)
> @@ -243,13 +253,69 @@ static const struct net_device_ops netkit_netdev_ops = {
> static void netkit_get_drvinfo(struct net_device *dev,
> struct ethtool_drvinfo *info)
> {
> - strscpy(info->driver, DRV_NAME, sizeof(info->driver));
> + strscpy(info->driver, NETKIT_DRV_NAME, sizeof(info->driver));
> +}
> +
> +static void netkit_get_channels(struct net_device *dev,
> + struct ethtool_channels *channels)
> +{
> + channels->max_rx = dev->num_rx_queues;
> + channels->max_tx = dev->num_tx_queues;
> + channels->max_other = 0;
> + channels->max_combined = 1;
> + channels->rx_count = dev->real_num_rx_queues;
> + channels->tx_count = dev->real_num_tx_queues;
> + channels->other_count = 0;
> + channels->combined_count = 0;
> }
>
> static const struct ethtool_ops netkit_ethtool_ops = {
> .get_drvinfo = netkit_get_drvinfo,
> + .get_channels = netkit_get_channels,
> };
>
> +static struct net_device *netkit_alloc(struct nlattr *tb[],
> + const char *ifname,
> + unsigned char name_assign_type,
> + unsigned int num_tx_queues,
> + unsigned int num_rx_queues)
> +{
> + const struct rtnl_link_ops *ops = &netkit_link_ops;
> + struct net_device *dev;
> +
> + if (num_tx_queues > NETKIT_NUM_TX_QUEUES_MAX ||
> + num_rx_queues > NETKIT_NUM_RX_QUEUES_MAX)
> + return ERR_PTR(-EOPNOTSUPP);
> +
> + dev = alloc_netdev_mqs(ops->priv_size, ifname,
> + name_assign_type, ops->setup,
> + num_tx_queues, num_rx_queues);
> + if (dev) {
> + dev->real_num_tx_queues = NETKIT_NUM_TX_QUEUES_REAL;
> + dev->real_num_rx_queues = NETKIT_NUM_RX_QUEUES_REAL;
> + }
> + return dev;
> +}
> +
> +static void netkit_queue_unpeer(struct net_device *dev)
> +{
> + struct netdev_rx_queue *src_rxq, *dst_rxq;
> + struct net_device *src_dev;
> + int i;
> +
> + if (dev->real_num_rx_queues == 1)
> + return;
> + for (i = 1; i < dev->real_num_rx_queues; i++) {
> + dst_rxq = __netif_get_rx_queue(dev, i);
> + src_rxq = dst_rxq->peer;
> + src_dev = src_rxq->dev;
> +
> + netdev_lock(src_dev);
> + netdev_rx_queue_unpeer(src_dev, src_rxq, dst_rxq);
> + netdev_unlock(src_dev);
> + }
> +}
> +
> static void netkit_setup(struct net_device *dev)
> {
> static const netdev_features_t netkit_features_hw_vlan =
> @@ -330,8 +396,6 @@ static int netkit_validate(struct nlattr *tb[], struct nlattr *data[],
> return 0;
> }
>
> -static struct rtnl_link_ops netkit_link_ops;
> -
> static int netkit_new_link(struct net_device *dev,
> struct rtnl_newlink_params *params,
> struct netlink_ext_ack *extack)
> @@ -865,6 +929,7 @@ static void netkit_release_all(struct net_device *dev)
> static void netkit_uninit(struct net_device *dev)
> {
> netkit_release_all(dev);
> + netkit_queue_unpeer(dev);
> }
>
> static void netkit_del_link(struct net_device *dev, struct list_head *head)
> @@ -1005,8 +1070,9 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
> };
>
> static struct rtnl_link_ops netkit_link_ops = {
> - .kind = DRV_NAME,
> + .kind = NETKIT_DRV_NAME,
> .priv_size = sizeof(struct netkit),
> + .alloc = netkit_alloc,
> .setup = netkit_setup,
> .newlink = netkit_new_link,
> .dellink = netkit_del_link,
> @@ -1042,4 +1108,4 @@ MODULE_DESCRIPTION("BPF-programmable network device");
> MODULE_AUTHOR("Daniel Borkmann <daniel@iogearbox.net>");
> MODULE_AUTHOR("Nikolay Aleksandrov <razor@blackwall.org>");
> MODULE_LICENSE("GPL");
> -MODULE_ALIAS_RTNL_LINK(DRV_NAME);
> +MODULE_ALIAS_RTNL_LINK(NETKIT_DRV_NAME);
> --
> 2.43.0
>
Reviewed-by: Jordan Rife <jordan@jrife.io>
* Re: [PATCH net-next 16/20] netkit: Implement rtnl_link_ops->alloc
2025-09-27 1:17 ` Jordan Rife
@ 2025-09-29 7:50 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-29 7:50 UTC (permalink / raw)
To: Jordan Rife
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, maciej.fijalkowski, magnus.karlsson,
David Wei
On 9/27/25 3:17 AM, Jordan Rife wrote:
> On Fri, Sep 19, 2025 at 11:31:49PM +0200, Daniel Borkmann wrote:
>> From: David Wei <dw@davidwei.uk>
>>
>> Implement rtnl_link_ops->alloc that allows the number of rx queues to be
>> set when netkit is created. By default, netkit has only a single rxq (and
>> single txq). The number of queues is deliberately not allowed to be changed
>> via ethtool -L and is fixed for the lifetime of a netkit instance.
>>
>> For netkit device creation, a numrxqueues value larger than one can be
>> specified. These rxqs are then mappable to real rxqs in physical netdevs:
>>
>> ip link add numrxqueues 2 type netkit
>>
>> As a starting point, the limit of numrxqueues for netkit is currently set
>> to 2, but future work is going to allow mapping multiple real rxqs from
>
> Is the reason for the limit just because QEMU can't take advantage of
> more today or is there some other technical limitation?
Mainly just to keep the initial series smaller; the plan is to lift this to
more queues for both io_uring and af_xdp. QEMU supports multiple queues for
af_xdp, but when I spoke to QEMU folks, there is still the issue that QEMU
internally needs to be able to process inbound traffic through multiple
threads, so it's not a backend but a QEMU-internal limitation atm.
>> physical netdevs, potentially at some point even from different physical
>> netdevs.
>
> What would be the use case for having proxied queues from multiple
> physical netdevs to the same netkit device? Couldn't you just create
> multiple netkit devices, one per physical device?
Yes, multiple netkit devices would work as well in that case.
* Re: [PATCH net-next 14/20] netkit: Add single device mode for netkit
2025-09-27 1:10 ` Jordan Rife
@ 2025-09-29 7:55 ` Daniel Borkmann
0 siblings, 0 replies; 64+ messages in thread
From: Daniel Borkmann @ 2025-09-29 7:55 UTC (permalink / raw)
To: Jordan Rife
Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
john.fastabend, martin.lau, maciej.fijalkowski, magnus.karlsson,
David Wei
On 9/27/25 3:10 AM, Jordan Rife wrote:
> On Fri, Sep 19, 2025 at 11:31:47PM +0200, Daniel Borkmann wrote:
[...]
>
> This seems very cool. I'm curious, in single device mode, how would
> traffic originating in the host ns make its way into a pod hosting a
> QEMU VM using an AF_XDP backend? How would redirection work between two
> such VMs on the same host?
For this case the plan would be to have the regular netkit pair connecting
the host to the Pod through skbs, and then we'd direct traffic to the
queue-bound netkit device, which injects it into the af_xdp socket in
ndo_start_xmit.
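To illustrate the container-side view: the AF_XDP application only deals
with the netkit ifname and the mapped queue id, while the kernel proxies
the pool to the underlying real rxq. A minimal, untested sketch assuming
libxdp's standard xsk API (the device name "nk0" and queue id 1 are
illustrative assumptions; default configs may need tuning on netkit):

  /* Container-side AF_XDP socket bound to a netkit mapped queue. */
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <xdp/xsk.h>

  #define NUM_FRAMES 4096
  #define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

  int main(void)
  {
          struct xsk_ring_prod fq, tx;
          struct xsk_ring_cons cq, rx;
          struct xsk_umem *umem;
          struct xsk_socket *xsk;
          void *bufs;

          bufs = mmap(NULL, NUM_FRAMES * FRAME_SIZE,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (bufs == MAP_FAILED)
                  return 1;
          /* NULL configs select the libxdp defaults. */
          if (xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE,
                               &fq, &cq, NULL))
                  return 1;
          /* Bind to the netkit device and its mapped queue id 1, i.e.
           * the rxq peered to a real rxq of the physical netdev. */
          if (xsk_socket__create(&xsk, "nk0", 1, umem, &rx, &tx, NULL))
                  return 1;
          /* ... populate fq, poll and consume rx as usual ... */
          xsk_socket__delete(xsk);
          xsk_umem__delete(umem);
          return 0;
  }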
[...]
>> + if (pair != NETKIT_DEVICE_PAIR &&
>
> nit: IMO this would be a little clearer without the inverted logic:
Ack, will fix.
[...]
>
> Reviewed-by: Jordan Rife <jordan@jrife.io>
Thanks!