From: Daniel Borkmann <daniel@iogearbox.net>
To: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, kuba@kernel.org, davem@davemloft.net,
razor@blackwall.org, pabeni@redhat.com, willemb@google.com,
sdf@fomichev.me, john.fastabend@gmail.com, martin.lau@kernel.org,
jordan@jrife.io, maciej.fijalkowski@intel.com,
magnus.karlsson@intel.com, dw@davidwei.uk, toke@redhat.com,
yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: [PATCH net-next v7 09/16] netkit: Add single device mode for netkit
Date: Thu, 15 Jan 2026 09:25:56 +0100
Message-ID: <20260115082603.219152-10-daniel@iogearbox.net>
In-Reply-To: <20260115082603.219152-1-daniel@iogearbox.net>
Add a single device mode for netkit as an alternative to netkit pairs.
The primary purpose of the paired devices is, of course, to connect
network namespaces, and support for them has been implemented in projects
like Cilium [0]. For rxq leasing, the plan is to support two main
scenarios related to single device mode:
* For the io_uring zero-copy use case, the control plane can either set
  up a netkit pair where the peer device can perform rxq leasing, which
  is then tied to the lifetime of the peer device, or it can use a
  regular netkit pair to connect the hostns to a Pod/container and
  dynamically add/remove rxq leasing through a single device without
  having to interrupt the device pair. In the case of io_uring, the
  memory pool is used as skb non-linear pages, and thus the skb makes
  its way through the regular stack into netkit. Things like the netkit
  policy when no BPF is attached, skb scrubbing, etc. apply as-is when
  the paired devices are used, or when the backend memory is tied to the
  single device and traffic goes through a paired device.
* For the AF_XDP use case, the control plane needs to use netkit in
  single device mode. Single device mode currently enforces only a pass
  policy when no BPF is attached, and does not yet support BPF link
  attachments for AF_XDP. skbs sent to that device are dropped at the
  moment. Given that AF_XDP operates at a lower layer of the stack,
  tying this to the netkit pair did not make sense. In the future, the
  plan is to allow BPF at the XDP layer which can: i) process traffic
  coming from the AF_XDP application (e.g. QEMU with AF_XDP backend) to
  filter egress traffic or to push selected egress traffic up via the
  single netkit device to the local stack (e.g. DHCP requests), and
  ii) vice versa, push skbs sent to the single netkit device into the
  AF_XDP application (e.g. DHCP replies). Also, the control plane can
  dynamically manage rxq leasing for the single netkit device without
  having to interrupt (e.g. down/up cycle) the main netkit pair for the
  Pod which has traffic going in and out.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: David Wei <dw@davidwei.uk>
Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Jordan Rife <jordan@jrife.io>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://docs.cilium.io/en/stable/operations/performance/tuning/#netkit-device-mode [0]
---
drivers/net/netkit.c | 110 ++++++++++++++++++++++-------------
include/uapi/linux/if_link.h | 6 ++
2 files changed, 76 insertions(+), 40 deletions(-)
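
Note for readers (not part of the commit): the restrictions that single
device mode places on link creation and on later policy changes can be
summarized with the small userspace model below. It is only a sketch of
the checks as they appear in this diff, not kernel code. The names
model_action, model_newlink_req, model_newlink_check,
model_change_policy_check and model_selfcheck are hypothetical, the
"tb != tbp" test is assumed to mean that separate peer link attributes
were parsed, and the policy validation done by netkit_check_policy() is
assumed to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mirrors the enum this patch adds to include/uapi/linux/if_link.h. */
enum netkit_pairing {
	NETKIT_DEVICE_PAIR,
	NETKIT_DEVICE_SINGLE,
};

/* Hypothetical stand-in for enum netkit_action; only "pass" matters. */
enum model_action {
	MODEL_PASS,
	MODEL_DROP,
};

/* Hypothetical summary of what a create request carried. */
struct model_newlink_req {
	enum netkit_pairing pair;
	bool has_peer_ifinfo;	/* models "tb != tbp" (assumption) */
	bool seen_peer_attr;	/* models "seen_peer" from the diff */
	enum model_action policy_prim;
};

/*
 * Models the new check in netkit_new_link(): a single device must not
 * carry peer attributes and must keep the default pass policy.
 */
int model_newlink_check(const struct model_newlink_req *req)
{
	if (req->pair == NETKIT_DEVICE_SINGLE &&
	    (req->has_peer_ifinfo || req->seen_peer_attr ||
	     req->policy_prim != MODEL_PASS))
		return -EOPNOTSUPP;
	return 0;
}

/*
 * Models the IFLA_NETKIT_POLICY branch in netkit_change_link(): policy
 * changes are only considered for paired devices. The real code then
 * calls netkit_check_policy(); this model assumes that call succeeds.
 */
int model_change_policy_check(enum netkit_pairing pair)
{
	return pair == NETKIT_DEVICE_PAIR ? 0 : -EOPNOTSUPP;
}

/* Small usage example exercising both helpers. */
void model_selfcheck(void)
{
	struct model_newlink_req single_ok = {
		NETKIT_DEVICE_SINGLE, false, false, MODEL_PASS
	};
	struct model_newlink_req single_drop = {
		NETKIT_DEVICE_SINGLE, false, false, MODEL_DROP
	};

	assert(model_newlink_check(&single_ok) == 0);
	assert(model_newlink_check(&single_drop) == -EOPNOTSUPP);
	assert(model_change_policy_check(NETKIT_DEVICE_SINGLE) == -EOPNOTSUPP);
}
```

In this model a paired device is never rejected by the new check, which
matches the intent that existing netkit pair users see no behavior
change.
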
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 0a2fef7caccb..76332a98af92 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -26,6 +26,7 @@ struct netkit {
__cacheline_group_begin(netkit_slowpath);
enum netkit_mode mode;
+ enum netkit_pairing pair;
bool primary;
u32 headroom;
__cacheline_group_end(netkit_slowpath);
@@ -135,6 +136,10 @@ static int netkit_open(struct net_device *dev)
struct netkit *nk = netkit_priv(dev);
struct net_device *peer = rtnl_dereference(nk->peer);
+ if (nk->pair == NETKIT_DEVICE_SINGLE) {
+ netif_carrier_on(dev);
+ return 0;
+ }
if (!peer)
return -ENOTCONN;
if (peer->flags & IFF_UP) {
@@ -335,6 +340,7 @@ static int netkit_new_link(struct net_device *dev,
enum netkit_scrub scrub_prim = NETKIT_SCRUB_DEFAULT;
enum netkit_scrub scrub_peer = NETKIT_SCRUB_DEFAULT;
struct nlattr *peer_tb[IFLA_MAX + 1], **tbp, *attr;
+ enum netkit_pairing pair = NETKIT_DEVICE_PAIR;
enum netkit_action policy_prim = NETKIT_PASS;
enum netkit_action policy_peer = NETKIT_PASS;
struct nlattr **data = params->data;
@@ -343,7 +349,8 @@ static int netkit_new_link(struct net_device *dev,
struct nlattr **tb = params->tb;
u16 headroom = 0, tailroom = 0;
struct ifinfomsg *ifmp = NULL;
- struct net_device *peer;
+ struct net_device *peer = NULL;
+ bool seen_peer = false;
char ifname[IFNAMSIZ];
struct netkit *nk;
int err;
@@ -380,6 +387,12 @@ static int netkit_new_link(struct net_device *dev,
headroom = nla_get_u16(data[IFLA_NETKIT_HEADROOM]);
if (data[IFLA_NETKIT_TAILROOM])
tailroom = nla_get_u16(data[IFLA_NETKIT_TAILROOM]);
+ if (data[IFLA_NETKIT_PAIRING])
+ pair = nla_get_u32(data[IFLA_NETKIT_PAIRING]);
+
+ seen_peer = data[IFLA_NETKIT_PEER_INFO] ||
+ data[IFLA_NETKIT_PEER_SCRUB] ||
+ data[IFLA_NETKIT_PEER_POLICY];
}
if (ifmp && tbp[IFLA_IFNAME]) {
@@ -392,45 +405,46 @@ static int netkit_new_link(struct net_device *dev,
if (mode != NETKIT_L2 &&
(tb[IFLA_ADDRESS] || tbp[IFLA_ADDRESS]))
return -EOPNOTSUPP;
+ if (pair == NETKIT_DEVICE_SINGLE &&
+ (tb != tbp || seen_peer || policy_prim != NETKIT_PASS))
+ return -EOPNOTSUPP;
- peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
- &netkit_link_ops, tbp, extack);
- if (IS_ERR(peer))
- return PTR_ERR(peer);
-
- netif_inherit_tso_max(peer, dev);
- if (headroom) {
- peer->needed_headroom = headroom;
- dev->needed_headroom = headroom;
- }
- if (tailroom) {
- peer->needed_tailroom = tailroom;
- dev->needed_tailroom = tailroom;
- }
-
- if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
- eth_hw_addr_random(peer);
- if (ifmp && dev->ifindex)
- peer->ifindex = ifmp->ifi_index;
-
- nk = netkit_priv(peer);
- nk->primary = false;
- nk->policy = policy_peer;
- nk->scrub = scrub_peer;
- nk->mode = mode;
- nk->headroom = headroom;
- bpf_mprog_bundle_init(&nk->bundle);
+ if (pair == NETKIT_DEVICE_PAIR) {
+ peer = rtnl_create_link(peer_net, ifname, ifname_assign_type,
+ &netkit_link_ops, tbp, extack);
+ if (IS_ERR(peer))
+ return PTR_ERR(peer);
+
+ netif_inherit_tso_max(peer, dev);
+ if (headroom)
+ peer->needed_headroom = headroom;
+ if (tailroom)
+ peer->needed_tailroom = tailroom;
+ if (mode == NETKIT_L2 && !(ifmp && tbp[IFLA_ADDRESS]))
+ eth_hw_addr_random(peer);
+ if (ifmp && dev->ifindex)
+ peer->ifindex = ifmp->ifi_index;
- err = register_netdevice(peer);
- if (err < 0)
- goto err_register_peer;
- netif_carrier_off(peer);
- if (mode == NETKIT_L2)
- dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
+ nk = netkit_priv(peer);
+ nk->primary = false;
+ nk->policy = policy_peer;
+ nk->scrub = scrub_peer;
+ nk->mode = mode;
+ nk->pair = pair;
+ nk->headroom = headroom;
+ bpf_mprog_bundle_init(&nk->bundle);
+
+ err = register_netdevice(peer);
+ if (err < 0)
+ goto err_register_peer;
+ netif_carrier_off(peer);
+ if (mode == NETKIT_L2)
+ dev_change_flags(peer, peer->flags & ~IFF_NOARP, NULL);
- err = rtnl_configure_link(peer, NULL, 0, NULL);
- if (err < 0)
- goto err_configure_peer;
+ err = rtnl_configure_link(peer, NULL, 0, NULL);
+ if (err < 0)
+ goto err_configure_peer;
+ }
if (mode == NETKIT_L2 && !tb[IFLA_ADDRESS])
eth_hw_addr_random(dev);
@@ -438,12 +452,17 @@ static int netkit_new_link(struct net_device *dev,
nla_strscpy(dev->name, tb[IFLA_IFNAME], IFNAMSIZ);
else
strscpy(dev->name, "nk%d", IFNAMSIZ);
+ if (headroom)
+ dev->needed_headroom = headroom;
+ if (tailroom)
+ dev->needed_tailroom = tailroom;
nk = netkit_priv(dev);
nk->primary = true;
nk->policy = policy_prim;
nk->scrub = scrub_prim;
nk->mode = mode;
+ nk->pair = pair;
nk->headroom = headroom;
bpf_mprog_bundle_init(&nk->bundle);
@@ -455,10 +474,12 @@ static int netkit_new_link(struct net_device *dev,
dev_change_flags(dev, dev->flags & ~IFF_NOARP, NULL);
rcu_assign_pointer(netkit_priv(dev)->peer, peer);
- rcu_assign_pointer(netkit_priv(peer)->peer, dev);
+ if (peer)
+ rcu_assign_pointer(netkit_priv(peer)->peer, dev);
return 0;
err_configure_peer:
- unregister_netdevice(peer);
+ if (peer)
+ unregister_netdevice(peer);
return err;
err_register_peer:
free_netdev(peer);
@@ -518,6 +539,8 @@ static struct net_device *netkit_dev_fetch(struct net *net, u32 ifindex, u32 whi
nk = netkit_priv(dev);
if (!nk->primary)
return ERR_PTR(-EACCES);
+ if (nk->pair == NETKIT_DEVICE_SINGLE)
+ return ERR_PTR(-EOPNOTSUPP);
if (which == BPF_NETKIT_PEER) {
dev = rcu_dereference_rtnl(nk->peer);
if (!dev)
@@ -879,6 +902,7 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
{ IFLA_NETKIT_PEER_INFO, "peer info" },
{ IFLA_NETKIT_HEADROOM, "headroom" },
{ IFLA_NETKIT_TAILROOM, "tailroom" },
+ { IFLA_NETKIT_PAIRING, "pairing" },
};
if (!nk->primary) {
@@ -898,9 +922,11 @@ static int netkit_change_link(struct net_device *dev, struct nlattr *tb[],
}
if (data[IFLA_NETKIT_POLICY]) {
+ err = -EOPNOTSUPP;
attr = data[IFLA_NETKIT_POLICY];
policy = nla_get_u32(attr);
- err = netkit_check_policy(policy, attr, extack);
+ if (nk->pair == NETKIT_DEVICE_PAIR)
+ err = netkit_check_policy(policy, attr, extack);
if (err)
return err;
WRITE_ONCE(nk->policy, policy);
@@ -931,6 +957,7 @@ static size_t netkit_get_size(const struct net_device *dev)
nla_total_size(sizeof(u8)) + /* IFLA_NETKIT_PRIMARY */
nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_HEADROOM */
nla_total_size(sizeof(u16)) + /* IFLA_NETKIT_TAILROOM */
+ nla_total_size(sizeof(u32)) + /* IFLA_NETKIT_PAIRING */
0;
}
@@ -951,6 +978,8 @@ static int netkit_fill_info(struct sk_buff *skb, const struct net_device *dev)
return -EMSGSIZE;
if (nla_put_u16(skb, IFLA_NETKIT_TAILROOM, dev->needed_tailroom))
return -EMSGSIZE;
+ if (nla_put_u32(skb, IFLA_NETKIT_PAIRING, nk->pair))
+ return -EMSGSIZE;
if (peer) {
nk = netkit_priv(peer);
@@ -972,6 +1001,7 @@ static const struct nla_policy netkit_policy[IFLA_NETKIT_MAX + 1] = {
[IFLA_NETKIT_TAILROOM] = { .type = NLA_U16 },
[IFLA_NETKIT_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
[IFLA_NETKIT_PEER_SCRUB] = NLA_POLICY_MAX(NLA_U32, NETKIT_SCRUB_DEFAULT),
+ [IFLA_NETKIT_PAIRING] = NLA_POLICY_MAX(NLA_U32, NETKIT_DEVICE_SINGLE),
[IFLA_NETKIT_PRIMARY] = { .type = NLA_REJECT,
.reject_message = "Primary attribute is read-only" },
};
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 3b491d96e52e..bbd565757298 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1296,6 +1296,11 @@ enum netkit_mode {
NETKIT_L3,
};
+enum netkit_pairing {
+ NETKIT_DEVICE_PAIR,
+ NETKIT_DEVICE_SINGLE,
+};
+
/* NETKIT_SCRUB_NONE leaves clearing skb->{mark,priority} up to
* the BPF program if attached. This also means the latter can
* consume the two fields if they were populated earlier.
@@ -1320,6 +1325,7 @@ enum {
IFLA_NETKIT_PEER_SCRUB,
IFLA_NETKIT_HEADROOM,
IFLA_NETKIT_TAILROOM,
+ IFLA_NETKIT_PAIRING,
__IFLA_NETKIT_MAX,
};
#define IFLA_NETKIT_MAX (__IFLA_NETKIT_MAX - 1)
--
2.43.0