From: Zhu Yanjun <yanjun.zhu@linux.dev>
To: Daniel Borkmann <daniel@iogearbox.net>,
	netdev@vger.kernel.org,
	"yanjun.zhu@linux.dev" <yanjun.zhu@linux.dev>
Cc: bpf@vger.kernel.org, kuba@kernel.org, davem@davemloft.net,
	razor@blackwall.org, pabeni@redhat.com, willemb@google.com,
	sdf@fomichev.me, john.fastabend@gmail.com, martin.lau@kernel.org,
	jordan@jrife.io, maciej.fijalkowski@intel.com,
	magnus.karlsson@intel.com, dw@davidwei.uk, toke@redhat.com,
	yangzhenze@bytedance.com, wangdongdong.6@bytedance.com
Subject: Re: [PATCH net-next v11 13/14] netkit: Add xsk support for af_xdp applications
Date: Tue, 19 May 2026 20:10:52 -0700
Message-ID: <ea99efb4-6ac0-4b29-8b38-e9865b46526c@linux.dev>
In-Reply-To: <20260402231031.447597-14-daniel@iogearbox.net>

On 2026/4/2 16:10, Daniel Borkmann wrote:
> Enable support for AF_XDP applications to operate on a netkit device.
> The goal is that AF_XDP applications can natively consume AF_XDP
> from network namespaces. The use-case from Cilium side is to support
> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
> virtual machine management add-on for Kubernetes which aims to provide
> a common ground for virtualization. KubeVirt spawns the VMs inside
> Kubernetes Pods which reside in their own network namespace just like
> regular Pods.
> 
> Raw QEMU AF_XDP backend example with eth0 being a physical device with
> 16 queues where netkit is bound to the last queue (for multi-queue, an
> RSS context can be used if supported by the driver):
> 
>    # ethtool -X eth0 start 0 equal 15
>    # ethtool -X eth0 start 15 equal 1 context new
>    # ethtool --config-ntuple eth0 flow-type ether \
>              src 00:00:00:00:00:00 \
>              src-mask ff:ff:ff:ff:ff:ff \
>              dst $mac dst-mask 00:00:00:00:00:00 \
>              proto 0 proto-mask 0xffff action 15
>    [ ... setup BPF/XDP prog on eth0 to steer into shared xsk map ... ]
>    # ip netns add foo
>    # ip link add numrxqueues 2 nk type netkit single
>    # ynl --family netdev --output-json --do queue-create \
>          --json "{\"ifindex\": $(ifindex nk), \"type\": \"rx\", \
>                   \"lease\": { \"ifindex\": $(ifindex eth0), \
>                              \"queue\": { \"type\": \"rx\", \"id\": 15 } } }"
>    {'id': 1}
>    # ip link set nk netns foo
>    # ip netns exec foo ip link set lo up
>    # ip netns exec foo ip link set nk up
>    # ip netns exec foo qemu-system-x86_64 \
>            -kernel $kernel \
>            -drive file=${image_name},index=0,media=disk,format=raw \
>            -append "root=/dev/sda rw console=ttyS0" \
>            -cpu host \
>            -m $memory \
>            -enable-kvm \
>            -device virtio-net-pci,netdev=net0,mac=$mac \
>            -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
>            -nographic
> 
> We have tested the above against a dual-port Nvidia ConnectX-6 (mlx5)
> 100G NIC with successful network connectivity out of QEMU. An earlier
> iteration of this work was presented at LSF/MM/BPF [0] and more
> recently at LPC [1].
> 
> For a first starting point to connect everything with KubeVirt, bind
> mounting the xsk map from Cilium into the VM launcher Pod, which acts
> as a regular Kubernetes Pod, is not perfect but also not a big problem
> given it's out of reach of the application sitting inside the VM (and
> some of the control plane aspects are baked into the launcher Pod
> already), so the isolation barrier is still the VM. Eventually the
> goal is to have an XDP/XSK redirect extension where there is no need
> for the xsk map, and the BPF program can just derive the target xsk
> from the queue on which traffic was received.

Sorry for the late comment on this commit.

If I remember correctly, netkit already supports eBPF. Since eBPF can
cover everything XDP does, I'm wondering what the motivation was for
adding explicit XDP support to the netkit driver.

Thanks,
Zhu Yanjun

> 
> The exposure through netkit is because Cilium should not act as a
> proxy handing out xsk sockets. Existing applications expect a netdev
> from kernel side and should not need to be rewritten just to implement
> against a CNI's protocol. Also, the memory should not be accounted
> against Cilium but rather against the application Pod itself which is
> consuming AF_XDP. Further, on up/downgrades we expect the data plane
> to be completely decoupled from the control plane; if Cilium owned the
> sockets, that would be disruptive. Another use-case which opens up,
> and which users regularly ask for, is running DPDK applications on
> top of AF_XDP in regular Kubernetes Pods.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Co-developed-by: David Wei <dw@davidwei.uk>
> Signed-off-by: David Wei <dw@davidwei.uk>
> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
> Link: https://bpfconf.ebpf.io/bpfconf2025/bpfconf2025_material/lsfmmbpf_2025_netkit_borkmann.pdf [0]
> Link: https://lpc.events/event/19/contributions/2275/ [1]
> ---
>   drivers/net/netkit.c | 86 ++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 86 insertions(+)
> 
> diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
> index 1ec21aef348f..5619209329d5 100644
> --- a/drivers/net/netkit.c
> +++ b/drivers/net/netkit.c
> @@ -12,6 +12,7 @@
>   #include <net/netdev_lock.h>
>   #include <net/netdev_queues.h>
>   #include <net/netdev_rx_queue.h>
> +#include <net/xdp_sock_drv.h>
>   #include <net/netkit.h>
>   #include <net/dst.h>
>   #include <net/tcx.h>
> @@ -236,6 +237,86 @@ static void netkit_get_stats(struct net_device *dev,
>   	stats->tx_dropped = DEV_STATS_READ(dev, tx_dropped);
>   }
>   
> +static bool netkit_xsk_supported_at_phys(const struct net_device *dev)
> +{
> +	if (!dev->netdev_ops->ndo_bpf ||
> +	    !dev->netdev_ops->ndo_xdp_xmit ||
> +	    !dev->netdev_ops->ndo_xsk_wakeup)
> +		return false;
> +	return true;
> +}
> +
> +static int netkit_xsk(struct net_device *dev, struct netdev_bpf *xdp)
> +{
> +	struct netkit *nk = netkit_priv(dev);
> +	struct netdev_bpf xdp_lower;
> +	struct netdev_rx_queue *rxq;
> +	struct net_device *phys;
> +	bool create = false;
> +	int ret = -EBUSY;
> +
> +	switch (xdp->command) {
> +	case XDP_SETUP_XSK_POOL:
> +		if (nk->pair == NETKIT_DEVICE_PAIR)
> +			return -EOPNOTSUPP;
> +		if (xdp->xsk.queue_id >= dev->real_num_rx_queues)
> +			return -EINVAL;
> +
> +		rxq = __netif_get_rx_queue(dev, xdp->xsk.queue_id);
> +		if (!rxq->lease)
> +			return -EOPNOTSUPP;
> +
> +		phys = rxq->lease->dev;
> +		if (!netkit_xsk_supported_at_phys(phys))
> +			return -EOPNOTSUPP;
> +
> +		create = xdp->xsk.pool;
> +		memcpy(&xdp_lower, xdp, sizeof(xdp_lower));
> +		xdp_lower.xsk.queue_id = get_netdev_rx_queue_index(rxq->lease);
> +		break;
> +	case XDP_SETUP_PROG:
> +		return -EOPNOTSUPP;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	netdev_lock(phys);
> +	if (create &&
> +	    (phys->xdp_features & NETDEV_XDP_ACT_XSK) != NETDEV_XDP_ACT_XSK) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (!create || !dev_get_min_mp_channel_count(phys))
> +		ret = phys->netdev_ops->ndo_bpf(phys, &xdp_lower);
> +out:
> +	netdev_unlock(phys);
> +	return ret;
> +}
> +
> +static int netkit_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
> +{
> +	struct netdev_rx_queue *rxq, *rxq_lease;
> +	struct net_device *phys;
> +
> +	if (queue_id >= dev->real_num_rx_queues)
> +		return -EINVAL;
> +
> +	rxq = __netif_get_rx_queue(dev, queue_id);
> +	rxq_lease = READ_ONCE(rxq->lease);
> +	if (unlikely(!rxq_lease))
> +		return -EOPNOTSUPP;
> +
> +	/* netkit_xsk already validated full xsk support, hence it's
> +	 * fine to call into ndo_xsk_wakeup right away given this
> +	 * was a prerequisite to get here in the first place. The
> +	 * phys xsk support cannot change without tearing down the
> +	 * device (which clears the lease first).
> +	 */
> +	phys = rxq_lease->dev;
> +	return phys->netdev_ops->ndo_xsk_wakeup(phys,
> +			get_netdev_rx_queue_index(rxq_lease), flags);
> +}
> +
>   static int netkit_init(struct net_device *dev)
>   {
>   	netdev_lockdep_set_classes(dev);
> @@ -256,6 +337,8 @@ static const struct net_device_ops netkit_netdev_ops = {
>   	.ndo_get_peer_dev	= netkit_peer_dev,
>   	.ndo_get_stats64	= netkit_get_stats,
>   	.ndo_uninit		= netkit_uninit,
> +	.ndo_bpf		= netkit_xsk,
> +	.ndo_xsk_wakeup		= netkit_xsk_wakeup,
>   	.ndo_features_check	= passthru_features_check,
>   };
>   
> @@ -569,6 +652,9 @@ static int netkit_new_link(struct net_device *dev,
>   	nk->headroom = headroom;
>   	bpf_mprog_bundle_init(&nk->bundle);
>   
> +	if (pair == NETKIT_DEVICE_SINGLE)
> +		xdp_set_features_flag(dev, NETDEV_XDP_ACT_XSK);
> +
>   	err = register_netdevice(dev);
>   	if (err < 0)
>   		goto err_configure_peer;
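
For readers following the lease lookup in netkit_xsk() above, here is a
rough user-space sketch of the queue translation only. It is not kernel
code: struct model_dev, struct model_rxq, model_rxq_index() and
model_resolve_lease() are hypothetical stand-ins invented for
illustration, and the single xsk_capable flag is an assumption that
collapses the ndo_bpf/ndo_xdp_xmit/ndo_xsk_wakeup checks into one bool.
Locking, the xdp_features test and the memory provider check are left
out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_dev;

struct model_rxq {
	struct model_dev *dev;   /* device that owns this queue */
	struct model_rxq *lease; /* leased physical queue, NULL if none */
};

struct model_dev {
	struct model_rxq *rxqs;
	unsigned int real_num_rx_queues;
	bool xsk_capable; /* assumption: one flag for all phys ndo checks */
};

/* Index of a queue within its owning device, in the spirit of
 * get_netdev_rx_queue_index() in the patch.
 */
static unsigned int model_rxq_index(const struct model_rxq *rxq)
{
	return (unsigned int)(rxq - rxq->dev->rxqs);
}

/* Map (virtual device, queue id) to the leased physical device and
 * queue id. Returns 0 or a negative errno that mirrors the order of
 * the checks in the XDP_SETUP_XSK_POOL branch.
 */
static int model_resolve_lease(const struct model_dev *dev,
			       unsigned int queue_id,
			       struct model_dev **phys,
			       unsigned int *phys_queue_id)
{
	const struct model_rxq *rxq;

	assert(dev && phys && phys_queue_id);

	if (queue_id >= dev->real_num_rx_queues)
		return -EINVAL;

	rxq = &dev->rxqs[queue_id];
	if (!rxq->lease)
		return -EOPNOTSUPP; /* queue was never leased */
	if (!rxq->lease->dev->xsk_capable)
		return -EOPNOTSUPP; /* phys device lacks xsk support */

	*phys = rxq->lease->dev;
	*phys_queue_id = model_rxq_index(rxq->lease);
	return 0;
}
```

With the setup from the commit message (nk with two rx queues, queue 1
leased from eth0 queue 15), resolving nk queue 1 in this model yields
eth0 and queue id 15, while nk queue 0 has no lease and is rejected
with -EOPNOTSUPP.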

