From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Daniel Borkmann <daniel@iogearbox.net>, netdev@vger.kernel.org
Cc: bpf@vger.kernel.org, kuba@kernel.org, davem@davemloft.net,
	razor@blackwall.org, pabeni@redhat.com, willemb@google.com,
	sdf@fomichev.me, john.fastabend@gmail.com, martin.lau@kernel.org,
	jordan@jrife.io, maciej.fijalkowski@intel.com,
	magnus.karlsson@intel.com, David Wei <dw@davidwei.uk>
Subject: Re: [PATCH net-next 19/20] netkit: Add xsk support for af_xdp applications
Date: Fri, 26 Sep 2025 10:55:57 +0200
Message-ID: <87plbdoan6.fsf@toke.dk>
In-Reply-To: <5d139efa-c78e-4323-b79d-bbf566ac19b8@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 9/23/25 1:42 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>> 
>>> Enable support for AF_XDP applications to operate on a netkit device.
>>> The goal is that AF_XDP applications can natively consume AF_XDP
>>> from network namespaces. The use-case from Cilium side is to support
>>> Kubernetes KubeVirt VMs through QEMU's AF_XDP backend. KubeVirt is a
>>> virtual machine management add-on for Kubernetes which aims to provide
>>> a common ground for virtualization. KubeVirt spawns the VMs inside
>>> Kubernetes Pods which reside in their own network namespace just like
>>> regular Pods.
>>>
>>> Raw QEMU AF_XDP backend example with eth0 being a physical device with
>>> 16 queues where netkit is bound to the last queue (for multi-queue RSS
>>> context can be used if supported by the driver):
>>>
>>>    # ethtool -X eth0 start 0 equal 15
>>>    # ethtool -X eth0 start 15 equal 1 context new
>>>    # ethtool --config-ntuple eth0 flow-type ether \
>>>              src 00:00:00:00:00:00 \
>>>              src-mask ff:ff:ff:ff:ff:ff \
>>>              dst $mac dst-mask 00:00:00:00:00:00 \
>>>              proto 0 proto-mask 0xffff action 15
>>>    # ip netns add foo
>>>    # ip link add numrxqueues 2 nk type netkit single
>>>    # ynl-bind eth0 15 nk
>>>    # ip link set nk netns foo
>>>    # ip netns exec foo ip link set lo up
>>>    # ip netns exec foo ip link set nk up
>>>    # ip netns exec foo qemu-system-x86_64 \
>>>            -kernel $kernel \
>>>            -drive file=${image_name},index=0,media=disk,format=raw \
>>>            -append "root=/dev/sda rw console=ttyS0" \
>>>            -cpu host \
>>>            -m $memory \
>>>            -enable-kvm \
>>>            -device virtio-net-pci,netdev=net0,mac=$mac \
>>>            -netdev af-xdp,ifname=nk,id=net0,mode=native,queues=1,start-queue=1,inhibit=on,map-path=$dir/xsks_map \
>>>            -nographic
>> 
>> So AFAICT, this example relies on the control plane installing an XDP
>> program on the physical NIC which will redirect into the right socket;
>> and since in this example, qemu will install the XSK socket at index 1
>> in the xsk map, that XDP program will also need to be aware of the queue
>> index mapping. I can see from your qemu commit[0] that there's support
>> on the qemu side for specifying an offset into the map to avoid having
>> to do this translation in the XDP program, but at the very least that
>> makes this example incomplete, no?
>> 
>> However, even with a complete example, this breaks isolation in the
>> sense that the entire XSK map is visible inside the pod, so a
>> misbehaving qemu could interfere with traffic on other queues (by
>> clearing the map, say). Which seems less than ideal?
>
> For getting to a first starting point to connect all things with KubeVirt,
> bind mounting the xsk map from Cilium into the VM launcher Pod (which acts
> as a regular K8s Pod), while not perfect, is not a big issue given it's out
> of reach from the application sitting inside the VM (and some of the
> control plane aspects are baked into the launcher Pod already), so the
> isolation barrier is still the VM. Eventually my goal is to have an xdp/xsk
> redirect extension where we don't need to have the xsk map, and can just
> derive the target xsk through the rxq we received traffic on.

Right, okay, makes sense.

>> Taking a step back, for AF_XDP we already support decoupling the
>> application-side access to the redirected packets from the interface,
>> through the use of sockets. Meaning that your use case here could just
>> as well be served by the control plane setting up AF_XDP socket(s) on
>> the physical NIC and passing those into qemu, in which case we don't
>> need this whole queue proxying dance at all.
>
> Cilium should not act as a proxy handing out xsk sockets. Existing
> applications expect a netdev from kernel side and should not need to
> rewrite just to implement one CNI's protocol. Also, all the memory
> should not be accounted against Cilium but rather the application Pod
> itself which is consuming af_xdp. Further, on up/downgrades we expect
> the data plane to be completely decoupled from the control plane;
> if Cilium owned the sockets, that would be disruptive, which is a
> no-go.

Hmm, okay, so the kernel-side RXQ buffering is to make it transparent to
the application inside the pod? I guess that makes sense; would be good
to mention in the commit message, though (+ the bit about the map
needing to be in sync) :)
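
To make the map-sync point concrete, here is a rough sketch in Python. It is a
toy model only, not kernel or qemu code: the dict stands in for the xsk map,
redirect() stands in for the XDP program on the physical NIC, and the names
PHYS_QUEUE, START_QUEUE, MAP_OFFSET and the "qemu-xsk" label are all made up
for illustration. The queue numbers (15 on eth0, start-queue=1 in qemu) are
taken from the example in the commit message, and the offset option is only
modelled as I understand it from the discussion above.

```python
# Toy model (assumption: not real kernel/qemu behaviour, just the indexing).
PHYS_QUEUE = 15   # eth0 queue bound to netkit in the quoted example
START_QUEUE = 1   # qemu's start-queue=1: the socket lands in slot 1


def redirect(xsks_map, rx_queue_index, translate=None):
    """Return the socket that would receive the packet, or None on a miss."""
    slot = translate(rx_queue_index) if translate else rx_queue_index
    return xsks_map.get(slot)


# qemu populates the map using its own (netkit-side) queue index.
xsks_map = {START_QUEUE: "qemu-xsk"}

# Looking up by the physical rx queue index alone misses the socket.
naive = redirect(xsks_map, PHYS_QUEUE)

# Option 1: the XDP program translates physical queue -> map slot.
queue_to_slot = {PHYS_QUEUE: START_QUEUE}
translated = redirect(xsks_map, PHYS_QUEUE, translate=queue_to_slot.get)

# Option 2: the socket is inserted at an offset so that its slot equals the
# physical queue index (hypothetical value; the real option is not shown here).
MAP_OFFSET = PHYS_QUEUE - START_QUEUE
offset_map = {START_QUEUE + MAP_OFFSET: "qemu-xsk"}
with_offset = redirect(offset_map, PHYS_QUEUE)

print(naive, translated, with_offset)
```

Either way, whoever fills the map and whoever writes the XDP program have to
agree on the slot, which is the part that seems worth spelling out.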

>> So, erm, what am I missing that makes this worth it (for AF_XDP; I can
>> see how it is useful for other things)? :)
>
> Yeap there are other use cases we've seen from Cilium users as well,
> e.g. running dpdk applications on top of af_xdp in regular k8s Pods.

Yeah, being able to do stuff like that without having to rely on SR-IOV
would be cool, certainly!

-Toke


