From: Jason Wang <jasowang@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Eric Blake <eblake@redhat.com>,
Stefan Hajnoczi <stefanha@redhat.com>,
qemu-devel@nongnu.org
Subject: Re: [PATCH v3] net: add initial support for AF_XDP network backend
Date: Tue, 8 Aug 2023 10:28:19 +0800
Message-ID: <CACGkMEusFonOKcmOY92PVrPF47xLjMfff8TOh3FDCZ9L+0JvoA@mail.gmail.com>
In-Reply-To: <20230804182110.2627049-1-i.maximets@ovn.org>
On Sat, Aug 5, 2023 at 2:20 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack. In essence, the technology is
> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native
> and works with any network interface without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets. Only access to
> the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket. A chunk of userspace memory
> is shared between QEMU and the host kernel. Four ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data. Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into the Tx ring. After transmission, the device
> returns the buffer via the Completion ring. On Rx, the device takes
> a buffer from the pre-populated Fill ring, writes the packet data into
> it and places the buffer into the Rx ring.
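>
> For illustration, a minimal sketch of the Tx path using the libxdp
> ring helpers could look like the following. The function, variable
> names and the free-frame handling are made up for the example and are
> not taken from the actual patch:
>
>     /* Sketch: transmit one packet over an AF_XDP socket and reap
>      * completions; error handling and frame accounting simplified. */
>     #include <stdint.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <xdp/xsk.h>
>
>     static void example_tx(struct xsk_socket *xsk,
>                            struct xsk_ring_prod *tx,
>                            struct xsk_ring_cons *cq,
>                            void *umem_area, uint64_t free_frame,
>                            const void *pkt, uint32_t len)
>     {
>         uint32_t idx;
>
>         /* Reserve one descriptor in the Tx ring. */
>         if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
>             return; /* Ring full; the caller should retry later. */
>         }
>
>         /* Copy the packet into a free umem buffer and describe it. */
>         memcpy(xsk_umem__get_data(umem_area, free_frame), pkt, len);
>         xsk_ring_prod__tx_desc(tx, idx)->addr = free_frame;
>         xsk_ring_prod__tx_desc(tx, idx)->len = len;
>         xsk_ring_prod__submit(tx, 1);
>
>         /* Kick the kernel to start transmission. */
>         sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
>
>         /* Reap transmitted buffers from the Completion ring so that
>          * they can be reused for future packets. */
>         size_t n = xsk_ring_cons__peek(cq, 64, &idx);
>         if (n) {
>             /* xsk_ring_cons__comp_addr(cq, idx + i) yields the
>              * address of each completed buffer for recycling. */
>             xsk_ring_cons__release(cq, n);
>         }
>     }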
>
> The AF_XDP network backend handles the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
>
> Usage example:
>
>     -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> An XDP program bridges the socket with the network interface. It can
> be attached to the interface in two different modes:
>
> 1. skb - this mode should work for any interface and doesn't require
>    driver support, at the cost of lower performance.
>
> 2. native - this mode requires support from the driver and allows
>    bypassing skb allocation in the kernel and potentially using
>    zero-copy while getting packets in and out of userspace.
>
> By default, QEMU will try to use native mode and fall back to skb.
> The mode can be forced via the 'mode' option. To force copying even in
> native mode, use the 'force-copy=on' option. This might be useful if
> there is some issue with the driver.
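>
> For example (reusing the interface name from the usage example above;
> the exact values are illustrative):
>
>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=skb
>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,force-copy=on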
>
> The 'queues=N' option specifies how many device queues should be
> opened. Note that queues that are not opened are still functional
> and can receive traffic, but that traffic will not be delivered to
> QEMU. So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something else
> and the redirection of traffic to the appropriate queues is correctly
> configured at the device level (e.g. with ethtool -N).
> The 'start-queue=M' option can be used to specify the queue id from
> which QEMU should start configuring the 'N' queues. It might also be
> necessary to use this option with certain NICs, e.g. MLX5 NICs. See
> the docs for examples.
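>
> An illustrative multi-queue configuration (the queue counts, vector
> count and the ethtool rule below are made up for the example; with
> virtio-net-pci, multi-queue additionally needs 'mq=on' and enough
> MSI-X vectors, conventionally 2*N+2):
>
>     -device virtio-net-pci,netdev=guest1,mq=on,vectors=10,mac=00:16:35:AF:AA:5C
>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=4,start-queue=0
>
> If the device is shared with other users, a rule like the following
> could steer a specific flow to one of the opened queues:
>
>     ethtool -N ens6f1np1 flow-type tcp4 dst-port 5201 action 2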
>
> In the general case, QEMU will need the CAP_NET_ADMIN and CAP_SYS_ADMIN
> or CAP_BPF capabilities in order to load the default XSK/XDP programs
> onto the network interface and configure BPF maps. It is possible,
> however, to run with no capabilities. For that to work, an external
> process with enough capabilities will need to pre-load the default XSK
> program, create the AF_XDP sockets and pass their file descriptors to
> the QEMU process on startup via the 'sock-fds' option. The network
> backend will need to be configured with 'inhibit=on' to avoid loading
> the program. QEMU will also need 32 MB of locked memory
> (RLIMIT_MEMLOCK) per queue, or CAP_IPC_LOCK.
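>
> As an illustration (the fd numbers and the colon-separated list below
> are hypothetical; an external helper is assumed to have created the
> sockets and passed fds 15 and 16 to QEMU):
>
>     -netdev af-xdp,ifname=ens6f1np1,id=guest1,queues=2,inhibit=on,sock-fds=15:16
>
> Without CAP_IPC_LOCK, RLIMIT_MEMLOCK would have to allow at least
> 64 MB of locked memory for such a two-queue setup.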
>
> There are a few performance challenges with the current network backends.
>
> The first is that they do not support IO threads. This means that the
> data path is handled by the main thread in QEMU and may slow down
> other work or be slowed down by other work. It also means that taking
> advantage of multi-queue is generally not possible today.
>
> Another is that the data path goes through the device emulation code,
> which is not really optimized for performance. The fastest "frontend"
> device is virtio-net, but it is not optimized for heavy traffic
> either, because it expects such use cases to be handled via some
> implementation of vhost (user, kernel, vdpa). In practice, there are
> virtio notifications and RCU lock/unlock on a per-packet basis and
> not very efficient accesses to the guest memory. The communication
> channels between backend and frontend devices also do not allow
> passing more than one packet at a time.
>
> Some of these challenges can be avoided in the future by adding better
> batching into device emulation or by implementing vhost-af-xdp variant.
>
> There are also a few kernel limitations. AF_XDP sockets do not
> support any kind of checksum or segmentation offloading. Buffers
> are limited to a page size (4K), i.e. the MTU is limited. Multi-buffer
> support for AF_XDP is in progress, but not ready yet. Also,
> transmission in all non-zero-copy modes is synchronous, i.e. done in
> a syscall, which doesn't allow high packet rates on virtual
> interfaces.
>
> However, keeping all of these challenges in mind, the current
> implementation of the AF_XDP backend shows decent performance while
> running on top of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via ConnectX-6 Dx cards.
> The network backend is configured to open the NIC directly in native
> mode. The driver supports zero-copy. The NIC is configured to use a
> single queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> for PPS testing.
>
> iperf3 result:
> TCP stream : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> Tx only : 3.4 Mpps
> Rx only : 2.0 Mpps
> L2 FWD Loopback : 1.5 Mpps
>
> In skb mode the same setup shows much lower performance, similar to
> a setup where the pair of physical NICs is replaced with a veth pair:
>
> iperf3 result:
> TCP stream : 9 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> Tx only : 1.2 Mpps
> Rx only : 1.0 Mpps
> L2 FWD Loopback : 0.7 Mpps
>
> Results in skb mode or over the veth pair are close to the results of
> a tap backend with vhost=on and segmentation offloading disabled,
> bridged with a NIC.
>
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>
> Version 3:
>
> - Bumped requirements to libxdp 1.4.0+. With that, removed all
> the conditional compilation parts, since all the needed APIs
> are available in this version of libxdp.
>
> - Also removed the ability to pass the xsks map fd, since the ability
> to just pass socket fds is now always available and it doesn't
> require any capabilities, unlike manipulations with BPF maps.
>
> - Updated documentation to not call out specific vendors, memory
> numbers or specific required capabilities.
>
> - Changed the logic for returning peeked-at but unused descriptors.
>
> - Minor cleanups.
>
Queued this for 8.2.
Thanks