From: "Daniel P. Berrangé" <berrange@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: qemu-devel@nongnu.org, Ilya Maximets <i.maximets@ovn.org>
Subject: Re: [PULL 12/17] net: add initial support for AF_XDP network backend
Date: Fri, 8 Sep 2023 12:48:09 +0100
Message-ID: <ZPsJ+TxDYO24T5Yp@redhat.com>
In-Reply-To: <20230908064507.14596-13-jasowang@redhat.com>

On Fri, Sep 08, 2023 at 02:45:02PM +0800, Jason Wang wrote:
> From: Ilya Maximets <i.maximets@ovn.org>
>
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack. In essence, the technology is
> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native
> and works with any network interfaces without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets. Only access to
> the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket. A chunk of userspace memory
> is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data. Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into Tx ring. After transmission, device will
> return the buffer via Completion ring. On Rx, device will take
> a buffer from a pre-populated Fill ring, write the packet data into
> it and place the buffer into Rx ring.
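
To make the buffer flow described above concrete, here is a toy model
of the four rings. It is only a sketch: the class and method names
(ToyXsk, refill, device_receive and so on) are made up for
illustration and are not the QEMU, kernel or libxdp API. Real rings
are fixed-size producer/consumer arrays in shared memory; plain
deques stand in for them here.

```python
from collections import deque


class ToyXsk:
    """Toy model of AF_XDP buffer flow; hypothetical, not a real API."""

    def __init__(self, num_bufs, buf_size=4096):
        self.buf_size = buf_size
        # Pool of packet buffers shared by "userspace" and the "device".
        self.umem = [bytearray(buf_size) for _ in range(num_bufs)]
        self.free = deque(range(num_bufs))  # buffers owned by userspace
        self.fill = deque()  # userspace -> device: empty buffers for Rx
        self.rx = deque()    # device -> userspace: received packets
        self.tx = deque()    # userspace -> device: packets to send
        self.comp = deque()  # device -> userspace: transmitted buffers

    def _store(self, idx, packet):
        if len(packet) > self.buf_size:
            raise ValueError("packet does not fit into one buffer")
        self.umem[idx][:len(packet)] = packet

    def refill(self, count):
        """Userspace: pre-populate the Fill ring with free buffers."""
        for _ in range(min(count, len(self.free))):
            self.fill.append(self.free.popleft())

    def device_receive(self, packet):
        """Device: take a Fill buffer, write the data, place it on Rx."""
        if not self.fill:
            return False  # nothing to receive into, packet is dropped
        idx = self.fill.popleft()
        self._store(idx, packet)
        self.rx.append((idx, len(packet)))
        return True

    def recv(self):
        """Userspace: consume one Rx entry and take the buffer back."""
        if not self.rx:
            return None
        idx, length = self.rx.popleft()
        data = bytes(self.umem[idx][:length])
        self.free.append(idx)
        return data

    def send(self, packet):
        """Userspace: allocate a buffer, copy the packet, place on Tx."""
        if not self.free:
            return False
        idx = self.free.popleft()
        self._store(idx, packet)
        self.tx.append((idx, len(packet)))
        return True

    def device_transmit(self):
        """Device: send all Tx entries, return buffers via Completion."""
        sent = []
        while self.tx:
            idx, length = self.tx.popleft()
            sent.append(bytes(self.umem[idx][:length]))
            self.comp.append(idx)
        return sent

    def reap_completions(self):
        """Userspace: reclaim buffers the device has finished sending."""
        while self.comp:
            self.free.append(self.comp.popleft())
```

The point of the model is ownership: a buffer index is always in
exactly one place (free pool or one of the rings), which is why the
Fill ring has to be topped up before anything can be received.
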
>
> AF_XDP network backend takes on the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
>
> Usage example:
>
> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> XDP program bridges the socket with a network interface. It can be
> attached to the interface in 2 different modes:
>
> 1. skb - this mode should work for any interface and doesn't require
> driver support. With a caveat of lower performance.
>
> 2. native - this does require support from the driver and allows to
> bypass skb allocation in the kernel and potentially use
> zero-copy while getting packets in/out userspace.
>
> By default, QEMU will try to use native mode and fall back to skb.
> Mode can be forced via 'mode' option. To force 'copy' even in native
> mode, use 'force-copy=on' option. This might be useful if there is
> some issue with the driver.
>
> Option 'queues=N' allows to specify how many device queues should
> be open. Note that all the queues that are not open are still
> functional and can receive traffic, but it will not be delivered to
> QEMU. So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something
> else and the traffic re-direction to appropriate queues is correctly
> configured on a device level (e.g. with ethtool -N).
> 'start-queue=M' option can be used to specify from which queue id
> QEMU should start configuring 'N' queues. It might also be necessary
> to use this option with certain NICs, e.g. MLX5 NICs. See the docs
> for examples.
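
For illustration, the relationship between 'queues=N' and
'start-queue=M' can be sketched as below. The helper names are
hypothetical, and the assumption that the first queue id is 0 when
'start-queue' is not given is mine, not something stated in the
commit message. Only option names that appear in the text above are
used.

```python
def queue_ids(queues, start_queue=0):
    """Return the device queue ids covered by queues=N, start-queue=M."""
    if queues < 1 or start_queue < 0:
        raise ValueError("queues must be >= 1 and start_queue >= 0")
    return list(range(start_queue, start_queue + queues))


def netdev_arg(ifname, netdev_id, queues, start_queue=0, mode=None):
    """Build the value passed to -netdev for an af-xdp backend."""
    parts = ["af-xdp", f"ifname={ifname}", f"id={netdev_id}"]
    if mode is not None:
        parts.append(f"mode={mode}")
    parts.append(f"queues={queues}")
    # Assumption: start-queue may be omitted when it is 0.
    if start_queue:
        parts.append(f"start-queue={start_queue}")
    return ",".join(parts)
```

With these helpers, netdev_arg("ens6f1np1", "guest1", 1, mode="native")
reproduces the -netdev value from the usage example, and
queue_ids(4, 2) covers queue ids 2 through 5.
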
>
> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> or CAP_BPF capabilities in order to load default XSK/XDP programs to
> the network interface and configure BPF maps. It is possible, however,
> to run with no capabilities. For that to work, an external process
> with enough capabilities will need to pre-load default XSK program,
> create AF_XDP sockets and pass their file descriptors to QEMU process
> on startup via 'sock-fds' option. Network backend will need to be
> configured with 'inhibit=on' to avoid loading of the program.
> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> or CAP_IPC_LOCK.
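
As a rough aid for the locked-memory requirement above, a launcher
script could check RLIMIT_MEMLOCK before starting QEMU. This is a
minimal sketch with hypothetical helper names. It assumes "32 MB"
means 32 MiB, and it does not account for CAP_IPC_LOCK or for any
other locked memory the process may need.

```python
import resource

# 32 MB per queue, from the commit message (assumed here to mean MiB).
MEMLOCK_PER_QUEUE = 32 * 1024 * 1024


def required_memlock(queues):
    """Locked memory in bytes needed for the given number of queues."""
    if queues < 1:
        raise ValueError("queues must be >= 1")
    return queues * MEMLOCK_PER_QUEUE


def memlock_sufficient(queues, soft_limit=None):
    """Check the soft RLIMIT_MEMLOCK against the per-queue requirement."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required_memlock(queues)
```

Calling memlock_sufficient(queues) with no explicit limit reads the
current soft limit of the calling process, so it needs to run in the
same environment that will launch QEMU.
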
>
> There are a few performance challenges with the current network backends.
>
> First is that they do not support IO threads. This means that data
> path is handled by the main thread in QEMU and may slow down other
> work or may be slowed down by some other work. This also means that
> taking advantage of multi-queue is generally not possible today.
>
> Another thing is that data path is going through the device emulation
> code, which is not really optimized for performance. The fastest
> "frontend" device is virtio-net. But it's not optimized for heavy
> traffic either, because it expects such use-cases to be handled via
> some implementation of vhost (user, kernel, vdpa). In practice, we
> have virtio notifications and rcu lock/unlock on a per-packet basis
> and not very efficient accesses to the guest memory. Communication
> channels between backend and frontend devices do not allow passing
> more than one packet at a time as well.
>
> Some of these challenges can be avoided in the future by adding better
> batching into device emulation or by implementing vhost-af-xdp variant.
>
> There are also a few kernel limitations. AF_XDP sockets do not
> support any kinds of checksum or segmentation offloading. Buffers
> are limited to a page size (4K), i.e. MTU is limited. Multi-buffer
> support implementation for AF_XDP is in progress, but not ready yet.
> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> done in a syscall. That doesn't allow high packet rates on virtual
> interfaces.
>
> However, keeping in mind all of these challenges, current implementation
> of the AF_XDP backend shows a decent performance while running on top
> of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> Network backend is configured to open the NIC directly in native mode.
> The driver supports zero-copy. NIC is configured to use 1 queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> for PPS testing.
>
> iperf3 result:
> TCP stream : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> Tx only : 3.4 Mpps
> Rx only : 2.0 Mpps
> L2 FWD Loopback : 1.5 Mpps
>
> In skb mode the same setup shows much lower performance, similar to
> the setup where pair of physical NICs is replaced with veth pair:
>
> iperf3 result:
> TCP stream : 9 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> Tx only : 1.2 Mpps
> Rx only : 1.0 Mpps
> L2 FWD Loopback : 0.7 Mpps
>
> Results in skb mode or over the veth are close to results of a tap
> backend with vhost=on and disabled segmentation offloading bridged
> with a NIC.
> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
> index 02262bc..811a7fe 100644
> --- a/tests/docker/dockerfiles/debian-amd64.docker
> +++ b/tests/docker/dockerfiles/debian-amd64.docker
> @@ -98,6 +98,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
> libvirglrenderer-dev \
> libvte-2.91-dev \
> libxen-dev \
> + libxdp-dev \
> libzstd-dev \
> llvm \
> locales \

As the comment at the top of the file states - this is auto-generated
by lcitool and must not be hand-edited like this.

Check out docs/devel/testing.rst, which has guidance on the process
for adding new package deps with lcitool/libvirt-ci.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|