From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Björn Töpel" <bjorn.topel@gmail.com>
Cc: magnus.karlsson@intel.com, alexander.h.duyck@intel.com,
	alexander.duyck@gmail.com, john.fastabend@gmail.com, ast@fb.com,
	brouer@redhat.com, willemdebruijn.kernel@gmail.com,
	daniel@iogearbox.net, netdev@vger.kernel.org,
	"Björn Töpel" <bjorn.topel@intel.com>,
	michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
	anjali.singhai@intel.com, qi.z.zhang@intel.com
Subject: Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
Date: Tue, 24 Apr 2018 02:22:24 +0300
Message-ID: <20180424022124-mutt-send-email-mst@kernel.org>
In-Reply-To: <20180423135619.7179-1-bjorn.topel@gmail.com>

On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work
> has already been accepted. We will publish our zero-copy support for
> RX and TX on top of his patch sets at a later point in time.
> 
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
> 
> This new dedicated packet buffer area is called a UMEM. It consists of a
> number of equally sized frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
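
To make the frame id scheme above concrete, here is a minimal user-space
sketch of how a frame id taken from a descriptor could be turned into an
address inside the UMEM. The struct and function names are hypothetical
and are not the uapi from this series; malloc stands in for whichever
allocation method the application prefers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space view of a UMEM: one contiguous area split
 * into equally sized frames. Names are illustrative, not the uapi. */
struct umem_sketch {
	uint8_t *area;		/* start of the memory area */
	size_t frame_size;	/* size of every frame in bytes */
	uint32_t nframes;	/* number of frames in the area */
};

/* Allocate the area with malloc; mmap or huge pages would work too.
 * Returns 0 on success and -1 if the allocation fails. */
int umem_sketch_init(struct umem_sketch *u, uint32_t nframes,
		     size_t frame_size)
{
	assert(nframes > 0 && frame_size > 0);
	u->area = malloc((size_t)nframes * frame_size);
	if (!u->area)
		return -1;
	u->frame_size = frame_size;
	u->nframes = nframes;
	return 0;
}

/* Translate a frame id, as carried in a descriptor, to an address.
 * Returns NULL for an id that lies outside the UMEM. */
uint8_t *umem_sketch_frame(const struct umem_sketch *u, uint32_t id)
{
	if (id >= u->nframes)
		return NULL;
	return u->area + (size_t)id * u->frame_size;
}

void umem_sketch_free(struct umem_sketch *u)
{
	free(u->area);
	u->area = NULL;
}
```

Because a descriptor carries only the id, the same frame can be handed
from the RX queue to the TX queue without touching the packet data.
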
> 
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this RFC, all
> packet data is copied out to user-space.
> 
> A new feature in this RFC is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
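
The single-producer / single-consumer constraint can be illustrated with
a small ring of frame ids. This is only a sketch under my own
assumptions (C11 atomics, a fixed power-of-two size, made-up names); it
is not the ring layout used in net/xdp/xsk_queue.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SKETCH_SIZE 8u	/* must be a power of two */

/* Illustrative ring of frame ids, not the layout used by the patches. */
struct ring_sketch {
	_Atomic uint32_t producer;	/* written only by the producer */
	_Atomic uint32_t consumer;	/* written only by the consumer */
	uint32_t ids[RING_SKETCH_SIZE];
};

/* Producer side: returns false when the ring is full. */
bool ring_sketch_push(struct ring_sketch *r, uint32_t id)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	assert((RING_SKETCH_SIZE & (RING_SKETCH_SIZE - 1)) == 0);
	if (prod - cons == RING_SKETCH_SIZE)
		return false;
	r->ids[prod & (RING_SKETCH_SIZE - 1)] = id;
	/* Publish the entry before making the new index visible. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_sketch_pop(struct ring_sketch *r, uint32_t *id)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return false;
	*id = r->ids[cons & (RING_SKETCH_SIZE - 1)];
	/* Release the slot only after the entry has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}
```

Each index has exactly one writer, which is what lets the ring work
without locks or compare-and-swap. A second producer on the same ring
would race on the producer index, which is why each process needs its
own RX and TX queues.
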
> 
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
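
The drop rules described above can be modelled in a few lines of plain
C. This is a user-space simulation with hypothetical types, not the
kernel/bpf/xskmap.c code; treating an out-of-range index as a drop is my
own assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XSKMAP_SKETCH_ENTRIES 4u

/* Stand-in for a bound XSK: only the binding is modelled. */
struct xsk_sketch {
	int ifindex;		/* device the socket was bound to */
	uint32_t queue_id;	/* queue id given at bind time */
};

/* Stand-in for the XSKMAP: an array of socket pointers, NULL if empty. */
struct xskmap_sketch {
	const struct xsk_sketch *slots[XSKMAP_SKETCH_ENTRIES];
};

enum verdict_sketch { SKETCH_DROP, SKETCH_DELIVER };

/* Decide what happens to a packet that arrived on (ifindex, queue_id)
 * and was redirected by the XDP program to map index idx. */
enum verdict_sketch xskmap_sketch_redirect(const struct xskmap_sketch *m,
					   uint32_t idx, int ifindex,
					   uint32_t queue_id)
{
	const struct xsk_sketch *xs;

	assert(m != NULL);
	if (idx >= XSKMAP_SKETCH_ENTRIES)
		return SKETCH_DROP;	/* assumption: out of range is dropped */
	xs = m->slots[idx];
	if (!xs)
		return SKETCH_DROP;	/* the map is empty at that index */
	if (xs->ifindex != ifindex || xs->queue_id != queue_id)
		return SKETCH_DROP;	/* bound to another device or queue */
	return SKETCH_DELIVER;
}
```

With a socket bound to device 3, queue 16 and stored at index 1, only a
packet arriving on that device and queue and redirected to index 1 is
delivered; every other combination is dropped.
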
> 
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
> together with the generic XDP support and copies out the data to user
> space, and it serves as a fallback mode that works for any network
> device. On the other hand, if the driver has support for XDP, it will
> be used by the AF_XDP code to provide better performance, but there is
> still a copy of the data into user space.
> 
> There is an xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
> for this:
> 
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
> 
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
> 
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
> 
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
> 
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments: one for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial hardware packet
> generator running at full 40 Gbit/s line rate.
> 
> AF_XDP performance in Mpps, 64 byte packets. Results from RFC V2 in parentheses.
> Benchmark   XDP_SKB     XDP_DRV
> rxdrop      2.9 (3.0)   9.4 (9.3)
> txpush      2.5 (2.2)   NA*
> l2fwd       1.9 (1.7)   2.4 (2.4)   (TX using XDP_SKB in both cases)
> 
> AF_XDP performance in Mpps, 1500 byte packets:
> Benchmark   XDP_SKB     XDP_DRV
> rxdrop      2.1 (2.2)   3.3 (3.1)
> l2fwd       1.4 (1.1)   1.8 (1.7)   (TX using XDP_SKB in both cases)
> 
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this RFC. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
> 
> XDP performance on our system as a baseline:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
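
For context, the theoretical line rate can be worked out from the link
speed. The sketch below assumes that the quoted packet sizes are
Ethernet frame sizes including the FCS, and adds the usual 20 bytes of
per-frame wire overhead; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: 7 byte preamble, 1 byte start-of-frame
 * delimiter and 12 byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Theoretical packets per second for a link speed in bit/s and a frame
 * size in bytes (assumed to include the 4 byte FCS). The result is
 * rounded down to a whole packet. */
uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / (8u * ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD_BYTES));
}
```

Under these assumptions, 40 Gbit/s gives about 59.5 Mpps for 64 byte
frames and about 3.29 Mpps for 1500 byte frames. The 1500 byte XDP
baseline above is therefore at line rate, while the 64 byte baseline of
about 32.9 Mpps is not.
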
> 
> Changes from RFC V2:
> 
> * Optimizations and simplifications to the ring structures inspired by
>   ptr_ring.h 
> * Renamed XDP_[RX|TX]_QUEUE to XDP_[RX|TX]_RING in the uapi to be
>   consistent with AF_PACKET
> * Support for only having an RX queue or a TX queue defined
> * Some bug fixes and code cleanup
> 
> The structure of the patch set is as follows:
> 
> Patches 1-2: Basic socket and umem plumbing 
> Patches 3-10: RX support together with the new XSKMAP
> Patches 11-14: TX support
> Patch 15: Sample application
> 
> We based this patch set on bpf-next commit fbcf93ebcaef ("bpf: btf:
> Clean up btf.h in uapi")
> 
> Questions:
> 
> * How to deal with cache alignment for uapi when different
>   architectures can have different cache line sizes? We have just
>   aligned it to 64 bytes for now, which works for many popular
>   architectures, but not all. Please advise.
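
To illustrate the question, a fixed 64 byte alignment can be expressed
and checked at compile time as below. The struct is hypothetical and
only shows the idea of keeping the producer and consumer indices on
separate cache lines; it is not the layout in if_xdp.h.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64 bytes, the value the cover letter says is used for now. */
#define SKETCH_CACHELINE 64

/* Hypothetical ring header; field names are illustrative. */
struct ring_hdr_sketch {
	alignas(SKETCH_CACHELINE) uint32_t producer;
	alignas(SKETCH_CACHELINE) uint32_t consumer;
};

/* Compile-time checks that the layout seen by user space is fixed. */
static_assert(offsetof(struct ring_hdr_sketch, consumer) == SKETCH_CACHELINE,
	      "consumer must start on its own cache line");
static_assert(sizeof(struct ring_hdr_sketch) == 2 * SKETCH_CACHELINE,
	      "header must span exactly two cache lines");
```

A constant like this keeps the layout identical on every architecture,
but on a machine with 128 byte cache lines both indices would share one
line, so false sharing would still be possible there.
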
> 
> To do:
> 
> * Optimize performance
> 
> * Kernel selftest
> 
> Post-series plan:
> 
> * Loadable kernel module support for AF_XDP would be nice. Unclear how
>   to achieve this though since our XDP code depends on net/core.
> 
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
> 
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for tcpdump-like functionality.
> 
> * And of course getting to zero-copy support in small increments. 
> 
> Thanks: Björn and Magnus
> 
> Björn Töpel (8):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xdp: introduce xdp_return_buff API
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
> 
> Magnus Karlsson (7):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application for AF_XDP sockets
> 
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  46 ++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 286 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  34 +-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 256 ++++++++++
>  net/xdp/xdp_umem.h                  |  65 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 704 +++++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 245 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 947 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  32 files changed, 2945 insertions(+), 35 deletions(-)
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c

Is there a chance of adding Documentation/networking/af_xdp.txt?


> 
> -- 
> 2.14.1
