Re: [RFC PATCH 00/24] Introducing AF_XDP support

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: William Tu <u9012063@gmail.com>
Cc: "Björn Töpel" <bjorn.topel@gmail.com>,
	magnus.karlsson@intel.com,
	"Alexander Duyck" <alexander.h.duyck@intel.com>,
	"Alexander Duyck" <alexander.duyck@gmail.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"Alexei Starovoitov" <ast@fb.com>,
	willemdebruijn.kernel@gmail.com,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Linux Kernel Network Developers" <netdev@vger.kernel.org>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
	anjali.singhai@intel.com, jeffrey.b.shaw@intel.com,
	ferruh.yigit@intel.com, qi.z.zhang@intel.com, brouer@redhat.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
Date: Mon, 26 Mar 2018 18:38:10 +0200	[thread overview]
Message-ID: <20180326183810.2ef4e29f@redhat.com> (raw)
In-Reply-To: <CALDO+SYK_4RfRBe7Z0n00wkPhKc1HwGmcGT34W0tVMVZFLqYpw@mail.gmail.com>


On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:

> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This RFC introduces a new address family called AF_XDP that is
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 20x compared to V2 and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> > from November last year. The feedback from that RFC submission and the
> > presentation at NetdevConf in Seoul was to create a new address family
> > instead of building on top of AF_PACKET. AF_XDP is this new address
> > family.
> >
> > The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> > level is that TX and RX descriptors are separated from packet
> > buffers. An RX or TX descriptor points to a data buffer in a packet
> > buffer area. RX and TX can share the same packet buffer so that a
> > packet does not have to be copied between RX and TX. Moreover, if a
> > packet needs to be kept for a while due to a possible retransmit, then
> > the descriptor that points to that packet buffer can be changed to
> > point to another buffer and reused right away. This again avoids
> > copying data.
> >
> > The RX and TX descriptor rings are registered with the setsockopts
> > XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> > area is allocated by user space and registered with the kernel using
> > the new XDP_MEM_REG setsockopt. All these three areas are shared
> > between user space and kernel space. The socket is then bound with a
> > bind() call to a device and a specific queue id on that device, and it
> > is not until bind is completed that traffic starts to flow.
> >
> > An XDP program can be loaded to direct part of the traffic on that
> > device and queue id to user space through a new redirect action in an
> > XDP program called bpf_xdpsk_redirect that redirects a packet up to
> > the socket in user space. All the other XDP actions work just as
> > before. Note that the current RFC requires the user to load an XDP
> > program to get any traffic to user space (for example all traffic to
> > user space with the one-liner program "return
> > bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> > this requirement and sends all traffic from a queue to user space if
> > an AF_XDP socket is bound to it.
> >
> > AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> > XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> > is no specific mode called XDP_DRV_ZC). If the driver does not have
> > support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> > program, XDP_SKB mode is employed that uses SKBs together with the
> > generic XDP support and copies out the data to user space. A fallback
> > mode that works for any network device. On the other hand, if the
> > driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> > ndo_xdp_flush), these NDOs, without any modifications, will be used by
> > the AF_XDP code to provide better performance, but there is still a
> > copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> > driver support with the zero-copy user space allocator that provides
> > even better performance. In this mode, the networking HW (or SW driver
> > if it is a virtual driver like veth) DMAs/puts packets straight into
> > the packet buffer that is shared between user space and kernel
> > space. The RX and TX descriptor queues of the networking HW are NOT
> > shared to user space. Only the kernel can read and write these and it
> > is the kernel driver's responsibility to translate these HW specific
> > descriptors to the HW agnostic ones in the virtual descriptor rings
> > that user space sees. This way, a malicious user space program cannot
> > mess with the networking HW. This mode though requires some extensions
> > to XDP.
> >
> > To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> > buffer pool concept so that the same XDP driver code can be used for
> > buffers allocated using the page allocator (XDP_DRV), the user-space
> > zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> > allocator/cache/recycling mechanism. The ndo_bpf call has also been
> > extended with two commands for registering and unregistering an XSK
> > socket and is in the RX case mainly used to communicate some
> > information about the user-space buffer pool to the driver.
> >
> > For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> > but we run into problems with this (further discussion in the
> > challenges section) and had to introduce a new NDO called
> > ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> > and an explicit queue id that packets should be sent out on. In
> > contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> > sent from the xdp socket (associated with the dev and queue
> > combination that was provided with the NDO call) using a callback
> > (get_tx_packet), and when they have been transmitted it uses another
> > callback (tx_completion) to signal completion of packets. These
> > callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> > command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> > and thus does not clash with the XDP_REDIRECT use of
> > ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> > (without ZC) is currently not supported by TX. Please have a look at
> > the challenges section for further discussions.
> >
> > The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> > so the user needs to steer the traffic to the zero-copy enabled queue
> > pair. Which queue to use, is up to the user.
> >
> > For an untrusted application, HW packet steering to a specific queue
> > pair (the one associated with the application) is a requirement, as
> > the application would otherwise be able to see other user space
> > processes' packets. If the HW cannot support the required packet
> > steering, XDP_DRV or XDP_SKB mode have to be used as they do not
> > expose the NIC's packet buffer into user space as the packets are
> > copied into user space from the NIC's packet buffer in the kernel.
> >
> > There is a xdpsock benchmarking/test application included. Say that
> > you would like your UDP traffic from port 4242 to end up in queue 16,
> > that we will enable AF_XDP on. Here, we use ethtool for this:
> >
> >       ethtool -N p3p2 rx-flow-hash udp4 fn
> >       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
> >           action 16
> >
> > Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
> >
> >       samples/bpf/xdpsock -i p3p2 -q 16 -l -N
> >
> > For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> > can be displayed with "-h", as usual.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for TR/RX and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> > Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > byte packets, generated by commercial packet generator HW that is
> > generating packets at full 40 Gbit/s line rate.
> >
> > XDP baseline numbers without this RFC:
> > xdp_rxq_info --action XDP_DROP 31.3 Mpps
> > xdp_rxq_info --action XDP_TX   16.7 Mpps
> >
> > XDP performance with this RFC i.e. with the buffer allocator:
> > XDP_DROP 21.0 Mpps
> > XDP_TX   11.9 Mpps
> >
> > AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> > Benchmark   V2     V3     V4     V4+ZC
> > rxdrop      0.67   0.73   0.74   33.7
> > txpush      0.98   0.98   0.91   19.6
> > l2fwd       0.66   0.71   0.67   15.5
> >
> > AF_XDP performance:
> > Benchmark   XDP_SKB   XDP_DRV    XDP_DRV_ZC (all in Mpps)
> > rxdrop      3.3        11.6         16.9
> > txpush      2.2         NA*         21.8
> > l2fwd       1.7         NA*         10.4
> >  
> 
> Hi,
> I also did an evaluation of AF_XDP, however the performance isn't as
> good as above.
> I'd like to share the result and see if there are some tuning suggestions.
> 
> System:
> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode

Hmmm, why is X540-AT2 not able to use XDP natively?

> AF_XDP performance:
> Benchmark   XDP_SKB
> rxdrop      1.27 Mpps
> txpush      0.99 Mpps
> l2fwd        0.85 Mpps

Definitely too low...

What is the performance if you drop packets via iptables?

Command:
 $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP

> NIC configuration:
> the command
> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
> doesn't work on my ixgbe driver, so I use ntuple:
> 
> ethtool -K enp10s0f0 ntuple on
> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
> then
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> ./xdpsock -i enp10s0f0 -r -S --queue=1
> 
> I also take a look at perf result:
> For rxdrop:
> 86.56%  xdpsock xdpsock           [.] main
>   9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
>   4.23%  xdpsock  xdpsock         [.] xq_enq

It looks very strange that you see non-maskable interrupt's (NMI) being
this high...

 
> For l2fwd:
>  20.81%  xdpsock xdpsock             [.] main
>  10.64%  xdpsock [kernel.vmlinux]    [k] clflush_cache_range

Oh, clflush_cache_range is being called!
Do your system use an IOMMU ?

>   8.46%  xdpsock  [kernel.vmlinux]    [k] xsk_sendmsg
>   6.72%  xdpsock  [kernel.vmlinux]    [k] skb_set_owner_w
>   5.89%  xdpsock  [kernel.vmlinux]    [k] __domain_mapping
>   5.74%  xdpsock  [kernel.vmlinux]    [k] alloc_skb_with_frags
>   4.62%  xdpsock  [kernel.vmlinux]    [k] netif_skb_features
>   3.96%  xdpsock  [kernel.vmlinux]    [k] ___slab_alloc
>   3.18%  xdpsock  [kernel.vmlinux]    [k] nmi

Again high count for NMI ?!?

Maybe you just forgot to tell perf that you want it to decode the
bpf_prog correctly?

https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols

Enable via:
 $ sysctl net/core/bpf_jit_kallsyms=1

And use perf report (while BPF is STILL LOADED):

 $ perf report --kallsyms=/proc/kallsyms

E.g. for emailing this you can use this command:

 $ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
 

> I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
> I wonder in XDP_SKB mode, does the driver make performance difference?
> Or my cpu (E5-2440 v2 @ 1.90GHz) is too old?

I suspect some setup issue on your system.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

next prev parent reply	other threads:[~2018-03-26 16:38 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
2018-02-07 16:00   ` Willem de Bruijn
2018-02-07 21:39     ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
2018-02-05 13:42   ` Jesper Dangaard Brouer
2018-02-07 21:11     ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 07/24] xsk: introduce Tx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 09/24] samples/bpf: added xdpsock program Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 13/24] i40e: introduce external allocator support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 14/24] i40e: implemented page recycling buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 15/24] i40e: start using " Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementaion Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 17/24] xsk: introduce xsk_buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 19/24] xsk: add support for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 20/24] xsk: add support for zero copy Tx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 22/24] i40e: introduced a clean_tx callback function Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 23/24] i40e: introduced Tx completion callbacks Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 24/24] i40e: Tx support for zero copy allocator Björn Töpel
2018-02-01 16:42 ` [RFC PATCH 00/24] Introducing AF_XDP support Jesper Dangaard Brouer
2018-02-02 10:31 ` Jesper Dangaard Brouer
2018-02-05 15:05 ` Björn Töpel
2018-02-07 15:54   ` Willem de Bruijn
2018-02-07 21:28     ` Björn Töpel
2018-02-08 23:16       ` Willem de Bruijn
2018-02-07 17:59 ` Tom Herbert
2018-02-07 21:38   ` Björn Töpel
2018-03-26 16:06 ` William Tu
2018-03-26 16:38   ` Jesper Dangaard Brouer [this message]
2018-03-26 21:58     ` William Tu
2018-03-27  6:09       ` Björn Töpel
2018-03-27  9:37       ` Jesper Dangaard Brouer
2018-03-28  0:06         ` William Tu
2018-03-28  8:01           ` Jesper Dangaard Brouer
2018-03-28 15:05             ` William Tu
2018-03-26 22:54     ` Tushar Dave
2018-03-26 23:03       ` Alexander Duyck
2018-03-26 23:20         ` Tushar Dave
2018-03-28  0:49           ` William Tu
2018-03-27  6:30         ` Björn Töpel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180326183810.2ef4e29f@redhat.com \
    --to=brouer@redhat.com \
    --cc=alexander.duyck@gmail.com \
    --cc=alexander.h.duyck@intel.com \
    --cc=anjali.singhai@intel.com \
    --cc=ast@fb.com \
    --cc=bjorn.topel@gmail.com \
    --cc=bjorn.topel@intel.com \
    --cc=daniel@iogearbox.net \
    --cc=ferruh.yigit@intel.com \
    --cc=jeffrey.b.shaw@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=john.fastabend@gmail.com \
    --cc=magnus.karlsson@intel.com \
    --cc=michael.lundkvist@ericsson.com \
    --cc=netdev@vger.kernel.org \
    --cc=qi.z.zhang@intel.com \
    --cc=u9012063@gmail.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.