From: Jesper Dangaard Brouer <brouer@redhat.com>
To: William Tu <u9012063@gmail.com>
Cc: "Björn Töpel" <bjorn.topel@gmail.com>,
magnus.karlsson@intel.com,
"Alexander Duyck" <alexander.h.duyck@intel.com>,
"Alexander Duyck" <alexander.duyck@gmail.com>,
"John Fastabend" <john.fastabend@gmail.com>,
"Alexei Starovoitov" <ast@fb.com>,
willemdebruijn.kernel@gmail.com,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Linux Kernel Network Developers" <netdev@vger.kernel.org>,
"Björn Töpel" <bjorn.topel@intel.com>,
michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
anjali.singhai@intel.com, jeffrey.b.shaw@intel.com,
ferruh.yigit@intel.com, qi.z.zhang@intel.com, brouer@redhat.com
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support
Date: Mon, 26 Mar 2018 18:38:10 +0200
Message-ID: <20180326183810.2ef4e29f@redhat.com>
In-Reply-To: <CALDO+SYK_4RfRBe7Z0n00wkPhKc1HwGmcGT34W0tVMVZFLqYpw@mail.gmail.com>
On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012063@gmail.com> wrote:
> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This RFC introduces a new address family called AF_XDP that is
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 20x compared to V2 and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> > from November last year. The feedback from that RFC submission and the
> > presentation at NetdevConf in Seoul was to create a new address family
> > instead of building on top of AF_PACKET. AF_XDP is this new address
> > family.
> >
> > The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> > level is that TX and RX descriptors are separated from packet
> > buffers. An RX or TX descriptor points to a data buffer in a packet
> > buffer area. RX and TX can share the same packet buffer so that a
> > packet does not have to be copied between RX and TX. Moreover, if a
> > packet needs to be kept for a while due to a possible retransmit, then
> > the descriptor that points to that packet buffer can be changed to
> > point to another buffer and reused right away. This again avoids
> > copying data.
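
A concrete sketch of the descriptor layout described above (the field
names here are illustrative only, not the exact structs from the patches):

  struct xdp_desc {
          __u32 idx;    /* frame index into the packet buffer area */
          __u32 len;    /* length of the packet data in that frame */
          __u16 offset; /* data offset within the frame */
  };

The descriptor only references a frame in the shared packet buffer area,
so ownership of a frame can move between RX, TX and user space without
the data itself being copied.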
> >
> > The RX and TX descriptor rings are registered with the setsockopts
> > XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> > area is allocated by user space and registered with the kernel using
> > the new XDP_MEM_REG setsockopt. All these three areas are shared
> > between user space and kernel space. The socket is then bound with a
> > bind() call to a device and a specific queue id on that device, and it
> > is not until bind is completed that traffic starts to flow.
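
For anyone skimming, the setup flow above boils down to roughly the
following (the struct and constant names are taken from this RFC and may
change in later revisions; treat it as a sketch, not the final API):

  int fd = socket(AF_XDP, SOCK_RAW, 0);

  /* Register the user-space packet buffer area with the kernel. */
  struct xdp_mr_req mr = {
          .addr       = (unsigned long)bufs,
          .len        = NUM_FRAMES * FRAME_SIZE,
          .frame_size = FRAME_SIZE,
  };
  setsockopt(fd, SOL_XDP, XDP_MEM_REG, &mr, sizeof(mr));

  /* Create the RX and TX descriptor rings. */
  int ndescs = NUM_DESCS;
  setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs));
  setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs));

  /* Bind to a device + queue id; traffic only starts flowing after this. */
  struct sockaddr_xdp addr = {
          .sxdp_family   = AF_XDP,
          .sxdp_ifindex  = ifindex,
          .sxdp_queue_id = queue_id,
  };
  bind(fd, (struct sockaddr *)&addr, sizeof(addr));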
> >
> > An XDP program can be loaded to direct part of the traffic on that
> > device and queue id to user space through a new redirect action in an
> > XDP program called bpf_xdpsk_redirect that redirects a packet up to
> > the socket in user space. All the other XDP actions work just as
> > before. Note that the current RFC requires the user to load an XDP
> > program to get any traffic to user space (for example all traffic to
> > user space with the one-liner program "return
> > bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> > this requirement and sends all traffic from a queue to user space if
> > an AF_XDP socket is bound to it.
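
The one-liner mentioned above, written out as a complete program (the
helper name is from this RFC; the SEC()/licence boilerplate is the usual
samples/bpf style and is assumed here):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("xdp")
  int xdp_sock_prog(struct xdp_md *ctx)
  {
          /* Redirect every packet on this queue to the bound AF_XDP socket. */
          return bpf_xdpsk_redirect();
  }

  char _license[] SEC("license") = "GPL";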
> >
> > AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> > XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> > is no specific mode called XDP_DRV_ZC). If the driver does not have
> > support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> > program, XDP_SKB mode is employed. It uses SKBs together with the
> > generic XDP support and copies the data out to user space; this is a
> > fallback mode that works for any network device. On the other hand, if the
> > driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> > ndo_xdp_flush), these NDOs, without any modifications, will be used by
> > the AF_XDP code to provide better performance, but there is still a
> > copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> > driver support with the zero-copy user space allocator that provides
> > even better performance. In this mode, the networking HW (or SW driver
> > if it is a virtual driver like veth) DMAs/puts packets straight into
> > the packet buffer that is shared between user space and kernel
> > space. The RX and TX descriptor queues of the networking HW are NOT
> > shared to user space. Only the kernel can read and write these and it
> > is the kernel driver's responsibility to translate these HW specific
> > descriptors to the HW agnostic ones in the virtual descriptor rings
> > that user space sees. This way, a malicious user space program cannot
> > mess with the networking HW. This mode though requires some extensions
> > to XDP.
> >
> > To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> > buffer pool concept so that the same XDP driver code can be used for
> > buffers allocated using the page allocator (XDP_DRV), the user-space
> > zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> > allocator/cache/recycling mechanism. The ndo_bpf call has also been
> > extended with two commands for registering and unregistering an XSK
> > socket and is in the RX case mainly used to communicate some
> > information about the user-space buffer pool to the driver.
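
In other words, the ndo_bpf command enum grows two new entries, roughly
(names as used in the patch titles below):

  XDP_REGISTER_XSK,   /* hand the XSK socket / buffer pool info to the driver */
  XDP_UNREGISTER_XSK, /* tear the association down again */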
> >
> > For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> > but we ran into problems with this (further discussion in the
> > challenges section) and had to introduce a new NDO called
> > ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> > and an explicit queue id that packets should be sent out on. In
> > contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> > sent from the xdp socket (associated with the dev and queue
> > combination that was provided with the NDO call) using a callback
> > (get_tx_packet), and when they have been transmitted it uses another
> > callback (tx_completion) to signal completion of packets. These
> > callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> > command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> > and thus does not clash with the XDP_REDIRECT use of
> > ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> > (without ZC) is currently not supported for TX. Please have a look at
> > the challenges section for further discussions.
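
As I read the description, the rough shape of the new TX hook is (the
signatures below are my guess, not copied from the patches):

  /* Asynchronous: kick TX for the XDP socket bound to (dev, queue_id). */
  int (*ndo_xdp_xmit_xsk)(struct net_device *dev, u32 queue_id);

  /* Registered via ndo_bpf(XDP_REGISTER_XSK): the driver pulls frames to
   * send with get_tx_packet() and reports finished frames via
   * tx_completion(). */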
> >
> > The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> > so the user needs to steer the traffic to the zero-copy enabled queue
> > pair. Which queue to use is up to the user.
> >
> > For an untrusted application, HW packet steering to a specific queue
> > pair (the one associated with the application) is a requirement, as
> > the application would otherwise be able to see other user space
> > processes' packets. If the HW cannot support the required packet
> > steering, XDP_DRV or XDP_SKB mode has to be used, since these modes
> > do not expose the NIC's packet buffer to user space; instead, packets
> > are copied into user space from the NIC's packet buffer in the kernel.
> >
> > There is an xdpsock benchmarking/test application included. Say that
> > you would like your UDP traffic from port 4242 to end up in queue 16,
> > that we will enable AF_XDP on. Here, we use ethtool for this:
> >
> > ethtool -N p3p2 rx-flow-hash udp4 fn
> > ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
> > action 16
> >
> > Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
> >
> > samples/bpf/xdpsock -i p3p2 -q 16 -l -N
> >
> > For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> > can be displayed with "-h", as usual.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for TX/RX and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> > Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > byte packets, generated by commercial packet generator HW that is
> > generating packets at full 40 Gbit/s line rate.
> >
> > XDP baseline numbers without this RFC:
> > xdp_rxq_info --action XDP_DROP 31.3 Mpps
> > xdp_rxq_info --action XDP_TX 16.7 Mpps
> >
> > XDP performance with this RFC i.e. with the buffer allocator:
> > XDP_DROP 21.0 Mpps
> > XDP_TX 11.9 Mpps
> >
> > AF_PACKET V4 performance from previous RFC on 4.14-rc7:
> > Benchmark V2 V3 V4 V4+ZC
> > rxdrop 0.67 0.73 0.74 33.7
> > txpush 0.98 0.98 0.91 19.6
> > l2fwd 0.66 0.71 0.67 15.5
> >
> > AF_XDP performance:
> > Benchmark XDP_SKB XDP_DRV XDP_DRV_ZC (all in Mpps)
> > rxdrop 3.3 11.6 16.9
> > txpush 2.2 NA* 21.8
> > l2fwd 1.7 NA* 10.4
> >
>
> Hi,
> I also did an evaluation of AF_XDP, however the performance isn't as
> good as above.
> I'd like to share the result and see if there are some tuning suggestions.
>
> System:
> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
Hmmm, why is X540-AT2 not able to use XDP natively?
> AF_XDP performance:
> Benchmark XDP_SKB
> rxdrop 1.27 Mpps
> txpush 0.99 Mpps
> l2fwd 0.85 Mpps
Definitely too low...
What is the performance if you drop packets via iptables?
Command:
$ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
> NIC configuration:
> the command
> "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
> doesn't work on my ixgbe driver, so I use ntuple:
>
> ethtool -K enp10s0f0 ntuple on
> ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
> then
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> ./xdpsock -i enp10s0f0 -r -S --queue=1
>
> I also take a look at perf result:
> For rxdrop:
> 86.56% xdpsock xdpsock [.] main
> 9.22% xdpsock [kernel.vmlinux] [k] nmi
> 4.23% xdpsock xdpsock [.] xq_enq
It looks very strange that you see non-maskable interrupts (NMIs) being
this high...
> For l2fwd:
> 20.81% xdpsock xdpsock [.] main
> 10.64% xdpsock [kernel.vmlinux] [k] clflush_cache_range
Oh, clflush_cache_range is being called!
Does your system use an IOMMU?
> 8.46% xdpsock [kernel.vmlinux] [k] xsk_sendmsg
> 6.72% xdpsock [kernel.vmlinux] [k] skb_set_owner_w
> 5.89% xdpsock [kernel.vmlinux] [k] __domain_mapping
> 5.74% xdpsock [kernel.vmlinux] [k] alloc_skb_with_frags
> 4.62% xdpsock [kernel.vmlinux] [k] netif_skb_features
> 3.96% xdpsock [kernel.vmlinux] [k] ___slab_alloc
> 3.18% xdpsock [kernel.vmlinux] [k] nmi
Again a high count for NMI?!
Maybe you just forgot to tell perf that you want it to decode the
bpf_prog correctly?
https://prototype-kernel.readthedocs.io/en/latest/bpf/troubleshooting.html#perf-tool-symbols
Enable via:
$ sysctl net/core/bpf_jit_kallsyms=1
And use perf report (while BPF is STILL LOADED):
$ perf report --kallsyms=/proc/kallsyms
E.g. for emailing this you can use this command:
$ perf report --sort cpu,comm,dso,symbol --kallsyms=/proc/kallsyms --no-children --stdio -g none | head -n 40
> I observed that the i40e's XDP_SKB result is much better than my ixgbe's result.
> I wonder, in XDP_SKB mode, does the driver make a performance difference?
> Or is my CPU (E5-2440 v2 @ 1.90GHz) too old?
I suspect some setup issue on your system.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Thread overview: 50+ messages
2018-01-31 13:53 [RFC PATCH 00/24] Introducing AF_XDP support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 01/24] xsk: AF_XDP sockets buildable skeleton Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 02/24] xsk: add user memory registration sockopt Björn Töpel
2018-02-07 16:00 ` Willem de Bruijn
2018-02-07 21:39 ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 03/24] xsk: added XDP_{R,T}X_RING sockopt and supporting structures Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 04/24] xsk: add bind support and introduce Rx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect Björn Töpel
2018-02-05 13:42 ` Jesper Dangaard Brouer
2018-02-07 21:11 ` Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 06/24] net: wire up xsk support in the XDP_REDIRECT path Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 07/24] xsk: introduce Tx functionality Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 08/24] i40e: add support for XDP_REDIRECT Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 09/24] samples/bpf: added xdpsock program Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 10/24] netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 11/24] netdevice: added ndo for transmitting a packet from an XDP socket Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 12/24] xsk: add iterator functions to xsk_ring Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 13/24] i40e: introduce external allocator support Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 14/24] i40e: implemented page recycling buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 15/24] i40e: start using " Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 16/24] i40e: separated buff_pool interface from i40e implementaion Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 17/24] xsk: introduce xsk_buff_pool Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 18/24] xdp: added buff_pool support to struct xdp_buff Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 19/24] xsk: add support for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 20/24] xsk: add support for zero copy Tx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 21/24] i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 22/24] i40e: introduced a clean_tx callback function Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 23/24] i40e: introduced Tx completion callbacks Björn Töpel
2018-01-31 13:53 ` [RFC PATCH 24/24] i40e: Tx support for zero copy allocator Björn Töpel
2018-02-01 16:42 ` [RFC PATCH 00/24] Introducing AF_XDP support Jesper Dangaard Brouer
2018-02-02 10:31 ` Jesper Dangaard Brouer
2018-02-05 15:05 ` Björn Töpel
2018-02-07 15:54 ` Willem de Bruijn
2018-02-07 21:28 ` Björn Töpel
2018-02-08 23:16 ` Willem de Bruijn
2018-02-07 17:59 ` Tom Herbert
2018-02-07 21:38 ` Björn Töpel
2018-03-26 16:06 ` William Tu
2018-03-26 16:38 ` Jesper Dangaard Brouer [this message]
2018-03-26 21:58 ` William Tu
2018-03-27 6:09 ` Björn Töpel
2018-03-27 9:37 ` Jesper Dangaard Brouer
2018-03-28 0:06 ` William Tu
2018-03-28 8:01 ` Jesper Dangaard Brouer
2018-03-28 15:05 ` William Tu
2018-03-26 22:54 ` Tushar Dave
2018-03-26 23:03 ` Alexander Duyck
2018-03-26 23:20 ` Tushar Dave
2018-03-28 0:49 ` William Tu
2018-03-27 6:30 ` Björn Töpel