[Intel-wired-lan] [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
Date: Wed, 16 May 2018 12:47:07 +0200
Message-ID: <20180516124707.59d60d2c@redhat.com>
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>

On Tue, 15 May 2018 21:06:03 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
> 
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputting packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers.
> 
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference.
> 
> Benchmark   XDP_SKB    XDP_DRV    XDP_DRV with zerocopy
> rxdrop       2.9*       9.6*       21.5
> txpush       2.6*       -          21.6
> l2fwd        1.9*       2.5*       15.0

These performance numbers are actually amazing.

When reaching these amazing/crazy speeds, where we are approaching the
speed of light (travel 30 cm in 1 nanosec), we have to view these
numbers differently, because we are actually working on a nanosec scale.

21.5 Mpps corresponds to 46.5 nanosec per packet (1/21.5*10^3).

If we want to optimize for +1 Mpps, i.e. reach 22.5 Mpps
(1/22.5*10^3 = 44.44 ns), you actually only have to shave 2 nanosec off
the per-packet cost.  With this 2.0 GHz CPU that is in theory only 4
cycles, but the CPU likely executes more than one instruction per cycle
(I see around 2.5 ins per cycle), so we are looking at (2*2*2.5) needing
to find 10 instructions for +1 Mpps.
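
The arithmetic above can be reproduced with a small sketch. The 2.0 GHz
clock and the ~2.5 instructions per cycle are the figures quoted in this
mail; the function names are made up for illustration and are not part of
any kernel or benchmark tooling.

```python
# Sketch: convert a packet rate into a per-packet time budget.
# CPU_GHZ and IPC come from the mail (2.0 GHz, ~2.5 ins/cycle);
# the helper names are hypothetical.

CPU_GHZ = 2.0   # clock cycles per nanosecond
IPC = 2.5       # observed instructions per cycle


def mpps_to_ns(mpps: float) -> float:
    """Nanoseconds available per packet at a rate given in Mpps."""
    return 1e3 / mpps


def budget_for_gain(base_mpps: float, gain_mpps: float = 1.0) -> dict:
    """Per-packet time, cycles and instructions to remove to gain `gain_mpps`."""
    ns = mpps_to_ns(base_mpps) - mpps_to_ns(base_mpps + gain_mpps)
    cycles = ns * CPU_GHZ
    return {"ns": ns, "cycles": cycles, "instructions": cycles * IPC}


if __name__ == "__main__":
    b = budget_for_gain(21.5)
    print(f"{mpps_to_ns(21.5):.2f} ns per packet at 21.5 Mpps")
    print(f"+1 Mpps needs {b['ns']:.2f} ns = {b['cycles']:.1f} cycles "
          f"= {b['instructions']:.1f} instructions")
```

The unrounded result is slightly above the rounded "2 ns, 4 cycles, 10
instructions" used in the text, because the mail rounds 2.07 ns down to 2.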

Comparing XDP_DROP at 32.3 Mpps with ZC-rxdrop at 21.5 Mpps, this is
actually only a "slowdown" of 15.55 ns for having the frame travel
through xdp_do_redirect, do the map lookup etc, get queued into
userspace, and be returned to the kernel.  That is rather amazingly fast.

  (1/21.5*10^3)-(1/32.3*10^3) = 15.55 ns

Another performance number which is amazing is your l2fwd number of
15 Mpps, because it is faster than xdp_redirect_map on i40e NICs on my
system, which runs at 12.2 Mpps (2.8 Mpps slower).  Again looking at the
nanosec scale instead, this corresponds to 15.3 ns per packet.
  I expect this improvement comes from avoiding page_frag_free, and
avoiding the TX dma_map call (as you pre-map the pages for TX DMA).
Reverse calculating based on perf percentages, I find that these should
only cost 7.18 ns.  Maybe the rest is because you are running TX and
TX-DMA completion on another CPU.
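
As a rough check of the l2fwd comparison, the sketch below recomputes the
per-packet saving and the part not explained by the 7.18 ns estimate. The
rates and the 7.18 ns figure are from this mail; the variable names are
illustrative only.

```python
# Sketch: how much of the l2fwd gain is left unexplained.
# All input numbers are quoted from the mail; names are hypothetical.

L2FWD_ZC_MPPS = 15.0        # AF_XDP zero-copy l2fwd
REDIRECT_MAP_MPPS = 12.2    # xdp_redirect_map on i40e
ACCOUNTED_NS = 7.18         # page_frag_free + TX dma_map, perf-based estimate


def ns_per_packet(mpps: float) -> float:
    """Per-packet time in nanoseconds for a rate given in Mpps."""
    return 1e3 / mpps


saved_ns = ns_per_packet(REDIRECT_MAP_MPPS) - ns_per_packet(L2FWD_ZC_MPPS)
unexplained_ns = saved_ns - ACCOUNTED_NS

if __name__ == "__main__":
    print(f"saved per packet:  {saved_ns:.2f} ns")
    print(f"not accounted for: {unexplained_ns:.2f} ns")
```

Under these numbers a bit more than half of the saving is not covered by
the two avoided calls, which is the part attributed above to running TX
and TX-DMA completion on another CPU.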

I notice you are also using the XDP return-API, which still does a
rhashtable_lookup per frame.  I plan to optimize this to do bulking, to
get away from per frame lookup.  Thus, this should get even faster.

> * From AF_XDP V3 patch set and cover letter.
> 
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV     XDP_DRV with zerocopy
> rxdrop       2.1        3.3       3.3
> l2fwd        1.4        1.8       3.1
> 
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.

So, what are you telling me with your number 36.9 Mpps for
direct-socket-rxdrop...

Compared to XDP_DROP at 32.3 Mpps, are you saying that it only costs
3.86 nanosec to call the XDP bpf_prog which returns XDP_DROP?  That is
very impressive actually. (1/32.3*10^3)-(1/36.9*10^3) = 3.86 ns

Compared to ZC-AF_XDP rxdrop at 21.5 Mpps, are you saying that the XDP
redirect infrastructure, map lookups etc (incl. return-API per frame)
cost 19.41 nanosec, (1/21.5*10^3)-(1/36.9*10^3)?  That is approx 40
clock-cycles or 100 (speculative) instructions.  That is not too bad,
and we are still optimizing this stuff.
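
The two comparisons against the direct-socket number can be tabulated with
a short sketch. The rates, the 2.0 GHz clock and the ~2.5 instructions per
cycle are from this thread; the dictionary and function names are
assumptions made for illustration.

```python
# Sketch: extra per-packet cost of each RX path relative to the
# direct-socket experiment. Rates are the 64-byte results quoted above.

CPU_GHZ = 2.0   # clock cycles per nanosecond
IPC = 2.5       # instructions per cycle, as observed in the mail

RATES_MPPS = {
    "XDP_DROP": 32.3,
    "ZC AF_XDP rxdrop": 21.5,
}
DIRECT_SOCKET_MPPS = 36.9


def extra_cost(slow_mpps: float, fast_mpps: float) -> tuple:
    """Return (ns, cycles, instructions) the slower path spends extra per packet."""
    ns = 1e3 / slow_mpps - 1e3 / fast_mpps
    cycles = ns * CPU_GHZ
    return ns, cycles, cycles * IPC


if __name__ == "__main__":
    for name, rate in RATES_MPPS.items():
        ns, cycles, ins = extra_cost(rate, DIRECT_SOCKET_MPPS)
        print(f"{name:18s} +{ns:5.2f} ns  ~{cycles:4.0f} cycles  ~{ins:4.0f} ins")
```

The instruction counts are only an upper-level estimate, since they assume
the measured IPC holds for the code in question.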

> XDP performance on our system as a base line:
> 
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32.3M       0
> 
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3.3M        0

Overall I'm *very* impressed by the performance of ZC AF_XDP.
Just remember that measuring improvement in +N Mpps is actually
misleading when operating at these (light) speeds, because the same
+N Mpps corresponds to fewer and fewer nanosec as the base rate goes up.
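
To illustrate why a "+N Mpps" figure depends on the starting point, the
sketch below computes the per-packet time that +1 Mpps represents at
several base rates taken from the tables in this thread. The function name
is hypothetical.

```python
# Sketch: the per-packet time behind "+1 Mpps" shrinks as the base rate rises.
# Base rates are 64-byte results quoted in this thread.

def ns_saved_by_gain(base_mpps: float, gain_mpps: float = 1.0) -> float:
    """Nanoseconds removed per packet when going from base to base+gain Mpps."""
    return 1e3 / base_mpps - 1e3 / (base_mpps + gain_mpps)


if __name__ == "__main__":
    for base in (2.9, 9.6, 21.5, 36.9):
        print(f"{base:5.1f} Mpps + 1 Mpps -> "
              f"{ns_saved_by_gain(base):6.2f} ns saved per packet")
```

At the XDP_SKB rate the same +1 Mpps is worth tens of nanosec per packet,
while at the direct-socket rate it is well under one nanosec.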

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

WARNING: multiple messages have this Message-ID (diff)

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: "Björn Töpel" <bjorn.topel@gmail.com>
Cc: magnus.karlsson@gmail.com, magnus.karlsson@intel.com,
	alexander.h.duyck@intel.com, alexander.duyck@gmail.com,
	john.fastabend@gmail.com, ast@fb.com,
	willemdebruijn.kernel@gmail.com, daniel@iogearbox.net,
	mst@redhat.com, netdev@vger.kernel.org,
	"Björn Töpel" <bjorn.topel@intel.com>,
	michael.lundkvist@ericsson.com, jesse.brandeburg@intel.com,
	anjali.singhai@intel.com, qi.z.zhang@intel.com,
	intel-wired-lan@lists.osuosl.org, brouer@redhat.com
Subject: Re: [RFC PATCH bpf-next 00/12] AF_XDP, zero-copy support
Date: Wed, 16 May 2018 12:47:07 +0200
Message-ID: <20180516124707.59d60d2c@redhat.com>
In-Reply-To: <20180515190615.23099-1-bjorn.topel@gmail.com>
