Linux network is damn fast, need more use XDP (Was: DC behaviors today)

From: Jesper Dangaard Brouer <brouer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Dave Taht <dave-DJTZSAfmFlI@public.gmane.org>
Cc: "netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	bloat-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org,
	"Christina Jacob"
	<christina.jacob.koikara-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	"Joel Wirāmu Pauling"
	<joel-T541r0D4Wprk1uMJSBkQmQ@public.gmane.org>,
	"cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org"
	<cerowrt-devel-JXvr2/1DY2fm6VMwtOF2vx4hnT+Y9+D1@public.gmane.org>,
	"David Ahern"
	<dsa-qUQiAmfTcIp+XZJcv9eMoEEOCMrvLtNR@public.gmane.org>,
	"Tariq Toukan" <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Linux network is damn fast, need more use XDP (Was:  DC behaviors today)
Date: Mon, 4 Dec 2017 11:56:51 +0100
Message-ID: <20171204110923.3a213986@redhat.com>
In-Reply-To: <87bmjff7l6.fsf_-_-DEcvNJsl3XAMlNlIB+YWUg@public.gmane.org>

On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@taht.net> wrote:

> Changing the topic, adding bloat.

Adding netdev, and also adjusting the topic to be a rant that the Linux
kernel network stack is actually damn fast, and if you need something
faster, then XDP can solve your needs...

> Joel Wirāmu Pauling <joel@aenertia.net> writes:
> 
> > Just from a Telco/Industry perspective slant.
> >
> > Everything in DC has moved to SFP28 interfaces at 25Gbit as the server
> > port of interconnect. Everything TOR wise is now QSFP28 - 100Gbit.
> > Mellanox X5 cards are the current hotness, and their offload
> > enhancements (ASAP2 - which is sorta like DPDK on steroids) allows for
> > OVS flow rules programming into the card. We have a lot of customers
> > chomping at the bit for that feature (disclaimer I work for Nuage
> > Networks, and we are working on enhanced OVS to do just that) for NFV
> > workloads.  
> 
> What Jesper's been working on for ages has been to try and get linux's
> PPS up for small packets, which last I heard was hovering at about
> 4Gbits.

I hope you made a typo here, Dave. The normal Linux kernel is definitely
way beyond 4Gbit/s; you must have misunderstood something. Maybe you
meant 40Gbit/s? (which is also too low)

Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown that
the Linux kernel network stack scales to 94Gbit/s (line rate minus
overhead).  But when the driver's page-recycler fails, we hit
bottlenecks in the page allocator, which cause negative scaling to
around 43Gbit/s.

[1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com

Linux has for a _long_ time been doing 10Gbit/s TCP-stream easily, on
a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating packets,
but over the last couple of years the network stack has been optimized
(with UDP workloads), and as a result we can do 10G without TSO/GRO on
a single CPU.  This is "only" 812Kpps with MTU size frames.
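
For anyone wanting to check where the 812Kpps figure comes from, here
is a minimal sketch of the arithmetic.  It assumes a 1500-byte MTU and
the standard Ethernet per-frame overhead on the wire (header, FCS,
preamble and inter-frame gap); the function and constant names are
made up for this illustration.

```python
# Standard Ethernet per-frame overhead on the wire, in bytes.
ETH_HEADER = 14       # dst MAC + src MAC + EtherType
ETH_FCS = 4           # frame check sequence
PREAMBLE_SFD = 8      # preamble + start-of-frame delimiter
INTERFRAME_GAP = 12   # mandatory idle time between frames


def packets_per_second(link_bps, mtu=1500):
    """Packets per second needed to fill a link with MTU-size frames."""
    wire_bytes = mtu + ETH_HEADER + ETH_FCS + PREAMBLE_SFD + INTERFRAME_GAP
    return link_bps / (wire_bytes * 8)


if __name__ == "__main__":
    print(f"10Gbit/s at MTU 1500: {packets_per_second(10e9):,.0f} pps")
```

With these assumptions a full 10Gbit/s link carries roughly 812
thousand MTU-size frames per second.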

It is important to NOTICE that I'm mostly talking about SINGLE-CPU
performance.  But the Linux kernel scales very well to more CPUs, and
you can scale this up, although we are starting to hit scalability
issues in MM-land[1].

I've also demonstrated that the netdev community has optimized the
kernel's per-CPU processing power to around 2Mpps.  What does this
really mean... well, with MTU size packets 812Kpps was 10Gbit/s, thus
25Gbit/s should be around 2Mpps.... That implies Linux can do 25Gbit/s
on a single CPU without GRO (MTU size frames).  Do you need more, I ask?
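
The scaling step from 10Gbit/s to 25Gbit/s can be sketched the same
way.  This assumes 1538 bytes on the wire per MTU-size frame (1500
bytes plus standard Ethernet overhead); the names are hypothetical and
only serve this back-of-the-envelope check.

```python
WIRE_BYTES_PER_MTU_FRAME = 1500 + 14 + 4 + 8 + 12  # = 1538 bytes on the wire


def mtu_frames_per_second(link_bps):
    """MTU-size frames per second at a given link speed."""
    return link_bps / (WIRE_BYTES_PER_MTU_FRAME * 8)


def wire_time_ns(link_bps):
    """Time one MTU-size frame occupies the wire, in nanoseconds."""
    return WIRE_BYTES_PER_MTU_FRAME * 8 / link_bps * 1e9


if __name__ == "__main__":
    for gbit in (10, 25):
        bps = gbit * 1e9
        print(f"{gbit}Gbit/s: {mtu_frames_per_second(bps) / 1e6:.2f} Mpps, "
              f"{wire_time_ns(bps):.1f} ns per frame")
```

At 25Gbit/s this gives a little over 2 Mpps, which is why a per-CPU
rate of around 2Mpps lines up with 25Gbit/s of MTU-size frames.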

> The route table lookup is also really expensive on the main cpu.

Well, it used to be very expensive. Vincent Bernat wrote some excellent
blog posts[2][3] on the recent improvements over kernel versions, and
gave due credit to the people involved.

[2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
[3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux

He measured a cost of around 25 to 35 nanoseconds per route lookup.  My
own recent measurements showed a cost of 36.9 ns for fib_table_lookup.
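
To put a lookup cost of this size in perspective, here is a rough
sketch comparing it with the time available per packet.  It takes the
36.9 ns figure and the roughly 2Mpps per-CPU rate from this mail and
simply divides one by the other; the two numbers may well come from
different test setups, so treat the result as an illustration, not a
measurement.  The function names are made up.

```python
def per_packet_budget_ns(pps):
    """Nanoseconds available per packet at a given packet rate."""
    return 1e9 / pps


def lookup_share(lookup_ns, pps):
    """Fraction of the per-packet time budget spent on one route lookup."""
    return lookup_ns / per_packet_budget_ns(pps)


if __name__ == "__main__":
    pps = 2_000_000
    print(f"budget at 2Mpps: {per_packet_budget_ns(pps):.0f} ns per packet")
    print(f"36.9 ns lookup:  {lookup_share(36.9, pps):.1%} of that budget")
```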

> Does this stuff offload the route table lookup also?

If you have not heard, the netdev community has worked on something
called XDP (eXpress Data Path).  This is a new layer in the network
stack that basically operates at the same "layer"/level as DPDK.
Thus, surprise, we get the same performance numbers as DPDK. E.g. I can
do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs=14.6Mpps)
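
For scale, these packet rates can be compared with the theoretical
maximum of a 10Gbit/s Ethernet port.  The sketch below assumes
minimum-size 64-byte frames and a 10GbE link (ixgbe is a 10GbE
driver, but the mail does not state the frame size used, so that part
is an assumption).  The names are hypothetical.

```python
MIN_FRAME = 64        # minimum Ethernet frame, including FCS
PREAMBLE_SFD = 8      # preamble + start-of-frame delimiter
INTERFRAME_GAP = 12   # mandatory idle time between frames


def min_frame_line_rate_pps(link_bps):
    """Maximum packets per second with minimum-size Ethernet frames."""
    wire_bytes = MIN_FRAME + PREAMBLE_SFD + INTERFRAME_GAP  # 84 bytes
    return link_bps / (wire_bytes * 8)


if __name__ == "__main__":
    line_rate = min_frame_line_rate_pps(10e9)
    print(f"10GbE line rate: {line_rate / 1e6:.2f} Mpps")
    for mpps in (13.4, 14.6):
        print(f"{mpps} Mpps is {mpps * 1e6 / line_rate:.1%} of line rate")
```

Under those assumptions the multi-CPU number is close to what the
wire itself can carry.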

We can actually use XDP for (software) offloading the Linux routing
table.  There are two methods we are experimenting with:

(1) externally monitor route changes from userspace and update BPF-maps
to reflect them. That approach is already accepted upstream[4][5].  I'm
measuring 9,513,746 pps per CPU with that approach.

(2) add a bpf helper to simply call fib_table_lookup() from the XDP hook.
These are still experimental patches (credit to David Ahern), and I've
measured 9,350,160 pps with this approach on a single CPU.  Using more
CPUs we hit 14.6Mpps (only used 3 CPUs in that test)

[4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
[5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c
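
The idea behind method (1) can be modelled in a few lines of
userspace Python.  This is only a toy: a plain dictionary stands in
for the BPF map, update()/delete() stand in for the userspace daemon
reacting to route changes, and lookup() stands in for the
longest-prefix match done per packet.  It is IPv4 only, all names are
hypothetical, and it does not reproduce the code of the samples in
[4][5].

```python
import ipaddress


class RouteMap:
    """Toy stand-in for a routing table mirrored into a lookup map."""

    def __init__(self):
        # prefix length -> {network address as int: next hop}
        self._routes = {}

    def update(self, prefix, next_hop):
        """Called when userspace sees a route being added or changed."""
        net = ipaddress.ip_network(prefix)
        self._routes.setdefault(net.prefixlen, {})[int(net.network_address)] = next_hop

    def delete(self, prefix):
        """Called when userspace sees a route being removed."""
        net = ipaddress.ip_network(prefix)
        self._routes.get(net.prefixlen, {}).pop(int(net.network_address), None)

    def lookup(self, addr):
        """Longest-prefix match: try the most specific prefixes first."""
        ip = int(ipaddress.ip_address(addr))
        for plen in sorted(self._routes, reverse=True):
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hop = self._routes[plen].get(ip & mask)
            if hop is not None:
                return hop
        return None


if __name__ == "__main__":
    table = RouteMap()
    table.update("0.0.0.0/0", "default-gw")
    table.update("10.0.0.0/8", "core")
    table.update("10.1.0.0/16", "rack1")
    print(table.lookup("10.1.2.3"))   # most specific match wins
    table.delete("10.1.0.0/16")
    print(table.lookup("10.1.2.3"))   # falls back to the /8
```

The point of the split is that the per-packet path only ever reads the
map, while all the work of tracking route changes stays in userspace.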

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer