netdev.vger.kernel.org archive mirror
From: Paolo Abeni <pabeni@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	Florian Westphal <fw@strlen.de>,
	Eric Dumazet <edumazet@google.com>,
	Hannes Frederic Sowa <hannes@stressinduktion.org>
Subject: Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
Date: Mon, 25 Sep 2017 22:26:09 +0200
Message-ID: <1506371169.2614.3.camel@redhat.com>
In-Reply-To: <1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactors the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform these tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follows:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.
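
The per-CPU selection suggested in the quoted text ("return socket[CPU#]") can be sketched with a classic BPF program attached to the reuseport group. This is only an illustration: the thread does not show how udp_sink implements --use_bpf, and the helper names cpu_select_filter() and attach_cpu_filter() are hypothetical. The constants are the Linux values from <linux/filter.h> and <asm-generic/socket.h>; the option number 51 is an assumption that holds on most architectures.

```python
import ctypes
import socket
import struct

# Linux constants (assumed values, see lead-in).
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_RET, BPF_A = 0x06, 0x10
SKF_AD_OFF = -0x1000           # base of the ancillary data area
SKF_AD_CPU = 36                # ancillary offset: CPU handling the packet
SO_ATTACH_REUSEPORT_CBPF = 51  # not exported by Python's socket module


def cpu_select_filter():
    """Encode two cBPF instructions: A = current CPU id; return A."""
    insns = [
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, (SKF_AD_OFF + SKF_AD_CPU) & 0xFFFFFFFF),
        (BPF_RET | BPF_A, 0, 0, 0),
    ]
    # struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }
    return b"".join(struct.pack("=HBBI", *insn) for insn in insns)


def attach_cpu_filter(sock):
    """Attach the filter to the reuseport group that `sock` belongs to."""
    prog = cpu_select_filter()
    buf = ctypes.create_string_buffer(prog, len(prog))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", len(prog) // 8, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, fprog)


def open_sink(port, with_filter=False):
    """Open one UDP reuseport socket; optionally attach the CPU filter."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    if with_filter:
        attach_cpu_filter(sock)
    return sock
```

The value returned by the filter is used as an index into the reuseport group, with sockets numbered in the order they joined the group, so the socket opened first receives the packets processed on CPU 0. If the returned index is out of range, the kernel falls back to the default hash-based selection.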

Here are the performance results using a BPF filter that steers each
ingress packet to the reuseport socket whose id matches the ingress
CPU, so that we have a 1-to-1 mapping between the ingress receive queue
and the destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020            3663            21
3               4352            5179            19
4               5318            6194            16
5               6258            7583            21
6               7376            8558            16

[ipv6]
2               2446            3949            61
3               3099            5092            64
4               3698            6611            78
5               4382            7852            79
6               5116            8851            73

Some notes:

- the figures were obtained with:

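# Configure $n combined rx queues on the NIC.
ethtool -L em2 combined $n
MASK=1
for I in $(seq 0 $((n - 1))); do
        # Only the first sink passes --use_bpf.
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        # Set the affinity mask of the sink just started.
        taskset -p $((MASK << (I + n))) $!
done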

- the IPv6 routing code currently has a significant bottleneck in
ip6_pol_route(): I see a lot of contention on a dst refcount, so
without early demux performance does not scale well there.

- for maximum performance the BH and the user space sink need to run on
different CPUs. Yes, we get some more cacheline misses and a little
contention on the receive queue spin lock, but far fewer icache misses
and more CPU cycles available; the overall throughput is a lot higher
than when binding the sink to the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

Thread overview: 14+ messages
2017-09-22 21:06 [RFC PATCH 00/11] udp: full early demux for unconnected sockets Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 01/11] net: add support for noref skb->sk Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 02/11] net: allow early demux to fetch noref socket Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 03/11] udp: do not touch socket refcount in early demux Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 04/11] net: add simple socket-like dst cache helpers Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 05/11] udp: perform full socket lookup in early demux Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 06/11] ip/route: factor out helper for local route creation Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 07/11] ipv6/addrconf: add an helper for inet6 address lookup Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 08/11] net: implement local route cache inside ifaddr Paolo Abeni
2017-09-22 21:06 ` [RFC PATCH 09/11] route: add ipv4/6 helpers to do partial route lookup vs local dst Paolo Abeni
2017-09-22 21:58 ` [RFC PATCH 00/11] udp: full early demux for unconnected sockets Eric Dumazet
2017-09-25 20:26   ` Paolo Abeni [this message]
2017-09-26 20:18 ` [RFC PATCH 10/11] IP: early demux can return an error code Paolo Abeni
2017-09-26 20:18 ` [RFC PATCH 11/11] udp: dst lookup in early demux for unconnected sockets Paolo Abeni
