Re: epoll_wait() performance - Jesper Dangaard Brouer

From: Jesper Dangaard Brouer <brouer@redhat.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: 'Marek Majkowski' <marek@cloudflare.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	network dev <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Paolo Abeni <pabeni@redhat.com>,
	brouer@redhat.com
Subject: Re: epoll_wait() performance
Date: Thu, 28 Nov 2019 12:12:05 +0100
Message-ID: <20191128121205.65c8dea1@carbon>
In-Reply-To: <5eecf41c7e124d7dbc0ab363d94b7d13@AcuMS.aculab.com>

On Wed, 27 Nov 2019 16:04:12 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 27 November 2019 15:48
> > On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> >   
> > > ...  
> > > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > > and faffing with the user iov[].)
> > > > >
> > > > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > > > However the second poll has a significant performance cost (but less than using recvmmsg()).  
> > > >
> > > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > > single message, should be faster than two syscalls - recv() and
> > > > poll().  
> > >
> > > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > > significant overhead, most likely due to the crappy code that tries to stop
> > > the kernel buffer being overrun.
> > >
> > > I need to run the tests on a system with a 'home built' kernel to see how much
> > > difference this makes (by seeing how much slower duplicating the copy makes it).
> > >
> > > The system call cost of poll() gets factored over a reasonable number of sockets.
> > > So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> > > even allowing for looking up the fd.
> > >
> > > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > > expect one message and to call the poll() function before each subsequent receive.
> > >
> > > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > > I don't know how much that actually costs.
> > > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > >
> > > We really do want to receive all these UDP packets in a timely manner.
> > > Although very low latency isn't itself an issue.
> > > The data is telephony audio with (typically) one packet every 20ms.
> > > The code only looks for packets every 10ms - that helps no end since, in principle,
> > > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.  
> > 
> > I have a simple udp_sink tool[1] that cycles through the different
> > receive socket system calls.  I gave it a quick spin on a F31 kernel
> > 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> > to see a significant regression/slowdown for recvMmsg.
> > 
> > $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> >           	run      count   	ns/pkt	pps		cycles	payload
> > recvMmsg/32	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
> > recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
> > read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
> > recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1
> > 
> > Normal recvmsg has almost double the performance of recvmmsg.
> >  recvMmsg/32 = 684,270 pps
> >  recvmsg     = 1,123,824 pps  
> 
> Can you test recv() as well?

Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662

$ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2))
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	653.29	1530704.29	2351	18	 demux:1
recvmsg   	run:  0	 2000000	631.01	1584760.06	2271	18	 demux:1
read      	run:  0	 2000000	582.24	1717518.16	2096	18	 demux:1
recvfrom  	run:  0	 2000000	547.26	1827269.12	1970	18	 demux:1
recv      	run:  0	 2000000	547.37	1826930.39	1970	18	 demux:1

> I think it might be faster than read().

Slightly, but same speed as recvfrom.
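
For reading the tables: the ns/pkt, pps and cycles columns look like three
views of the same measurement.  The sketch below is only my reading of the
output, not code from udp_sink; the helper names are made up for
illustration.  It assumes pps is the reciprocal of ns/pkt and that cycles
are counted at a constant clock rate.  With the recv row (547.37 ns/pkt,
1970 cycles) this gives roughly 1.83 Mpps, which matches the pps column up
to rounding, and a clock of about 3.6 GHz.

```c
/* Hypothetical helpers for interpreting the udp_sink columns.
 * Assumption: pps = 1e9 / (ns per packet), constant-rate cycle counter. */

/* Packets per second implied by an average per-packet cost in ns. */
double pps_from_ns(double ns_per_pkt)
{
    return 1e9 / ns_per_pkt;
}

/* Clock rate in GHz implied by a cycles/pkt and ns/pkt pair
 * (cycles per nanosecond is the same number as GHz). */
double ghz_from_cycles(double cycles_per_pkt, double ns_per_pkt)
{
    return cycles_per_pkt / ns_per_pkt;
}

/* How much slower call A is than baseline call B, in percent. */
double slowdown_pct(double ns_a, double ns_b)
{
    return (ns_a / ns_b - 1.0) * 100.0;
}
```

For example, slowdown_pct(582.24, 547.37) puts read() a bit over 6% behind
recv() in this run.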

Strangely, recvMmsg is not that bad in this test run, and it is on the
same kernel 5.3.12-300.fc31.x86_64 and hardware.  I have CPU-pinned
udp_sink, because if it jumps to the CPU doing RX-NAPI it will be fighting
for CPU time with softirq (which Eric has mitigated a bit), and the
results are bad and look like this:

[broadwell src]$ sudo taskset -c 5 ./udp_sink --port 9  --count $((10**6*2))
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	1252.44	798439.60	4508	18	 demux:1
recvmsg   	run:  0	 2000000	1917.65	521470.72	6903	18	 demux:1
read      	run:  0	 2000000	1817.31	550263.37	6542	18	 demux:1
recvfrom  	run:  0	 2000000	1742.44	573909.46	6272	18	 demux:1
recv      	run:  0	 2000000	1741.51	574213.08	6269	18	 demux:1
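
The runs above use taskset to pin the process from the outside.  An
application can do the same for itself with sched_setaffinity().  This is
a minimal sketch, not taken from udp_sink; pin_to_cpu() and
first_allowed_cpu() are hypothetical helper names.  Which CPU to avoid
depends on where the NIC's RX interrupts land, which the sketch does not
try to discover.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for CPU_SET() and sched_setaffinity() */
#endif
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = caller */
}

/* Lowest-numbered CPU the caller is currently allowed to run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Calling pin_to_cpu(1) before the receive loop would correspond to the
"taskset -c 1" invocation, assuming CPU 1 is not the one doing RX-NAPI.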


> [...]
> > Found some old results (approx v4.10-rc1):
> > 
> > [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> >  recvMmsg/32    run: 0 10000000 537.89  1859106.74      2155    21559353816
> >  recvmsg        run: 0 10000000 552.69  1809344.44      2215    22152468673
> >  read           run: 0 10000000 476.65  2097970.76      1910    19104864199
> >  recvfrom       run: 0 10000000 450.76  2218492.60      1806    18066972794  
> 
> That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
> recvmmsg() and recvmsg() are similar - but both a lot slower than recv().

Notice that the tool can also test connected UDP sockets, which is what
was done in the old results above.  I did a quick run with --connect:

$ sudo taskset -c 1 ./udp_sink --port 9  --count $((10**6*2)) --connect
          	run      count   	ns/pkt	pps		cycles	payload
recvMmsg/32  	run:  0	 2000000	500.72	1997107.02	1802	18	 demux:1 c:1
recvmsg   	run:  0	 2000000	662.52	1509380.46	2385	18	 demux:1 c:1
read      	run:  0	 2000000	613.46	1630103.14	2208	18	 demux:1 c:1
recvfrom  	run:  0	 2000000	577.71	1730974.34	2079	18	 demux:1 c:1
recv      	run:  0	 2000000	578.27	1729305.35	2081	18	 demux:1 c:1

And now, recvMmsg is actually the fastest...?!
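
For readers who have not used these calls, here is roughly what a
connected UDP socket plus a batched receive looks like.  This is a sketch
under assumptions, not the udp_sink implementation: I am assuming the
"/32" in "recvMmsg/32" is the batch size handed to recvmmsg(), and
udp_connect_peer(), recv_batch(), BATCH and PKT_MAX are names invented
for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for recvmmsg() and struct mmsghdr */
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

#define BATCH   32              /* assumed meaning of "/32" */
#define PKT_MAX 2048            /* arbitrary per-datagram buffer size */

/* Connect a UDP socket to one peer.  After this the socket only
 * receives datagrams from that address and port. */
int udp_connect_peer(int fd, const struct sockaddr_in *peer)
{
    return connect(fd, (const struct sockaddr *)peer, sizeof(*peer));
}

/* Receive up to BATCH datagrams with one system call.
 * bufs[i] receives datagram i and lens[i] its length.
 * flags is passed to recvmmsg(), e.g. MSG_DONTWAIT or MSG_WAITFORONE.
 * Returns the number of datagrams received, or -1 with errno set. */
int recv_batch(int fd, char bufs[][PKT_MAX], unsigned int lens[], int flags)
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    int n;

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = PKT_MAX;
        msgs[i].msg_hdr.msg_iov = &iov[i];   /* one buffer per message */
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    n = recvmmsg(fd, msgs, BATCH, flags, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;           /* bytes in datagram i */
    return n;
}
```

The per-call setup of the msghdr/iovec arrays is the part that gets
amortised when a batch is actually full, and it is pure overhead when only
one datagram is waiting.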


p.s.
DISCLAIMER: Do notice that this udp_sink tool is a network-overload
micro-benchmark, which does not represent the use-case you are
describing.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

