From: Jesper Dangaard Brouer <brouer@redhat.com>
To: David Laight <David.Laight@ACULAB.COM>
Cc: 'Marek Majkowski' <marek@cloudflare.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
network dev <netdev@vger.kernel.org>,
kernel-team <kernel-team@cloudflare.com>,
brouer@redhat.com, Paolo Abeni <pabeni@redhat.com>
Subject: Re: epoll_wait() performance
Date: Wed, 27 Nov 2019 16:48:21 +0100
Message-ID: <20191127164821.1c41deff@carbon>
In-Reply-To: <5f4028c48a1a4673bd3b38728e8ade07@AcuMS.aculab.com>
On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@ACULAB.COM> wrote:
> ...
> > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > and faffing with the user iov[].)
> > >
> > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> >
> > That sounds wrong. Single recvmmsg(), even when receiving only a
> > single message, should be faster than two syscalls - recv() and
> > poll().
>
> My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> significant overhead, most likely due to the crappy code that tries to stop
> the kernel buffer being overrun.
>
> I need to run the tests on a system with a 'home built' kernel to see how much
> difference this makes (by seeing how much slower duplicating the copy makes it).
>
> The system call cost of poll() gets factored over a reasonable number of sockets.
> So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> even allowing for looking up the fd.
>
> This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> expect one message and to call the poll() function before each subsequent receive.
>
> There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> I don't know how much that actually costs.
> In this case the process is likely to be running at a RT priority and pinned to a cpu.
> In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
>
> We really do want to receive all these UDP packets in a timely manner.
> Although very low latency isn't itself an issue.
> The data is telephony audio with (typically) one packet every 20ms.
> The code only looks for packets every 10ms - that helps no end since, in principle,
> only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
I have a simple udp_sink tool[1] that cycles through the different
receive socket system calls. I gave it a quick spin on an F31 kernel
5.3.12-300.fc31.x86_64 on an mlx5 100G interface, and I'm very surprised
to see a significant regression/slowdown for recvMmsg.
$ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
             run     count      ns/pkt         pps  cycles  payload
recvMmsg/32  run: 0  10000000  1461.41   684270.96    5261       18  demux:1
recvmsg      run: 0  10000000   889.82  1123824.84    3203       18  demux:1
read         run: 0  10000000   974.81  1025841.68    3509       18  demux:1
recvfrom     run: 0  10000000  1056.51   946513.44    3803       18  demux:1
Normal recvmsg has almost double the performance of recvmmsg (about 1.6x):
 recvMmsg/32 =   684,270 pps
 recvmsg     = 1,123,824 pps
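As a sanity check on how the columns relate, here is a minimal sketch in C. It assumes that pps is simply 1e9 divided by ns/pkt and that the cycles column is a per-packet count; neither is stated in the output, and the helper names are made up for illustration.

```c
#include <stdio.h>

/* Packets per second implied by a per-packet cost in nanoseconds
 * (assumption: pps = 1e9 / ns_per_pkt). */
static double ns_to_pps(double ns_per_pkt)
{
    return 1e9 / ns_per_pkt;
}

/* Clock rate in GHz implied by the cycles and ns/pkt columns
 * (assumption: the cycles column counts cycles per packet). */
static double implied_ghz(double cycles_per_pkt, double ns_per_pkt)
{
    return cycles_per_pkt / ns_per_pkt;
}

/* How many times faster call B is than call A, given their ns/pkt. */
static double speedup(double ns_a, double ns_b)
{
    return ns_a / ns_b;
}

/* Print the derived numbers for the recvMmsg/32 and recvmsg rows. */
static void print_summary(void)
{
    printf("recvMmsg/32: %.0f pps\n", ns_to_pps(1461.41));
    printf("recvmsg:     %.0f pps\n", ns_to_pps(889.82));
    printf("recvmsg vs recvMmsg/32: %.2fx\n", speedup(1461.41, 889.82));
    printf("implied clock: %.2f GHz\n", implied_ghz(5261, 1461.41));
}
```

Calling print_summary() from a main() reproduces the pps column to within rounding, and gives a ratio of roughly 1.6 between recvmsg and recvMmsg/32.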
[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
For connected UDP socket:
$ sudo ./udp_sink --port 9 --repeat 1 --connect
             run     count     ns/pkt         pps  cycles  payload
recvMmsg/32  run: 0  1000000  1240.06   806411.73    4464       18  demux:1 c:1
recvmsg      run: 0  1000000   768.80  1300724.75    2767       18  demux:1 c:1
read         run: 0  1000000   823.40  1214478.40    2964       18  demux:1 c:1
recvfrom     run: 0  1000000   889.19  1124616.11    3201       18  demux:1 c:1
Found some old results (approx v4.10-rc1):
[brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
recvMmsg/32  run: 0  10000000  537.89  1859106.74  2155  21559353816
recvmsg      run: 0  10000000  552.69  1809344.44  2215  22152468673
read         run: 0  10000000  476.65  2097970.76  1910  19104864199
recvfrom     run: 0  10000000  450.76  2218492.60  1806  18066972794
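
For readers following the thread, the two receive strategies under discussion can be sketched roughly as below. This is only an illustration, not the code from udp_sink or from David's application. The function names, PKT_MAX, and the use of MSG_DONTWAIT are assumptions, and the first poll() in drain_recv_repoll() would normally be the application's main poll()/epoll_wait() over all sockets.

```c
#define _GNU_SOURCE           /* for recvmmsg() and struct mmsghdr */
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define BATCH   32            /* same batch size as the recvMmsg/32 rows */
#define PKT_MAX 2048          /* hypothetical per-datagram buffer size */

/* Strategy 1: poll the fd with a zero timeout before each recv(), so
 * one extra poll() is paid per datagram to learn whether another one
 * is queued. Returns the number of datagrams read, or -1 on error. */
static int drain_recv_repoll(int fd)
{
    char buf[PKT_MAX];
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = 0;

    for (;;) {
        int r = poll(&pfd, 1, 0);
        if (r < 0)
            return -1;
        if (r == 0 || !(pfd.revents & POLLIN))
            break;                      /* nothing (more) to read */
        if (recv(fd, buf, sizeof(buf), 0) < 0)
            return -1;
        n++;
    }
    return n;
}

/* Strategy 2: one recvmmsg() call per batch of up to BATCH datagrams.
 * MSG_DONTWAIT makes the call return as soon as the queue is empty.
 * Returns the number of datagrams read, or -1 on error.
 * Note: the static buffers make this sketch non-reentrant. */
static int drain_recvmmsg(int fd)
{
    static char bufs[BATCH][PKT_MAX];
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    int total = 0;

    for (;;) {
        /* The per-message headers and iovecs must be set up for every
         * call; this setup is part of the cost discussed above. */
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len = PKT_MAX;
            msgs[i].msg_hdr.msg_iov = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        int got = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
        if (got < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                  /* queue was already empty */
            return -1;
        }
        total += got;
        if (got < BATCH)
            break;                      /* queue drained */
    }
    return total;
}
```

With a single queued datagram, strategy 1 costs two poll() calls plus one recv(), while strategy 2 costs one recvmmsg() call that has to set up and copy in BATCH message headers. Which of these is cheaper in practice is exactly what the numbers above are probing.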