* epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.

From: Carlo Wood @ 2019-07-06  1:45 UTC
To: davem, netdev

Hi all, hopefully I'm reporting this in the right place.

While working on my C++ networking library, I ran into a problem that I
have traced back to the kernel itself. The problem is related to setting
a socket buffer size (snd/rcv).

At first I noticed the problem while trying to test the thread-safety of
my lock-free stream buffer implementation (with a std::streambuf
interface) that allows reading and writing by different threads at the
same time. In order to test this, I read data from a socket into a
buffer and write that back to the same socket from the same buffer.
This resulted in a throughput of only a few tens of kB/s, which after
investigation turned out to be caused by epoll_pwait() not returning
until 205 to 208 ms had passed, even though data was already available.

At first I thought it could be a bug in libev, which I was using as a
wrapper around epoll - so I ripped that out and wrote my own code to
interface with epoll, but with the same result. Next I changed the test
into a one-way stream: just send data from one socket to another (not
back); also with the same result.

Then I wrote a test application that was standalone, single threaded and
in C. Here I discovered that first calling connect() on a socket and
then (immediately after) setting the socket buffer sizes caused the
stall to occur. Assuming that this is not allowed, I thought I had
solved it. But going back to sending data back and forth over the TCP
connection caused the problem to return.

I again wrote a standalone, single-threaded C program that reproduces
the problem. You can find it here (I'm scared to trigger spam filters by
attaching it here, but let me know if that is OK):

https://github.com/CarloWood/ai-evio-testsuite/blob/e6e36b66256ec5d1760d2b7b96b34d60a12477aa/src/epoll_bug.c

With socket snd/rcv buffers smaller than 32kB I always get the problem,
but with 64kB (65536 bytes) I don't, on my box. I tested this with
kernels 4.15 and 5.2.0-rc7, and had someone else confirm it with 5.1.15.

When running the program as-is, it will start printing a flood of
"epoll_wait() was delayed!" messages, meaning that epoll_wait() didn't
return within 10 ms. The actual delay can be seen with, for example,
strace -T. You can also uncomment the commented printf's, which will
print the number of bytes written to and read from each socket, and
therefore the number of bytes that are still "in the pipe line".

What I observe is that the number of bytes in the pipe line is very
constant (because the test just writes until no more can be written, and
then - on the other end - reads until there is nothing left to read, all
single threaded) until suddenly that amount jumps UP to about 2.5 MB.
That is then SLOWLY drained to, say, 1.5 MB (without epoll reporting
that more can be written), at which point suddenly more can be written
and the pipe line fills up again. I have no idea where those megabytes
of data are buffered; the socket buffers are WAY smaller.

Obviously I have no idea whether this is a problem with epoll itself, or
whether there is a problem with TCP/IP for this case. But I am not a
kernel hacker, so all I could do was provide this little test case that
allows you to reproduce it.
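For reference, the "epoll_wait() was delayed!" check boils down to timing
each epoll_wait() call. A simplified sketch of such a check (illustrative
only, this is not the actual epoll_bug.c code; the helper name is made up
and epfd is assumed to be an epoll instance with both sockets registered):

    #include <stdio.h>
    #include <time.h>
    #include <sys/epoll.h>

    /* Time a single blocking epoll_wait() call and complain when it
     * takes longer than 10 ms.  Returns whatever epoll_wait() returned. */
    static int timed_epoll_wait(int epfd, struct epoll_event* events, int maxevents)
    {
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      int n = epoll_wait(epfd, events, maxevents, -1);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      long ms = (t1.tv_sec - t0.tv_sec) * 1000 +
                (t1.tv_nsec - t0.tv_nsec) / 1000000;
      if (ms > 10)
        printf("epoll_wait() was delayed! (%ld ms)\n", ms);
      return n;
    }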
--
Carlo Wood <carlo@alinoe.com>
* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.

From: Carlo Wood @ 2019-07-06 16:16 UTC
To: davem, netdev

I improved the test case a bit:

https://github.com/CarloWood/ai-evio-testsuite/blob/2a9ae49e3ae39eea7cb1d105883254370f53831b/src/epoll_bug.c

If the bug doesn't show, please increase burst_size and/or decrease
sndbuf_size and rcvbuf_size.

The output that I get with VERBOSE defined is, for example:

[...snip...]
wrote 34784 bytes to 6 (total written now 9665792), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9665792), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9700576), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9700576), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9735360), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9735360), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9770144), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9770144), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9804928), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9804928), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9839712), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9839712), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9874496), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9874496), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9909280), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9909280), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9944064), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9944064), left in pipe line: 0
wrote 34784 bytes to 6 (total written now 9978848), now in pipe line: 34784
Read 34784 bytes from fd 5 (total read now 9978848), left in pipe line: 0
wrote 21152 bytes to 6 (total written now 10000000), now in pipe line: 21152
Read 21152 bytes from fd 5 (total read now 10000000), left in pipe line: 0
epoll_wait() stalled 193 milliseconds!
Read 21888 bytes from fd 6 (total read now 70912), left in pipe line: 21888
epoll_wait() stalled 255 milliseconds!
Read 21888 bytes from fd 6 (total read now 92800), left in pipe line: 0
write(5, buf, 9907200) = 43776 (total written now 136576), now in pipe line: 43776
epoll_wait() stalled 211 milliseconds!
Read 21888 bytes from fd 6 (total read now 114688), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
write(5, buf, 9863424) = 38272 (total written now 174848), now in pipe line: 60160
Read 16384 bytes from fd 6 (total read now 131072), left in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 152960), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 174848), left in pipe line: 0
write(5, buf, 9825152) = 43776 (total written now 218624), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 196736), left in pipe line: 21888
epoll_wait() stalled 211 milliseconds!
Read 21888 bytes from fd 6 (total read now 218624), left in pipe line: 0
write(5, buf, 9781376) = 43776 (total written now 262400), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 240512), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 262400), left in pipe line: 0
write(5, buf, 9737600) = 43776 (total written now 306176), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 284288), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 306176), left in pipe line: 0
write(5, buf, 9693824) = 43776 (total written now 349952), now in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
Read 21888 bytes from fd 6 (total read now 328064), left in pipe line: 21888
epoll_wait() stalled 207 milliseconds!
write(5, buf, 9650048) = 38272 (total written now 388224), now in pipe line: 60160
Read 16384 bytes from fd 6 (total read now 344448), left in pipe line: 43776
epoll_wait() stalled 207 milliseconds!
... etc. etc.

It seems that the problem always occurs right after writing in one
direction stops, and then shows up on the way back. In the case above
burst_size is set to 10000000 bytes, and the program writes that amount
and reads it on the other side successfully. What I think is going on,
however, is that the data on the way back is stalling, during which the
"forward" burst finishes (without the epoll_wait stalling it is VERY
fast).

Above you see:

Read 21152 bytes from fd 5 (total read now 10000000), left in pipe line: 0
epoll_wait() stalled 193 milliseconds!

But since we know that the stall always seems to be 207 ms, I'm pretty
sure that it actually stalled 14 ms before that, and needed those 14 ms
to finish the complete burst in one direction.

--
Carlo Wood <carlo@alinoe.com>
* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.

From: Carlo Wood @ 2019-07-06 18:19 UTC
To: davem, netdev

While investigating this further, I read on
http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html
under "SO_RCVBUF and SO_SNDBUF Socket Options":

    When setting the size of the TCP socket receive buffer, the ordering
    of the function calls is important. This is because of TCP's window
    scale option (Section 2.6), which is exchanged with the peer on the
    SYN segments when the connection is established. For a client, this
    means the SO_RCVBUF socket option must be set before calling
    connect. For a server, this means the socket option must be set for
    the listening socket before calling listen. Setting this option for
    the connected socket will have no effect whatsoever on the possible
    window scale option because accept does not return with the
    connected socket until TCP's three-way handshake is complete. That
    is why this option must be set for the listening socket. (The sizes
    of the socket buffers are always inherited from the listening socket
    by the newly created connected socket: pp. 462–463 of TCPv2.)

As mentioned in a previous post, I had already discovered that the
socket buffers need to be set before connect, but I didn't know about
setting them before the call to listen() in order to have the buffer
sizes inherited by the accepted sockets.

After fixing this in my test program, all problems disappeared when
keeping the send and receive buffers the same on both sides.

However, when setting the send and receive buffers only on the client
socket (not on the accepted or listening socket), epoll_wait() still
stalls 43 ms per call when SO_SNDBUF is smaller than 33182 bytes.

Here is the latest version of my test program:

https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c

I have to retract most of my "bug" report; it might not really be a bug
at all... but nevertheless, what remains strange is that setting the
socket buffer sizes on the accepted sockets can have such a crippling
effect, while the quoted website states:

    Setting this option for the connected socket will have no effect
    whatsoever on the possible window scale option because accept does
    not return with the connected socket until TCP's three-way handshake
    is complete.

And when the socket buffer sizes are only set for the client socket
(which I use to send back the received data, so this is now the sending
side), why does epoll_wait() stall 43 ms per call when the receiving
side is using the default (much larger) socket buffer sizes?

That 43 ms is STILL crippling: it slows the transmission of the data
down to a trickle compared to what it should be.

--
Carlo Wood <carlo@alinoe.com>
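The ordering described in the quoted passage comes down to calling
setsockopt() before connect() on the client side and before listen() on
the server side. A minimal sketch of that ordering (illustrative only,
not code taken from epoll_bug.c; error handling omitted):

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Client: request the buffer sizes BEFORE connect(), so the window
     * scale offered in the SYN reflects the requested receive buffer. */
    int make_client(struct sockaddr_in const* addr, int bufsz)
    {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
      setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
      connect(fd, (struct sockaddr const*)addr, sizeof(*addr));
      return fd;
    }

    /* Server: set the buffer sizes on the LISTENING socket before
     * listen(); sockets returned by accept() inherit these values. */
    int make_listener(struct sockaddr_in const* addr, int bufsz)
    {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz));
      setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz));
      bind(fd, (struct sockaddr const*)addr, sizeof(*addr));
      listen(fd, 4);
      return fd;
    }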
* Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.

From: Neal Cardwell @ 2019-07-08 20:11 UTC
To: Carlo Wood; +Cc: David Miller, Netdev

On Sat, Jul 6, 2019 at 2:19 PM Carlo Wood <carlo@alinoe.com> wrote:
>
> While investigating this further, I read on
> http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html
> under "SO_RCVBUF and SO_SNDBUF Socket Options":
>
>     When setting the size of the TCP socket receive buffer, the
>     ordering of the function calls is important. This is because of
>     TCP's window scale option (Section 2.6), which is exchanged with
>     the peer on the SYN segments when the connection is established.
>     For a client, this means the SO_RCVBUF socket option must be set
>     before calling connect. For a server, this means the socket option
>     must be set for the listening socket before calling listen.
>     Setting this option for the connected socket will have no effect
>     whatsoever on the possible window scale option because accept does
>     not return with the connected socket until TCP's three-way
>     handshake is complete. That is why this option must be set for the
>     listening socket. (The sizes of the socket buffers are always
>     inherited from the listening socket by the newly created connected
>     socket: pp. 462–463 of TCPv2.)
>
> As mentioned in a previous post, I had already discovered that the
> socket buffers need to be set before connect, but I didn't know about
> setting them before the call to listen() in order to have the buffer
> sizes inherited by the accepted sockets.
>
> After fixing this in my test program, all problems disappeared when
> keeping the send and receive buffers the same on both sides.
>
> However, when setting the send and receive buffers only on the client
> socket (not on the accepted or listening socket), epoll_wait() still
> stalls 43 ms per call when SO_SNDBUF is smaller than 33182 bytes.
>
> Here is the latest version of my test program:
>
> https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c
>
> I have to retract most of my "bug" report; it might not really be a
> bug at all... but nevertheless, what remains strange is that setting
> the socket buffer sizes on the accepted sockets can have such a
> crippling effect, while the quoted website states:
>
>     Setting this option for the connected socket will have no effect
>     whatsoever on the possible window scale option because accept does
>     not return with the connected socket until TCP's three-way
>     handshake is complete.
>
> And when the socket buffer sizes are only set for the client socket
> (which I use to send back the received data, so this is now the
> sending side), why does epoll_wait() stall 43 ms per call when the
> receiving side is using the default (much larger) socket buffer sizes?
>
> That 43 ms is STILL crippling: it slows the transmission of the data
> down to a trickle compared to what it should be.

Based on the magic numbers you cite, including the fact that this test
program seems to send traffic over a loopback device (presumably
MTU=65536), epoll_wait() stalling 43 ms (slightly longer than the
typical Linux delayed ACK timer), and the problem only happening if
SO_SNDBUF is smaller than 33182 bytes (half the MTU)...
a guess would be that when you artificially make the SO_SNDBUF that low,
then you are asking the kernel to only allow your sending sockets to
buffer less than a single MTU of data, which means that the packets the
sender sends are going to be less than the MTU, which means that the
receiver may tend to eventually (after the initial quick ACKs expire)
delay its ACKs in hopes of eventually receiving a full MSS of data (see
__tcp_ack_snd_check()).

Since the SO_SNDBUF is so small in this case, the sender then would not
be able to write() or transmit anything else until the receiver sends a
delayed ACK for the data ~40ms later, allowing the sending socket to
free the previously sent data and trigger the sender's next EPOLLOUT and
write().

You could try grabbing a packet capture of the traffic over your
loopback device during the test to see if it matches that theory:

  tcpdump -i lo -s 100 -w /tmp/test.pcap

cheers,
neal
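Besides the packet capture, one way to test the delayed-ACK theory in
code is an illustrative sketch like the following (not something
suggested in the thread): it uses Linux's TCP_QUICKACK option, which
only acts as a temporary hint and therefore has to be re-armed after
every read.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Read from the socket and immediately ask the kernel to ACK
     * without delay.  If the ~40/200 ms epoll_wait() stalls disappear
     * with this in place on the receiving side, delayed ACKs were
     * indeed the cause. */
    static ssize_t read_quickack(int fd, void* buf, size_t len)
    {
      ssize_t n = read(fd, buf, len);
      int one = 1;
      setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
      return n;
    }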