* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced [not found] <20091105095947.32131.99768.stgit@rabbit.intern.cm-ag> @ 2009-11-05 10:30 ` Eric Dumazet 2009-11-05 10:57 ` Max Kellermann 0 siblings, 1 reply; 6+ messages in thread From: Eric Dumazet @ 2009-11-05 10:30 UTC (permalink / raw) To: Max Kellermann; +Cc: linux-kernel, jens.axboe, max, Linux Netdev List Max Kellermann a écrit : > When splicing a large amount of bytes from a TCP socket to a pipe > (more than PIPE_BUFFERS), splice() can block, even though the pipe was > empty. The correct behavior would be to copy as much as possible, and > return without blocking. Block only if nothing can be transferred. > When the destination pipe is (initially) writable, splice() should do > the same with or without SPLICE_F_NONBLOCK. > > The cause is the loop in tcp_splice_read(): it calls > __tcp_splice_read() (and thus skb_splice_bits() and splice_to_pipe()) > again and again, until the requested number of bytes has been > transferred, or an error occurs. In the first iteration, up to 64 kB > is copied, and the second iteration will block, because > splice_to_pipe() is called again and sees the pipe is already full. > > This patch adds SPLICE_F_NONBLOCK to the splice flags after the first > iteration has finished successfully. This prevents the second > splice_to_pipe() call from blocking. The resulting EAGAIN error is > handled gracefully, and tcp_splice_read() returns the number of bytes > successfully moved. > --- > > net/ipv4/tcp.c | 5 +++++ > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 9114524..0f8b01f 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -628,6 +628,11 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, > (sk->sk_shutdown & RCV_SHUTDOWN) || > signal_pending(current)) > break; > + > + /* the following splice_to_pipe() calls should not > + block, because we have already successfully > + transferred at least one buffer */ > + tss.flags |= SPLICE_F_NONBLOCK; > } > > release_sock(sk); > CC netdev I dont think this patch is correct. Could you describe your use case ? If you dont want to block on output pipe, you should set this NONBLOCK flag before calling splice(SPLICE_F_NONBLOCK) syscall. ie : Use the socket in blocking mode, but output pipe in non-blocking mode. Some application could have a thread working in full blocking mode, and have another thread reading the pipe (and eventually unblocking first thread) Thanks ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced 2009-11-05 10:30 ` [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced Eric Dumazet @ 2009-11-05 10:57 ` Max Kellermann 2009-11-05 11:21 ` Eric Dumazet 0 siblings, 1 reply; 6+ messages in thread From: Max Kellermann @ 2009-11-05 10:57 UTC (permalink / raw) To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List On 2009/11/05 11:30, Eric Dumazet <eric.dumazet@gmail.com> wrote: > I dont think this patch is correct. Could you describe your use case ? See my second email, there's a demo source. > If you dont want to block on output pipe, you should set this NONBLOCK > flag before calling splice(SPLICE_F_NONBLOCK) syscall. > > ie : Use the socket in blocking mode, but output pipe in non-blocking mode. Do you think that a splice() should block if the socket is readable and the pipe is writable according to select()? "The correct behavior would be to copy as much as possible, and return without blocking. Block only if nothing can be transferred." Do you disagree with that? > Some application could have a thread working in full blocking mode, > and have another thread reading the pipe (and eventually unblocking > first thread) I don't get this objection. Please explain. ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced 2009-11-05 10:57 ` Max Kellermann @ 2009-11-05 11:21 ` Eric Dumazet 2009-11-05 13:23 ` Max Kellermann 0 siblings, 1 reply; 6+ messages in thread From: Eric Dumazet @ 2009-11-05 11:21 UTC (permalink / raw) To: Max Kellermann; +Cc: linux-kernel, jens.axboe, Linux Netdev List Max Kellermann a écrit : > On 2009/11/05 11:30, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> I dont think this patch is correct. Could you describe your use case ? > > See my second email, there's a demo source. > >> If you dont want to block on output pipe, you should set this NONBLOCK >> flag before calling splice(SPLICE_F_NONBLOCK) syscall. >> >> ie : Use the socket in blocking mode, but output pipe in non-blocking mode. > > Do you think that a splice() should block if the socket is readable > and the pipe is writable according to select()? > Yes, this is perfectly legal select() can return "OK to write on fd", and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY If you want to not block on fd, use O_NDELAY (if using write() syscall), or SPLICE_F_NONBLOCK splice() flag ? Please read recent commit on this area and why I think your patch conflicts with this commit. commit 42324c62704365d6a3e89138dea55909d2f26afe Author: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu Oct 1 15:26:00 2009 -0700 net: splice() from tcp to pipe should take into account O_NONBLOCK tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag Before this patch : splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); causes a random endless block (if pipe is full) and splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK); will return 0 immediately if the TCP buffer is empty. User application has no way to instruct splice() that socket should be in blocking mode but pipe in nonblock more. Many projects cannot use splice(tcp -> pipe) because of this flaw. http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html Linus introduced SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d (splice: add SPLICE_F_NONBLOCK flag ) It doesn't make the splice itself necessarily nonblocking (because the actual file descriptors that are spliced from/to may block unless they have the O_NONBLOCK flag set), but it makes the splice pipe operations nonblocking. Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK flag, as other socket operations do. Users will then call : splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); to block on data coming from socket (if file is in blocking mode), and not block on pipe output (to avoid deadlock) First version of this patch was submitted by Octavian Purdila Reported-by: Volker Lendecke <vl@samba.org> Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced 2009-11-05 11:21 ` Eric Dumazet @ 2009-11-05 13:23 ` Max Kellermann 2009-11-05 14:11 ` Eric Dumazet 0 siblings, 1 reply; 6+ messages in thread From: Max Kellermann @ 2009-11-05 13:23 UTC (permalink / raw) To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List On 2009/11/05 12:21, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Max Kellermann a écrit : > > Do you think that a splice() should block if the socket is readable > > and the pipe is writable according to select()? > > > > Yes, this is perfectly legal > > select() can return "OK to write on fd", > and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY From the select() manpage: "those in writefds will be watched to see if a write will not block" From the poll() manpage: "Writing now will not block." This looks unambiguous to me, and contradicts with your thesis. Can you provide sources? What is your interpretation of the guarantees provided by select() and poll()? Which byte count is "ok" to write after POLLOUT, and how much is "too much"? How does the application know? > Please read recent commit on this area and why I think your patch > conflicts with this commit. I understand your patch, but I don't understand the conflict with my patch. Can you describe a breakage caused by my patch? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced 2009-11-05 13:23 ` Max Kellermann @ 2009-11-05 14:11 ` Eric Dumazet 2009-11-05 14:33 ` Max Kellermann 0 siblings, 1 reply; 6+ messages in thread From: Eric Dumazet @ 2009-11-05 14:11 UTC (permalink / raw) To: Max Kellermann; +Cc: linux-kernel, jens.axboe, Linux Netdev List Max Kellermann a écrit : > On 2009/11/05 12:21, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> Max Kellermann a écrit : >>> Do you think that a splice() should block if the socket is readable >>> and the pipe is writable according to select()? >>> >> Yes, this is perfectly legal >> >> select() can return "OK to write on fd", >> and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY > >>From the select() manpage: "those in writefds will be watched to see > if a write will not block" > >>From the poll() manpage: "Writing now will not block." > > This looks unambiguous to me, and contradicts with your thesis. Can > you provide sources? > > What is your interpretation of the guarantees provided by select() and > poll()? Which byte count is "ok" to write after POLLOUT, and how much > is "too much"? How does the application know? It cannot, therefore an application uses O_NDELAY to avoid blocking. Try following program if you are not convinced #include <unistd.h> #include <sys/poll.h> #include <stdio.h> char buffer[1000000]; int main(int argc, char *argv[]) { int fds[2]; struct pollfd pfd; int res; pipe(fds); pfd.fd = fds[1]; pfd.events = POLLOUT; res = poll(&pfd, 1, -1); if (res > 0 && pfd.revents & POLLOUT) printf("OK to write on pipe\n"); write(fds[1], buffer, sizeof(buffer)); // why it blocks, did poll() lied ??? return 0; } > I understand your patch, but I don't understand the conflict with my > patch. Can you describe a breakage caused by my patch? I only pointed out that using splice(tcp -> pipe) and blocking on pipe _can_ block, even on _first_ frame received from tcp, as you discovered. Your only choices to avoid a deadlock are : 1) to use SPLICE_F_NONBLOCK. 2) Using a second thread to read the pipe and empty it. First thread will happily transfert 1000000 bytes in one syscall... 3) or limit your splice(... len, flags) length to 16 (16 buffers of one byte in pathological cases) Your patch basically makes SPLICE_F_NONBLOCK option always set (choice 1) above) So users wanting option 3) are stuck. You force them to use a poll()/select() thing while they dont want to poll : They have a producer thread(s), and a consumer thread(s). producer() { while (1) splice(tcp, &offset, pfds[1], NULL, 10000000, SPLICE_F_MORE | SPLICE_F_MOVE); } Why in the first place have an option if it is always set ? ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced 2009-11-05 14:11 ` Eric Dumazet @ 2009-11-05 14:33 ` Max Kellermann 0 siblings, 0 replies; 6+ messages in thread From: Max Kellermann @ 2009-11-05 14:33 UTC (permalink / raw) To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List On 2009/11/05 15:11, Eric Dumazet <eric.dumazet@gmail.com> wrote: > It cannot, therefore an application uses O_NDELAY to avoid blocking. > > Try following program if you are not convinced Indeed, I'm surprised by the result rendered by the Linux kernel. That however still contradicts with the poll() documentation. So this boils down to the question: kernel bug or documentation bug? http://www.opengroup.org/onlinepubs/9699919799/functions/pselect.html "A descriptor shall be considered ready for writing when a call to an output function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully." There is no size limit mentioned here. Your program reveals that the kernel violates this definition. > Your patch basically makes SPLICE_F_NONBLOCK option always set > (choice 1) above) [...] > Why in the first place have an option if it is always set ? It is not, you misunderstood my patch. If there's no room in the pipe buffer, then the first iteration of the "while" loop will block (as usual). *After* the first iteration has finished (and at least one buffer has been moved already), the flag is set, and further iterations will not block. ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2009-11-05 14:33 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20091105095947.32131.99768.stgit@rabbit.intern.cm-ag>
2009-11-05 10:30 ` [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced Eric Dumazet
2009-11-05 10:57 ` Max Kellermann
2009-11-05 11:21 ` Eric Dumazet
2009-11-05 13:23 ` Max Kellermann
2009-11-05 14:11 ` Eric Dumazet
2009-11-05 14:33 ` Max Kellermann
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).