Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
       [not found] <20091105095947.32131.99768.stgit@rabbit.intern.cm-ag>
@ 2009-11-05 10:30 ` Eric Dumazet
  2009-11-05 10:57   ` Max Kellermann
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2009-11-05 10:30 UTC (permalink / raw)
  To: Max Kellermann; +Cc: linux-kernel, jens.axboe, max, Linux Netdev List

Max Kellermann a écrit :
> When splicing a large amount of bytes from a TCP socket to a pipe
> (more than PIPE_BUFFERS), splice() can block, even though the pipe was
> empty.  The correct behavior would be to copy as much as possible, and
> return without blocking.  Block only if nothing can be transferred.
> When the destination pipe is (initially) writable, splice() should do
> the same with or without SPLICE_F_NONBLOCK.
> 
> The cause is the loop in tcp_splice_read(): it calls
> __tcp_splice_read() (and thus skb_splice_bits() and splice_to_pipe())
> again and again, until the requested number of bytes has been
> transferred, or an error occurs.  In the first iteration, up to 64 kB
> is copied, and the second iteration will block, because
> splice_to_pipe() is called again and sees the pipe is already full.
> 
> This patch adds SPLICE_F_NONBLOCK to the splice flags after the first
> iteration has finished successfully.  This prevents the second
> splice_to_pipe() call from blocking.  The resulting EAGAIN error is
> handled gracefully, and tcp_splice_read() returns the number of bytes
> successfully moved.
> ---
> 
>  net/ipv4/tcp.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9114524..0f8b01f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -628,6 +628,11 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>  		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
>  		    signal_pending(current))
>  			break;
> +
> +		/* the following splice_to_pipe() calls should not
> +		   block, because we have already successfully
> +		   transferred at least one buffer */
> +		tss.flags |= SPLICE_F_NONBLOCK;
>  	}
>  
>  	release_sock(sk);
> 

CC netdev

I dont think this patch is correct. Could you describe your use case ?

If you dont want to block on output pipe, you should set this NONBLOCK 
flag before calling splice(SPLICE_F_NONBLOCK) syscall.

ie : Use the socket in blocking mode, but output pipe in non-blocking mode.

Some application could have a thread working in full blocking mode, and have another
thread reading the pipe (and eventually unblocking first thread)

Thanks

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
  2009-11-05 10:30 ` [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced Eric Dumazet
@ 2009-11-05 10:57   ` Max Kellermann
  2009-11-05 11:21     ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Max Kellermann @ 2009-11-05 10:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List

On 2009/11/05 11:30, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I dont think this patch is correct. Could you describe your use case ?

See my second email, there's a demo source.

> If you dont want to block on output pipe, you should set this NONBLOCK 
> flag before calling splice(SPLICE_F_NONBLOCK) syscall.
> 
> ie : Use the socket in blocking mode, but output pipe in non-blocking mode.

Do you think that a splice() should block if the socket is readable
and the pipe is writable according to select()?

"The correct behavior would be to copy as much as possible, and return
without blocking.  Block only if nothing can be transferred."

Do you disagree with that?

> Some application could have a thread working in full blocking mode,
> and have another thread reading the pipe (and eventually unblocking
> first thread)

I don't get this objection.  Please explain.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
  2009-11-05 10:57   ` Max Kellermann
@ 2009-11-05 11:21     ` Eric Dumazet
  2009-11-05 13:23       ` Max Kellermann
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2009-11-05 11:21 UTC (permalink / raw)
  To: Max Kellermann; +Cc: linux-kernel, jens.axboe, Linux Netdev List

Max Kellermann a écrit :
> On 2009/11/05 11:30, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> I dont think this patch is correct. Could you describe your use case ?
> 
> See my second email, there's a demo source.
> 
>> If you dont want to block on output pipe, you should set this NONBLOCK 
>> flag before calling splice(SPLICE_F_NONBLOCK) syscall.
>>
>> ie : Use the socket in blocking mode, but output pipe in non-blocking mode.
> 
> Do you think that a splice() should block if the socket is readable
> and the pipe is writable according to select()?
> 

Yes, this is perfectly legal

select() can return "OK to write on fd",
and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY

If you want to not block on fd, use O_NDELAY (if using write() syscall),
or SPLICE_F_NONBLOCK splice() flag ?

Please read recent commit on this area and why I think your patch conflicts with
this commit.

commit 42324c62704365d6a3e89138dea55909d2f26afe
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Oct 1 15:26:00 2009 -0700

    net: splice() from tcp to pipe should take into account O_NONBLOCK

    tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag

    Before this patch :

    splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE);
    causes a random endless block (if pipe is full) and
    splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    will return 0 immediately if the TCP buffer is empty.

    User application has no way to instruct splice() that socket should be in blocking mode
    but pipe in nonblock more.

    Many projects cannot use splice(tcp -> pipe) because of this flaw.

    http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
    http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html

    Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
    (splice: add SPLICE_F_NONBLOCK flag )

      It doesn't make the splice itself necessarily nonblocking (because the
      actual file descriptors that are spliced from/to may block unless they
      have the O_NONBLOCK flag set), but it makes the splice pipe operations
      nonblocking.

    Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only

    This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
    flag, as other socket operations do.
    Users will then call :

    splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK );

    to block on data coming from socket (if file is in blocking mode),
    and not block on pipe output (to avoid deadlock)

    First version of this patch was submitted by Octavian Purdila

    Reported-by: Volker Lendecke <vl@samba.org>
    Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Acked-by: Jens Axboe <jens.axboe@oracle.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
  2009-11-05 11:21     ` Eric Dumazet
@ 2009-11-05 13:23       ` Max Kellermann
  2009-11-05 14:11         ` Eric Dumazet
  0 siblings, 1 reply; 6+ messages in thread
From: Max Kellermann @ 2009-11-05 13:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List

On 2009/11/05 12:21, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Max Kellermann a écrit :
> > Do you think that a splice() should block if the socket is readable
> > and the pipe is writable according to select()?
> > 
> 
> Yes, this is perfectly legal
> 
> select() can return "OK to write on fd",
> and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY

From the select() manpage: "those in writefds will be watched to see
if a write will not block"

From the poll() manpage: "Writing now will not block."

This looks unambiguous to me, and contradicts with your thesis.  Can
you provide sources?

What is your interpretation of the guarantees provided by select() and
poll()?  Which byte count is "ok" to write after POLLOUT, and how much
is "too much"?  How does the application know?

> Please read recent commit on this area and why I think your patch
> conflicts with this commit.

I understand your patch, but I don't understand the conflict with my
patch.  Can you describe a breakage caused by my patch?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
  2009-11-05 13:23       ` Max Kellermann
@ 2009-11-05 14:11         ` Eric Dumazet
  2009-11-05 14:33           ` Max Kellermann
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2009-11-05 14:11 UTC (permalink / raw)
  To: Max Kellermann; +Cc: linux-kernel, jens.axboe, Linux Netdev List

Max Kellermann a écrit :
> On 2009/11/05 12:21, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Max Kellermann a écrit :
>>> Do you think that a splice() should block if the socket is readable
>>> and the pipe is writable according to select()?
>>>
>> Yes, this is perfectly legal
>>
>> select() can return "OK to write on fd",
>> and still, write(fd, buffer, 10000000) is supposer/allowed to block if fd is not O_NDELAY
> 
>>From the select() manpage: "those in writefds will be watched to see
> if a write will not block"
> 
>>From the poll() manpage: "Writing now will not block."
> 
> This looks unambiguous to me, and contradicts with your thesis.  Can
> you provide sources?
> 
> What is your interpretation of the guarantees provided by select() and
> poll()?  Which byte count is "ok" to write after POLLOUT, and how much
> is "too much"?  How does the application know?

It cannot, therefore an application uses O_NDELAY to avoid blocking.

Try following program if you are not convinced

#include <unistd.h>
#include <sys/poll.h>
#include <stdio.h>

char buffer[1000000];

int main(int argc, char *argv[])
{
	int fds[2];
	struct pollfd pfd;
	int res;
	
	pipe(fds);
	pfd.fd = fds[1];
	pfd.events = POLLOUT;
	res = poll(&pfd, 1, -1);
	if (res > 0 && pfd.revents & POLLOUT)
		printf("OK to write on pipe\n");
	write(fds[1], buffer, sizeof(buffer)); // why it blocks, did poll() lied ???
	return 0;
}



> I understand your patch, but I don't understand the conflict with my
> patch.  Can you describe a breakage caused by my patch?

I only pointed out that using splice(tcp -> pipe) and blocking on pipe
_can_ block, even on _first_ frame received from tcp, as you discovered.


Your only choices to avoid a deadlock are :
1) to use SPLICE_F_NONBLOCK.
2) Using a second thread to read the pipe and empty it. First thread will
   happily transfert 1000000 bytes in one syscall...
3) or limit your splice(... len, flags) length to 16 (16 buffers of one byte
   in pathological cases)

Your patch basically makes SPLICE_F_NONBLOCK option always set (choice 1) above)

So users wanting option 3) are stuck. You force them to use a poll()/select()
thing while they dont want to poll : They have a producer thread(s), and a consumer
thread(s).

producer()
{
	while (1)
		splice(tcp, &offset, pfds[1], NULL, 10000000,
		       SPLICE_F_MORE | SPLICE_F_MOVE);
}

Why in the first place have an option if it is always set ?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced
  2009-11-05 14:11         ` Eric Dumazet
@ 2009-11-05 14:33           ` Max Kellermann
  0 siblings, 0 replies; 6+ messages in thread
From: Max Kellermann @ 2009-11-05 14:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, jens.axboe, Linux Netdev List

On 2009/11/05 15:11, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> It cannot, therefore an application uses O_NDELAY to avoid blocking.
> 
> Try following program if you are not convinced

Indeed, I'm surprised by the result rendered by the Linux kernel.
That however still contradicts with the poll() documentation.

So this boils down to the question: kernel bug or documentation bug?

http://www.opengroup.org/onlinepubs/9699919799/functions/pselect.html

"A descriptor shall be considered ready for writing when a call to an
output function with O_NONBLOCK clear would not block, whether or not
the function would transfer data successfully."

There is no size limit mentioned here.  Your program reveals that the
kernel violates this definition.

> Your patch basically makes SPLICE_F_NONBLOCK option always set
> (choice 1) above)
[...]
> Why in the first place have an option if it is always set ?

It is not, you misunderstood my patch.  If there's no room in the pipe
buffer, then the first iteration of the "while" loop will block (as
usual).  *After* the first iteration has finished (and at least one
buffer has been moved already), the flag is set, and further
iterations will not block.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-11-05 14:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20091105095947.32131.99768.stgit@rabbit.intern.cm-ag>
2009-11-05 10:30 ` [PATCH] tcp: set SPLICE_F_NONBLOCK after first buffer has been spliced Eric Dumazet
2009-11-05 10:57   ` Max Kellermann
2009-11-05 11:21     ` Eric Dumazet
2009-11-05 13:23       ` Max Kellermann
2009-11-05 14:11         ` Eric Dumazet
2009-11-05 14:33           ` Max Kellermann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).