All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] pipe: don't block after data has been written
@ 2009-11-05 15:31 Max Kellermann
  2009-11-05 16:20 ` Américo Wang
  2009-11-05 16:27 ` Eric Dumazet
  0 siblings, 2 replies; 8+ messages in thread
From: Max Kellermann @ 2009-11-05 15:31 UTC (permalink / raw)
  To: linux-kernel; +Cc: eric.dumazet, mk

According to the select() / poll() documentation, a write operation on
a file descriptor which is "ready for writing" must not block.  Linux
violates this rule: if you pass a very large buffer to write(), the
system call will not return until everything is written, or an error
occurs.

This patch adds a simple check: if at least one byte has already been
written, break from the loop, instead of calling pipe_wait().

Signed-off-by: Max Kellermann <max@duempel.org>
---

 fs/pipe.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index ae17d02..9d84f0b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -582,7 +582,7 @@ redo2:
 		}
 		if (bufs < PIPE_BUFFERS)
 			continue;
-		if (filp->f_flags & O_NONBLOCK) {
+		if (filp->f_flags & O_NONBLOCK || ret > 0) {
 			if (!ret)
 				ret = -EAGAIN;
 			break;


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 15:31 [PATCH] pipe: don't block after data has been written Max Kellermann
@ 2009-11-05 16:20 ` Américo Wang
  2009-11-05 16:25   ` Max Kellermann
  2009-11-05 16:27 ` Eric Dumazet
  1 sibling, 1 reply; 8+ messages in thread
From: Américo Wang @ 2009-11-05 16:20 UTC (permalink / raw)
  To: Max Kellermann; +Cc: linux-kernel, eric.dumazet, mk

On Thu, Nov 05, 2009 at 04:31:47PM +0100, Max Kellermann wrote:
>According to the select() / poll() documentation, a write operation on
>a file descriptor which is "ready for writing" must not block.  Linux
>violates this rule: if you pass a very large buffer to write(), the
>system call will not return until everything is written, or an error
>occurs.
>
>This patch adds a simple check: if at least one byte has already been
>written, break from the loop, instead of calling pipe_wait().

Do you have any working test-case for this?

Thanks.

>
>Signed-off-by: Max Kellermann <max@duempel.org>
>---
>
> fs/pipe.c |    2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/fs/pipe.c b/fs/pipe.c
>index ae17d02..9d84f0b 100644
>--- a/fs/pipe.c
>+++ b/fs/pipe.c
>@@ -582,7 +582,7 @@ redo2:
> 		}
> 		if (bufs < PIPE_BUFFERS)
> 			continue;
>-		if (filp->f_flags & O_NONBLOCK) {
>+		if (filp->f_flags & O_NONBLOCK || ret > 0) {
> 			if (!ret)
> 				ret = -EAGAIN;
> 			break;
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/

-- 
Live like a child, think like the god.
 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 16:20 ` Américo Wang
@ 2009-11-05 16:25   ` Max Kellermann
  0 siblings, 0 replies; 8+ messages in thread
From: Max Kellermann @ 2009-11-05 16:25 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, eric.dumazet

On 2009/11/05 17:20, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Nov 05, 2009 at 04:31:47PM +0100, Max Kellermann wrote:
> >According to the select() / poll() documentation, a write operation on
> >a file descriptor which is "ready for writing" must not block.  Linux
> >violates this rule: if you pass a very large buffer to write(), the
> >system call will not return until everything is written, or an error
> >occurs.
> >
> >This patch adds a simple check: if at least one byte has already been
> >written, break from the loop, instead of calling pipe_wait().
> 
> Do you have any working test-case for this?

Eric Dumazet posted a program earlier today (in response to my
pipe/splice patch):

 http://lkml.org/lkml/2009/11/5/144

With this patch applied, it runs correctly (without it, the write()
blocks until the consumer has read everything):

 poll([{fd=4, events=POLLOUT}], 1, -1)   = 1 ([{fd=4, revents=POLLOUT}])
 [...]
 write(4, "\0\0\0\0"..., 1000000) = 65536

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 15:31 [PATCH] pipe: don't block after data has been written Max Kellermann
  2009-11-05 16:20 ` Américo Wang
@ 2009-11-05 16:27 ` Eric Dumazet
  2009-11-05 16:36   ` Max Kellermann
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2009-11-05 16:27 UTC (permalink / raw)
  To: Max Kellermann; +Cc: linux-kernel, mk

Max Kellermann a écrit :
> According to the select() / poll() documentation, a write operation on
> a file descriptor which is "ready for writing" must not block.  Linux
> violates this rule: if you pass a very large buffer to write(), the
> system call will not return until everything is written, or an error
> occurs.
> 
> This patch adds a simple check: if at least one byte has already been
> written, break from the loop, instead of calling pipe_wait().
> 
> Signed-off-by: Max Kellermann <max@duempel.org>
> ---
> 
>  fs/pipe.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index ae17d02..9d84f0b 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -582,7 +582,7 @@ redo2:
>  		}
>  		if (bufs < PIPE_BUFFERS)
>  			continue;
> -		if (filp->f_flags & O_NONBLOCK) {
> +		if (filp->f_flags & O_NONBLOCK || ret > 0) {
>  			if (!ret)
>  				ret = -EAGAIN;
>  			break;
> 

Then select()/poll() documentation is wrong, please correct documentation ?

http://www.opengroup.org/onlinepubs/000095399/functions/write.html

	ssize_t write(int fildes, const void *buf, size_t nbyte);

	If the O_NONBLOCK flag is clear, a write request may cause the thread to block,
	but on normal completion it shall return nbyte.

Every Unix I know behaves the same when writing to a pipe.


Your patch breaks many programs, that dont use poll()/select()

char result[1000000];
main()
{
	computethings();
	write(1, buffer, 1000000);
}


$ ./program | more

Please learn how useful O_NDELAY can be in a poll()/select() environment.

Thanks

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 16:27 ` Eric Dumazet
@ 2009-11-05 16:36   ` Max Kellermann
  2009-11-05 16:37     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Max Kellermann @ 2009-11-05 16:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, mk

On 2009/11/05 17:27, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Your patch breaks many programs, that dont use poll()/select()
> 
> char result[1000000];
> main()
> {
> 	computethings();
> 	write(1, buffer, 1000000);
> }

Your code does not check the return value of write().  This is a bug.

So how exactly does my patch break this program?

Let's read some more of the manual you cited: "If a write() requests
that more bytes be written than there is room for (for example, the
process' file size limit or the physical end of a medium), only as
many bytes as there is room for shall be written."

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 16:36   ` Max Kellermann
@ 2009-11-05 16:37     ` Eric Dumazet
  2009-11-05 17:13       ` Zan Lynx
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2009-11-05 16:37 UTC (permalink / raw)
  To: Eric Dumazet, linux-kernel, mk

Max Kellermann a écrit :
> On 2009/11/05 17:27, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Your patch breaks many programs, that dont use poll()/select()
>>
>> char result[1000000];
>> main()
>> {
>> 	computethings();
>> 	write(1, buffer, 1000000);
>> }
> 
> Your code does not check the return value of write().  This is a bug.
> 

Welcome to real world.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 16:37     ` Eric Dumazet
@ 2009-11-05 17:13       ` Zan Lynx
  2009-11-05 18:32         ` Alan Cox
  0 siblings, 1 reply; 8+ messages in thread
From: Zan Lynx @ 2009-11-05 17:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, mk

On 11/5/09 9:37 AM, Eric Dumazet wrote:
> Max Kellermann a écrit :
>> On 2009/11/05 17:27, Eric Dumazet<eric.dumazet@gmail.com>  wrote:
>>> Your patch breaks many programs, that dont use poll()/select()
>>>
>>> char result[1000000];
>>> main()
>>> {
>>> 	computethings();
>>> 	write(1, buffer, 1000000);
>>> }
>>
>> Your code does not check the return value of write().  This is a bug.
>>
>
> Welcome to real world.

Yes in the real world there are bugs. The decision is to choose which 
bug you are going to expose. If it was my decision I would make the code 
work as documented, as Max wants to do.

I remember many years ago needing to fix some inetd-called server code 
that got unexpected partial writes on blocking sockets. It was either 
Solaris or HP/UX. So this is nothing new.

In fact I think that Linux will already do short writes if a signal is 
received without restart set for the handler. I found several bugs last 
year in glibc and libstdc++ fwrite and iostreams regarding that.
-- 
Zan Lynx
zlynx@acm.org

"Knowledge is Power.  Power Corrupts.  Study Hard.  Be Evil."

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] pipe: don't block after data has been written
  2009-11-05 17:13       ` Zan Lynx
@ 2009-11-05 18:32         ` Alan Cox
  0 siblings, 0 replies; 8+ messages in thread
From: Alan Cox @ 2009-11-05 18:32 UTC (permalink / raw)
  To: Zan Lynx; +Cc: Eric Dumazet, linux-kernel, mk

> > Welcome to real world.
> 
> Yes in the real world there are bugs. The decision is to choose which 
> bug you are going to expose. If it was my decision I would make the code 
> work as documented, as Max wants to do.

Outside of academia the reality is fairly simple. A system needs to
behave according to the expected behaviour. That is a mix of things
- Standards
- Extrapolation (applying the logic of the standard to cases beyond it)
- Tradition (things that used to work still work)

If you like: How it is defined to work, how it is expected to work and how
it worked last year.

Tradition is a suprisingly large part of it. In the unix world that
tradition includes things like "signals do not interrupt disk I/O writes
causing short writes".

Pipes however is pretty much pure standards behaviour

In blocking mode they block
In non-blocking mode they don't block

Furthermore there are specific rules about writes under a certain size
always occurring in an atomic manner.

> In fact I think that Linux will already do short writes if a signal is 
> received without restart set for the handler. I found several bugs last 
> year in glibc and libstdc++ fwrite and iostreams regarding that.

The kernel takes great pains not to do this in the cases where tradition
dictates otherwise (notably in disk I/O)

Alan

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-11-05 18:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-05 15:31 [PATCH] pipe: don't block after data has been written Max Kellermann
2009-11-05 16:20 ` Américo Wang
2009-11-05 16:25   ` Max Kellermann
2009-11-05 16:27 ` Eric Dumazet
2009-11-05 16:36   ` Max Kellermann
2009-11-05 16:37     ` Eric Dumazet
2009-11-05 17:13       ` Zan Lynx
2009-11-05 18:32         ` Alan Cox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.