From: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: acrichton@mozilla.com, Christian Brauner <christian@brauner.io>,
David Howells <dhowells@redhat.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
keyrings@vger.kernel.org, Linux API <linux-api@vger.kernel.org>,
linux-block <linux-block@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Rasmus Villemoes <linux@rasmusvillemoes.dk>,
LSM List <linux-security-module@vger.kernel.org>,
linux-usb@vger.kernel.org,
Nicolas Dichtel <nicolas.dichtel@6wind.com>,
Peter Zijlstra <peterz@infradead.org>,
Ian Kent <raven@themaw.net>
Subject: Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang
Date: Wed, 04 Aug 2021 15:48:36 -0400 [thread overview]
Message-ID: <1628105897.vb3ko0vb06.none@localhost> (raw)
In-Reply-To: <CAHk-=wiLr55zHUWNzmp3DeoO0DUaYp7vAzQB5KUCni5FpwC7Uw@mail.gmail.com>
Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
>
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@yahoo.ca> wrote:
>>
>> pipe(pipefd);
>> printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>> printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
>
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
>
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
>
> Don't do that then.
>
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
>
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
>
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
>
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
>
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
>
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
>
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
>
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
>
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
>
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
>
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
>
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
>
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
>
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
>
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
>
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
>
> Linus
>
I agree that if this only affects programs which intentionally adjust
the pipe buffer size, then it is not a huge issue. The problem,
admittedly buried very close to the bottom of my email, is that the
kernel will silently provide one-page pipe buffers if the user has
exceeded 16384 (by default) pipe buffer pages allocated. Try this
slightly more complicated program:
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
static int pipefd[2];
void *thread_start(void *arg) {
char buf[1];
int i;
for (i = 0; i < 1000000; i++) {
read(pipefd[0], buf, sizeof(buf));
if (write(pipefd[1], buf, sizeof(buf)) == -1)
break;
}
printf("done %d\n", i);
return NULL;
}
int main() {
signal(SIGPIPE, SIG_IGN);
// pretend there are a very large number of make processes running
for (int i = 0; i < 1025; i++)
{
pipe(pipefd);
write(pipefd[1], "aa", 2);
}
printf("%d %d\n", pipefd[0], pipefd[1]);
printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
//printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
pthread_t thread1, thread2;
pthread_create(&thread1, NULL, thread_start, NULL);
pthread_create(&thread2, NULL, thread_start, NULL);
sleep(3);
close(pipefd[0]);
close(pipefd[1]);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
}
You may need to raise your RLIMIT_NOFILE before running the program.
With default settings (fs.pipe-user-pages-soft = 16384) the init buffer
will be 4096, no mangling required. I think this could be probably be
solved if the kernel instead reduced over-quota pipes to two pages
instead of one page. If someone wants to set a one-page pipe buffer,
then they can suffer the consequences, but the kernel shouldn't silently
hand people that footgun.
Regards,
Alex.
next prev parent reply other threads:[~2021-08-04 19:48 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <1628086770.5rn8p04n6j.none.ref@localhost>
2021-08-04 15:37 ` [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang Alex Xu (Hello71)
2021-08-04 16:31 ` Linus Torvalds
2021-08-04 19:48 ` Alex Xu (Hello71) [this message]
2021-08-04 20:04 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1628105897.vb3ko0vb06.none@localhost \
--to=alex_y_xu@yahoo.ca \
--cc=acrichton@mozilla.com \
--cc=christian@brauner.io \
--cc=dhowells@redhat.com \
--cc=gregkh@linuxfoundation.org \
--cc=keyrings@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-security-module@vger.kernel.org \
--cc=linux-usb@vger.kernel.org \
--cc=linux@rasmusvillemoes.dk \
--cc=nicolas.dichtel@6wind.com \
--cc=peterz@infradead.org \
--cc=raven@themaw.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.