* [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer
@ 2025-02-09 15:07 Oleg Nesterov
2025-02-09 15:07 ` [PATCH 1/2] " Oleg Nesterov
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 15:07 UTC (permalink / raw)
To: Christian Brauner, Jeff Layton, Linus Torvalds
Cc: David Howells, Gautham R. Shenoy, K Prateek Nayak, Mateusz Guzik,
Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal, WangYuli,
linux-fsdevel, linux-kernel
Please review.
pipe_write() can insert the empty buffer and this looks very confusing
to me. Because it looks obviously unnecessary and complicates the code.
In fact this logic doesn't even look strictly correct. For example,
eat_empty_buffer() simply updates pipe->tail but (unlike pipe_read) it
doesn't wake the writers.
But is pipe_write() is the only possible source of buf->len == 0 ?
I am not 100% sure, fs/splice.c is very nontrivial, but it seems that
this code at least tries to not add a zero-sized buf into pipe->bufs.
Oleg.
---
fs/pipe.c | 47 ++++++++++-------------------------------------
fs/splice.c | 14 ++++++--------
include/linux/pipe_fs_i.h | 6 ++++++
3 files changed, 22 insertions(+), 45 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 15:07 [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
@ 2025-02-09 15:07 ` Oleg Nesterov
2025-02-09 17:26 ` Linus Torvalds
2025-02-09 15:08 ` [PATCH 2/2] splice: add some pipe_buf_assert_len() checks Oleg Nesterov
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 15:07 UTC (permalink / raw)
To: Christian Brauner, Jeff Layton, Linus Torvalds
Cc: David Howells, Gautham R. Shenoy, K Prateek Nayak, Mateusz Guzik,
Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal, WangYuli,
linux-fsdevel, linux-kernel
a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
changed pipe_write() to increment pipe->head in advance. IIUC to avoid the
race with the post_one_notification()-like code which can add another buffer
under pipe->rd_wait.lock without pipe->mutex.
This is no longer necessary after c73be61cede5 ("pipe: Add general notification
queue support"), pipe_write() checks pipe_has_watch_queue() and returns -EXDEV
at the start. And can't help in any case, pipe_write() no longer takes this
rd_wait.lock spinlock.
Change pipe_write() to call copy_page_from_iter() first and do nothing if it
fails. This way pipe_write() can't add a zero-sized buffer and we can simplify
pipe_read() which currently has to take care of this very unlikely case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/pipe.c | 47 +++++++++------------------------------
include/linux/pipe_fs_i.h | 6 +++++
2 files changed, 16 insertions(+), 37 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index 2ae75adfba64..7f24d707f6a1 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -303,7 +303,7 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
if (!pipe_empty(head, tail)) {
struct pipe_buffer *buf = &pipe->bufs[tail & mask];
- size_t chars = buf->len;
+ size_t chars = pipe_buf_assert_len(buf);
size_t written;
int error;
@@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
break;
}
mutex_unlock(&pipe->mutex);
-
/*
* We only get here if we didn't actually read anything.
*
- * However, we could have seen (and removed) a zero-sized
- * pipe buffer, and might have made space in the buffers
- * that way.
- *
- * You can't make zero-sized pipe buffers by doing an empty
- * write (not even in packet mode), but they can happen if
- * the writer gets an EFAULT when trying to fill a buffer
- * that already got allocated and inserted in the buffer
- * array.
- *
- * So we still need to wake up any pending writers in the
- * _very_ unlikely case that the pipe was full, but we got
- * no data.
- */
- if (unlikely(wake_writer))
- wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
- kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
-
- /*
* But because we didn't read anything, at this point we can
* just return directly with -ERESTARTSYS if we're interrupted,
* since we've done any required wakeups and there's no need
@@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
return -ERESTARTSYS;
- wake_writer = false;
wake_next_reader = true;
mutex_lock(&pipe->mutex);
}
@@ -526,33 +505,27 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
pipe->tmp_page = page;
}
- /* Allocate a slot in the ring in advance and attach an
- * empty buffer. If we fault or otherwise fail to use
- * it, either the reader will consume it or it'll still
- * be there for the next write.
- */
- pipe->head = head + 1;
+ copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
+ if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
+ if (!ret)
+ ret = -EFAULT;
+ break;
+ }
+ pipe->head = head + 1;
+ pipe->tmp_page = NULL;
/* Insert it into the buffer array */
buf = &pipe->bufs[head & mask];
buf->page = page;
buf->ops = &anon_pipe_buf_ops;
buf->offset = 0;
- buf->len = 0;
if (is_packetized(filp))
buf->flags = PIPE_BUF_FLAG_PACKET;
else
buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
- pipe->tmp_page = NULL;
- copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
- if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
- if (!ret)
- ret = -EFAULT;
- break;
- }
- ret += copied;
buf->len = copied;
+ ret += copied;
if (!iov_iter_count(from))
break;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..4174429a3e0e 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -31,6 +31,12 @@ struct pipe_buffer {
unsigned long private;
};
+static inline unsigned int pipe_buf_assert_len(struct pipe_buffer *buf)
+{
+ WARN_ON_ONCE(!buf->len);
+ return buf->len;
+}
+
/**
* struct pipe_inode_info - a linux kernel pipe
* @mutex: mutex protecting the whole thing
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/2] splice: add some pipe_buf_assert_len() checks
2025-02-09 15:07 [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
2025-02-09 15:07 ` [PATCH 1/2] " Oleg Nesterov
@ 2025-02-09 15:08 ` Oleg Nesterov
2025-02-10 11:40 ` [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
2025-02-10 12:52 ` [PATCH 0/2] " Christian Brauner
3 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 15:08 UTC (permalink / raw)
To: Christian Brauner, Jeff Layton, Linus Torvalds
Cc: David Howells, Gautham R. Shenoy, K Prateek Nayak, Mateusz Guzik,
Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal, WangYuli,
linux-fsdevel, linux-kernel
After the previous patch the readers can't (hopefully) hit a zero-sized
buffer, add a few pipe_buf_assert_len() debugging checks.
pipe_buf_assert_len() can probably have more users, including the writers
which update pipe->head.
While at it, simplify eat_empty_buffer(), it can use pipe_buf(pipe->tail).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/splice.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/fs/splice.c b/fs/splice.c
index 28cfa63aa236..fb7841c07edd 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -453,7 +453,7 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des
while (!pipe_empty(head, tail)) {
struct pipe_buffer *buf = &pipe->bufs[tail & mask];
- sd->len = buf->len;
+ sd->len = pipe_buf_assert_len(buf);
if (sd->len > sd->total_len)
sd->len = sd->total_len;
@@ -494,13 +494,11 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des
/* We know we have a pipe buffer, but maybe it's empty? */
static inline bool eat_empty_buffer(struct pipe_inode_info *pipe)
{
- unsigned int tail = pipe->tail;
- unsigned int mask = pipe->ring_size - 1;
- struct pipe_buffer *buf = &pipe->bufs[tail & mask];
+ struct pipe_buffer *buf = pipe_buf(pipe, pipe->tail);
- if (unlikely(!buf->len)) {
+ if (unlikely(!pipe_buf_assert_len(buf))) {
pipe_buf_release(pipe, buf);
- pipe->tail = tail+1;
+ pipe->tail++;
return true;
}
@@ -717,7 +715,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
left = sd.total_len;
for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
struct pipe_buffer *buf = &pipe->bufs[tail & mask];
- size_t this_len = buf->len;
+ size_t this_len = pipe_buf_assert_len(buf);
/* zero-length bvecs are not supported, skip them */
if (!this_len)
@@ -852,7 +850,7 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
struct pipe_buffer *buf = &pipe->bufs[tail & mask];
size_t seg;
- if (!buf->len) {
+ if (!pipe_buf_assert_len(buf)) {
tail++;
continue;
}
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 15:07 ` [PATCH 1/2] " Oleg Nesterov
@ 2025-02-09 17:26 ` Linus Torvalds
2025-02-09 18:02 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-02-09 17:26 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On Sun, 9 Feb 2025 at 07:08, Oleg Nesterov <oleg@redhat.com> wrote:
>
> This is no longer necessary after c73be61cede5 ("pipe: Add general notification
> queue support"), pipe_write() checks pipe_has_watch_queue() and returns -EXDEV
> at the start. And can't help in any case, pipe_write() no longer takes this
> rd_wait.lock spinlock.
Ack. This code all goes back to the two-level locking thing with
notifications using just the spinlock side.
The locking was removed from this code in commit dfaabf916b1c
("fs/pipe: remove unnecessary spinlock from pipe_write()"), but the
"pre-allocate" logic remained.
This patch seems to be the right thing to do and removes the vestiges
of the old model.
But I don't think you need that pipe_buf_assert_len() thing. And if
you do, please don't make it a pointless inline helper that only hides
what it does.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 17:26 ` Linus Torvalds
@ 2025-02-09 18:02 ` Oleg Nesterov
2025-02-09 18:17 ` Linus Torvalds
0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 18:02 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On 02/09, Linus Torvalds wrote:
>
> This patch seems to be the right thing to do and removes the vestiges
> of the old model.
OK, thanks.
> But I don't think you need that pipe_buf_assert_len() thing.
Well, I'd prefer to keep this WARN_ON_ONCE() for some time... If
nobody hits this warning we can kill eat_empty_buffer() and more
hopefully dead checks, for example
/* zero-length bvecs are not supported, skip them */
if (!this_len)
continue;
in iter_file_splice_write().
> And if
> you do, please don't make it a pointless inline helper that only hides
> what it does.
Could you explain what do you think should I do if I keep this check?
make pipe_buf_assert_len() return void? or just replace it with
WARN_ON_ONCE(!buf->len) in its callers?
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 18:02 ` Oleg Nesterov
@ 2025-02-09 18:17 ` Linus Torvalds
2025-02-09 18:44 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-02-09 18:17 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On Sun, 9 Feb 2025 at 10:02, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Could you explain what do you think should I do if I keep this check?
> make pipe_buf_assert_len() return void? or just replace it with
> WARN_ON_ONCE(!buf->len) in its callers?
Just replace it with WARN_ON_ONCE() in any place where you really
think it's needed.
And honestly, if you worry so much about it, just allow the zero-sized
case. I don't see why you want to special-case it in the first place.
Yes, the zero sized buffer *used* to be a special case, but it was a
special case for writes.
And yes, splice wants to actually wait for "do we have real data" and
has that eat_empty_buffer() case, but just *keep* it.
Keeping that check is *better* and clearer than adding some pointless warning.
IOW, why warn for a case that isn't a problem, and you're only making
it a problem by thinking it is?
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 18:17 ` Linus Torvalds
@ 2025-02-09 18:44 ` Oleg Nesterov
2025-02-09 18:52 ` Linus Torvalds
0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 18:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On 02/09, Linus Torvalds wrote:
>
> On Sun, 9 Feb 2025 at 10:02, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Could you explain what do you think should I do if I keep this check?
> > make pipe_buf_assert_len() return void? or just replace it with
> > WARN_ON_ONCE(!buf->len) in its callers?
>
> Just replace it with WARN_ON_ONCE() in any place where you really
> think it's needed.
OK, will do.
> IOW, why warn for a case that isn't a problem, and you're only making
> it a problem by thinking it is?
Again, lets look eat_empty_buffer().
The comment says "maybe it's empty" but how/why can this happen ?
The changelog for d1a819a2ec2d3 ("splice: teach splice pipe reading
about empty pipe buffers") says "you can trigger it by doing a write
to a pipe that fails" but if someone looks at anon_pipe_write() after
1/2 this case is not possible.
And if eat_empty_buffer() flushes the buffer and updates pipe->tail,
why doesn't it wake the writers?
WARN_ON_ONCE() makes it clear that we do not expect !buf->len == 0,
and the kernel will complain if it does happen.
So unless you have a strong opinion, I'd prefer to keep it for now.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 18:44 ` Oleg Nesterov
@ 2025-02-09 18:52 ` Linus Torvalds
2025-02-09 19:15 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-02-09 18:52 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Again, lets look eat_empty_buffer().
>
> The comment says "maybe it's empty" but how/why can this happen ?
WHY DO YOU CARE?
Even if it's just defensive programming, it's just plain good code,
when the rule is "the caller is waiting for data".
And if it means that you don't need to add random WARN_ON_ONCE() statements.
So here's the deal: either you
(a) convince yourself that the length really can never be true, and
you write a comment about why that is the case.
or
(b) you DON'T convince yourself that that is true, and you leave
eat_empty_buffer() alone.
and both of those situations are fine.
But adding a random sprinkling of WARN_ON_ONCE() statements is worse
than *either* of those two cases.
See what my argument is? Adding a WARN_ON is just making the code
worse. Don't do it. It's the worst of both worlds: it adds code to
generate, and it adds confusion to readers ("why are we warning?" or
"when can this happen").
In contrast, the "eat_empty_buffer()" case just saying "if it's an
empty buffer, it doesn't satisfy my needs, so I'll just release the
empty buffer and go on".
That's simple and straightforward. Easy to maintain, and more
efficient than your WARN_ONs.
And if you can convince yourself it's not needed, you remove it.
EXACTLY LIKE THE WARN_ON.
So there's simply never any upside to the WARN_ON.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 18:52 ` Linus Torvalds
@ 2025-02-09 19:15 ` Oleg Nesterov
2025-02-09 23:39 ` K Prateek Nayak
0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-09 19:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On 02/09, Linus Torvalds wrote:
>
> On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Again, lets look eat_empty_buffer().
> >
> > The comment says "maybe it's empty" but how/why can this happen ?
>
> WHY DO YOU CARE?
Because it looks unclear/confusing, and I think it can confuse other
readers of this code. Especially after 1/2.
> So here's the deal: either you
...
> (b) you DON'T convince yourself that that is true, and you leave
> eat_empty_buffer() alone.
Yes, I failed to convince myself that fs/splice.c can never add an
empty bufer. Although it seems to me it should not.
> In contrast, the "eat_empty_buffer()" case just saying "if it's an
> empty buffer, it doesn't satisfy my needs, so I'll just release the
> empty buffer and go on".
... without wakeup_pipe_writers().
OK, nevermind, I see your point even if I do not 100% agree.
I'll send v2 without WARN_ON() and without 2/2.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 19:15 ` Oleg Nesterov
@ 2025-02-09 23:39 ` K Prateek Nayak
2025-02-10 17:22 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-02-09 23:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
Mateusz Guzik, Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal,
WangYuli, linux-fsdevel, linux-kernel, Linus Torvalds
Hello Oleg,
On 2/10/2025 12:45 AM, Oleg Nesterov wrote:
> On 02/09, Linus Torvalds wrote:
>>
>> On Sun, 9 Feb 2025 at 10:45, Oleg Nesterov <oleg@redhat.com> wrote:
>>>
>>> Again, lets look eat_empty_buffer().
>>>
>>> The comment says "maybe it's empty" but how/why can this happen ?
>>
>> WHY DO YOU CARE?
>
> Because it looks unclear/confusing, and I think it can confuse other
> readers of this code. Especially after 1/2.
>
>> So here's the deal: either you
> ...
>> (b) you DON'T convince yourself that that is true, and you leave
>> eat_empty_buffer() alone.
>
> Yes, I failed to convince myself that fs/splice.c can never add an
> empty bufer. Although it seems to me it should not.
>
>> In contrast, the "eat_empty_buffer()" case just saying "if it's an
>> empty buffer, it doesn't satisfy my needs, so I'll just release the
>> empty buffer and go on".
>
> ... without wakeup_pipe_writers().
>
> OK, nevermind, I see your point even if I do not 100% agree.
>
> I'll send v2 without WARN_ON() and without 2/2.
Went ahead and tested that version on top of mainline with your
previous change to skip updating {a,c,m}time, here are the results:
==================================================================
Test : sched-messaging
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: mainline + no-acm_time[pct imp](CV) patched[pct imp](CV)
1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39)
2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55)
4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18)
8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77)
16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98)
I don't see any regression / improvements from a performance standpoint
on my 3rd Generation EPYC system (2 x 64C/128T. boost on, C2 disabled)
Feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Oleg.
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 15:07 [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
2025-02-09 15:07 ` [PATCH 1/2] " Oleg Nesterov
2025-02-09 15:08 ` [PATCH 2/2] splice: add some pipe_buf_assert_len() checks Oleg Nesterov
@ 2025-02-10 11:40 ` Oleg Nesterov
2025-02-10 12:52 ` [PATCH 0/2] " Christian Brauner
3 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-10 11:40 UTC (permalink / raw)
To: Christian Brauner, Jeff Layton, Linus Torvalds
Cc: David Howells, Gautham R. Shenoy, K Prateek Nayak, Mateusz Guzik,
Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal, WangYuli,
linux-fsdevel, linux-kernel
a194dfe6e6f6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
changed pipe_write() to increment pipe->head in advance. IIUC to avoid the
race with the post_one_notification()-like code which can add another buffer
under pipe->rd_wait.lock without pipe->mutex.
This is no longer necessary after c73be61cede5 ("pipe: Add general notification
queue support"), pipe_write() checks pipe_has_watch_queue() and returns -EXDEV
at the start. And can't help in any case, pipe_write() no longer takes this
rd_wait.lock spinlock.
Change pipe_write() to call copy_page_from_iter() first and do nothing if it
fails. This way pipe_write() can't add a zero-sized buffer and we can simplify
pipe_read() which currently has to take care of this very unlikely case.
Also, with this patch we can probably kill eat_empty_buffer() and more
"is this buffer empty" checks in fs/splice.c later.
Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
fs/pipe.c | 45 +++++++++------------------------------------
1 file changed, 9 insertions(+), 36 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index 2ae75adfba64..b0641f75b1ba 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
break;
}
mutex_unlock(&pipe->mutex);
-
/*
* We only get here if we didn't actually read anything.
*
- * However, we could have seen (and removed) a zero-sized
- * pipe buffer, and might have made space in the buffers
- * that way.
- *
- * You can't make zero-sized pipe buffers by doing an empty
- * write (not even in packet mode), but they can happen if
- * the writer gets an EFAULT when trying to fill a buffer
- * that already got allocated and inserted in the buffer
- * array.
- *
- * So we still need to wake up any pending writers in the
- * _very_ unlikely case that the pipe was full, but we got
- * no data.
- */
- if (unlikely(wake_writer))
- wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
- kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
-
- /*
* But because we didn't read anything, at this point we can
* just return directly with -ERESTARTSYS if we're interrupted,
* since we've done any required wakeups and there's no need
@@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
return -ERESTARTSYS;
- wake_writer = false;
wake_next_reader = true;
mutex_lock(&pipe->mutex);
}
@@ -526,33 +505,27 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
pipe->tmp_page = page;
}
- /* Allocate a slot in the ring in advance and attach an
- * empty buffer. If we fault or otherwise fail to use
- * it, either the reader will consume it or it'll still
- * be there for the next write.
- */
- pipe->head = head + 1;
+ copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
+ if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
+ if (!ret)
+ ret = -EFAULT;
+ break;
+ }
+ pipe->head = head + 1;
+ pipe->tmp_page = NULL;
/* Insert it into the buffer array */
buf = &pipe->bufs[head & mask];
buf->page = page;
buf->ops = &anon_pipe_buf_ops;
buf->offset = 0;
- buf->len = 0;
if (is_packetized(filp))
buf->flags = PIPE_BUF_FLAG_PACKET;
else
buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
- pipe->tmp_page = NULL;
- copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
- if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
- if (!ret)
- ret = -EFAULT;
- break;
- }
- ret += copied;
buf->len = copied;
+ ret += copied;
if (!iov_iter_count(from))
break;
--
2.25.1.362.g51ebf55
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 15:07 [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
` (2 preceding siblings ...)
2025-02-10 11:40 ` [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
@ 2025-02-10 12:52 ` Christian Brauner
3 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2025-02-10 12:52 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, David Howells, Gautham R. Shenoy,
K Prateek Nayak, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel,
Jeff Layton, Linus Torvalds
On Sun, 09 Feb 2025 16:07:18 +0100, Oleg Nesterov wrote:
> Please review.
>
> pipe_write() can insert the empty buffer and this looks very confusing
> to me. Because it looks obviously unnecessary and complicates the code.
>
> In fact this logic doesn't even look strictly correct. For example,
> eat_empty_buffer() simply updates pipe->tail but (unlike pipe_read) it
> doesn't wake the writers.
>
> [...]
Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc
[1/1] pipe: change pipe_write() to never add a zero-sized buffer
https://git.kernel.org/vfs/vfs/c/af69e27b3c82
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-09 23:39 ` K Prateek Nayak
@ 2025-02-10 17:22 ` Oleg Nesterov
2025-02-10 17:37 ` Linus Torvalds
2025-02-11 3:59 ` K Prateek Nayak
0 siblings, 2 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-10 17:22 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
Mateusz Guzik, Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal,
WangYuli, linux-fsdevel, linux-kernel, Linus Torvalds
Hi Prateek,
On 02/10, K Prateek Nayak wrote:
>
> 1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39)
> 2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55)
> 4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18)
> 8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77)
> 16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98)
>
> I don't see any regression / improvements from a performance standpoint
Yes, this patch shouldn't make any difference performance-wise, at least
in this case. Although I was thinking the same thing when I sent "pipe_read:
don't wake up the writer if the pipe is still full" ;)
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Thanks! Please see v2, I've included you tag.
Any chance you can also test the patch below?
To me it looks like a cleanup which makes the "merge small writes" logic
more understandable. And note that "page-align the rest of the writes"
doesn't work anyway if "total_len & (PAGE_SIZE-1)" can't fit in the last
buffer.
However, in this particular case with DATASIZE=100 this patch can increase
the number of copy_page_from_iter()'s in pipe_write(). And with this change
receiver() can certainly get the short reads, so this can increase the
number of sys_read() calls.
So I am just curious if this change can cause any noticeable regression on
your machine.
Thank you!
Oleg.
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -459,16 +459,16 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
was_empty = pipe_empty(head, pipe->tail);
chars = total_len & (PAGE_SIZE-1);
if (chars && !was_empty) {
- unsigned int mask = pipe->ring_size - 1;
- struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
+ struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
int offset = buf->offset + buf->len;
+ int avail = PAGE_SIZE - offset;
- if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
- offset + chars <= PAGE_SIZE) {
+ if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) {
ret = pipe_buf_confirm(pipe, buf);
if (ret)
goto out;
+ chars = min_t(ssize_t, chars, avail);
ret = copy_page_from_iter(buf->page, offset, chars, from);
if (unlikely(ret < chars)) {
ret = -EFAULT;
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-10 17:22 ` Oleg Nesterov
@ 2025-02-10 17:37 ` Linus Torvalds
2025-02-10 18:36 ` Oleg Nesterov
2025-02-11 3:59 ` K Prateek Nayak
1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2025-02-10 17:37 UTC (permalink / raw)
To: Oleg Nesterov
Cc: K Prateek Nayak, Christian Brauner, Jeff Layton, David Howells,
Gautham R. Shenoy, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On Mon, 10 Feb 2025 at 09:22, Oleg Nesterov <oleg@redhat.com> wrote:
>
> + int avail = PAGE_SIZE - offset;
>
> - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
> - offset + chars <= PAGE_SIZE) {
> + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) {
> ret = pipe_buf_confirm(pipe, buf);
> if (ret)
> goto out;
>
> + chars = min_t(ssize_t, chars, avail);
If I read this correctly, this patch is horribly broken.
You can't do partial writes. Pipes have one very core atomicity
guarantee: from the man-pages:
PIPE_BUF
POSIX.1 says that writes of less than PIPE_BUF bytes must be
atomic: the output data is written to the pipe as a contiguous
sequence. Writes of more than PIPE_BUF bytes may be nonatomic:
the kernel may interleave the data with data written by other
processes. POSIX.1 requires PIPE_BUF to be at least 512 bytes.
(On Linux, PIPE_BUF is 4096 bytes.)
IOW, that whole "try to write as many chars as there is room" is very
very broken. You have to write all or nothing.
So you can't (say) first write just 50 bytes of a 100-byte pipe write
because it fits in the last buffer, and then wait for another buffer
to become free to write the rest. Not just because another writer
might come in and start mixing in data, but because *readers* may well
expect to get 100 bytes or nothing.
And to make matters worse, you'll never notice the bug until something
breaks very subtly (unless we happen to have a good test-case for this
somewhere - there might be a test for this in LTP).
And yes, this is actually something I know for a fact that people
depend on. Lots of traditional UNIX "send packet commands over pipes"
programs around, which expect the packets to be atomic.
So things *will* break, but it might take a while before you hit just
the right race condition for things to go south, and the errors migth
end up being very non-obvious indeed.
Note that the initial
chars = total_len & (PAGE_SIZE-1);
before the whole test for "can we merge" is fine, because if total_len
is larger than a page, it's no longer a write we need to worry about
atomicity with.
Maybe we should add a comment somewhere about this.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-10 17:37 ` Linus Torvalds
@ 2025-02-10 18:36 ` Oleg Nesterov
0 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2025-02-10 18:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: K Prateek Nayak, Christian Brauner, Jeff Layton, David Howells,
Gautham R. Shenoy, Mateusz Guzik, Neeraj Upadhyay, Oliver Sang,
Swapnil Sapkal, WangYuli, linux-fsdevel, linux-kernel
On 02/10, Linus Torvalds wrote:
>
> On Mon, 10 Feb 2025 at 09:22, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > + int avail = PAGE_SIZE - offset;
> >
> > - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
> > - offset + chars <= PAGE_SIZE) {
> > + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) {
> > ret = pipe_buf_confirm(pipe, buf);
> > if (ret)
> > goto out;
> >
> > + chars = min_t(ssize_t, chars, avail);
>
> If I read this correctly, this patch is horribly broken.
>
> You can't do partial writes. Pipes have one very core atomicity
> guarantee: from the man-pages:
>
> PIPE_BUF
> POSIX.1 says that writes of less than PIPE_BUF bytes must be
> atomic:
Ah, I didn't know!
Thanks for your explanation and the quick NACK!
> Maybe we should add a comment somewhere about this.
Agreed. It would certainly help the ignorant readers like me ;)
So I guess that the "goto again" logic in sender/receiver in
tools/perf/bench/sched-messaging.c is currently pointless,
DATASIZE == 100 < PIPE_BUF.
Thanks.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/2] pipe: change pipe_write() to never add a zero-sized buffer
2025-02-10 17:22 ` Oleg Nesterov
2025-02-10 17:37 ` Linus Torvalds
@ 2025-02-11 3:59 ` K Prateek Nayak
1 sibling, 0 replies; 16+ messages in thread
From: K Prateek Nayak @ 2025-02-11 3:59 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, Jeff Layton, David Howells, Gautham R. Shenoy,
Mateusz Guzik, Neeraj Upadhyay, Oliver Sang, Swapnil Sapkal,
WangYuli, linux-fsdevel, linux-kernel, Linus Torvalds
Hello Oleg,
On 2/10/2025 10:52 PM, Oleg Nesterov wrote:
> Hi Prateek,
>
> On 02/10, K Prateek Nayak wrote:
>>
>> 1-groups 1.00 [ -0.00]( 7.19) 0.95 [ 4.90](12.39)
>> 2-groups 1.00 [ -0.00]( 3.54) 1.02 [ -1.92]( 6.55)
>> 4-groups 1.00 [ -0.00]( 2.78) 1.01 [ -0.85]( 2.18)
>> 8-groups 1.00 [ -0.00]( 1.04) 0.99 [ 0.63]( 0.77)
>> 16-groups 1.00 [ -0.00]( 1.02) 1.00 [ -0.26]( 0.98)
>>
>> I don't see any regression / improvements from a performance standpoint
>
> Yes, this patch shouldn't make any difference performance-wise, at least
> in this case. Although I was thinking the same thing when I sent "pipe_read:
> don't wake up the writer if the pipe is still full" ;)
>
>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Thanks! Please see v2, I've included you tag.
Thank you. I can confirm it is same as the variant I tested.
>
> Any chance you can also test the patch below?
>
> To me it looks like a cleanup which makes the "merge small writes" logic
> more understandable. And note that "page-align the rest of the writes"
> doesn't work anyway if "total_len & (PAGE_SIZE-1)" can't fit in the last
> buffer.
>
> However, in this particular case with DATASIZE=100 this patch can increase
> the number of copy_page_from_iter()'s in pipe_write(). And with this change
> receiver() can certainly get the short reads, so this can increase the
> number of sys_read() calls.
>
> So I am just curious if this change can cause any noticeable regression on
> your machine.
For the sake of science:
==================================================================
Test : sched-messaging
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: baseline[pct imp](CV) merge_writes[pct imp](CV)
1-groups 1.00 [ -0.00](12.39) 1.08 [ -7.62](11.73)
2-groups 1.00 [ -0.00]( 6.55) 0.97 [ 2.52]( 3.01)
4-groups 1.00 [ -0.00]( 2.18) 1.00 [ 0.42]( 1.97)
8-groups 1.00 [ -0.00]( 0.77) 1.03 [ -3.35]( 5.07)
16-groups 1.00 [ -0.00]( 0.98) 1.01 [ -1.37]( 2.20)
I see some improvements up until 4 groups (160 tasks) but beyond that it
goes into a slight regression territory but the variance is large to
draw any conclusions.
Science experiment concluded.
>
> Thank you!
>
> Oleg.
>
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -459,16 +459,16 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> was_empty = pipe_empty(head, pipe->tail);
> chars = total_len & (PAGE_SIZE-1);
> if (chars && !was_empty) {
> - unsigned int mask = pipe->ring_size - 1;
> - struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
> + struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
> int offset = buf->offset + buf->len;
> + int avail = PAGE_SIZE - offset;
>
> - if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
> - offset + chars <= PAGE_SIZE) {
> + if (avail && (buf->flags & PIPE_BUF_FLAG_CAN_MERGE)) {
> ret = pipe_buf_confirm(pipe, buf);
> if (ret)
> goto out;
>
> + chars = min_t(ssize_t, chars, avail);
> ret = copy_page_from_iter(buf->page, offset, chars, from);
> if (unlikely(ret < chars)) {
> ret = -EFAULT;
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-02-11 3:59 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-09 15:07 [PATCH 0/2] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
2025-02-09 15:07 ` [PATCH 1/2] " Oleg Nesterov
2025-02-09 17:26 ` Linus Torvalds
2025-02-09 18:02 ` Oleg Nesterov
2025-02-09 18:17 ` Linus Torvalds
2025-02-09 18:44 ` Oleg Nesterov
2025-02-09 18:52 ` Linus Torvalds
2025-02-09 19:15 ` Oleg Nesterov
2025-02-09 23:39 ` K Prateek Nayak
2025-02-10 17:22 ` Oleg Nesterov
2025-02-10 17:37 ` Linus Torvalds
2025-02-10 18:36 ` Oleg Nesterov
2025-02-11 3:59 ` K Prateek Nayak
2025-02-09 15:08 ` [PATCH 2/2] splice: add some pipe_buf_assert_len() checks Oleg Nesterov
2025-02-10 11:40 ` [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer Oleg Nesterov
2025-02-10 12:52 ` [PATCH 0/2] " Christian Brauner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).