* [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
2025-03-04 14:07 ` Oleg Nesterov
2025-03-03 23:04 ` [PATCH 2/3] pipe: cache 2 pages instead of 1 Mateusz Guzik
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
To: torvalds
Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel, Mateusz Guzik
The check operates on the stale value of 'head' and always loops back.
Just do it unconditionally. No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
fs/pipe.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index 19a7948ab234..d5238f6e0f08 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
if (!iov_iter_count(from))
break;
- }
- if (!pipe_full(head, pipe->tail, pipe->max_usage))
continue;
+ }
/* Wait for buffer space to become available. */
if ((filp->f_flags & O_NONBLOCK) ||
--
2.43.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
@ 2025-03-04 14:07 ` Oleg Nesterov
2025-03-04 14:58 ` Mateusz Guzik
2025-03-04 15:44 ` pipes && EPOLLET, again Oleg Nesterov
0 siblings, 2 replies; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 14:07 UTC (permalink / raw)
To: Mateusz Guzik
Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On 03/04, Mateusz Guzik wrote:
>
> @@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>
> if (!iov_iter_count(from))
> break;
> - }
>
> - if (!pipe_full(head, pipe->tail, pipe->max_usage))
> continue;
> + }
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
It seems that we can also remove the unnecessary signal_pending()
check, but I need to recheck and we need to cleanup the poll_usage
logic first.
This will also remove the unnecessary wakeups when the writer is
interrupted by a signal.
diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..ed55a86ca03b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -541,12 +541,6 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
ret = -EAGAIN;
break;
}
- if (signal_pending(current)) {
- if (!ret)
- ret = -ERESTARTSYS;
- break;
- }
-
/*
* We're going to release the pipe lock and wait for more
* space. We wake up any readers if necessary, and then
@@ -554,10 +548,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
* become empty while we dropped the lock.
*/
mutex_unlock(&pipe->mutex);
- if (was_empty)
+ if (was_empty || pipe->poll_usage)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
- wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
+ if (wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)) < 0)
+ return ret ?: -ERESTARTSYS;
mutex_lock(&pipe->mutex);
was_empty = pipe_empty(pipe->head, pipe->tail);
wake_next_writer = true;
* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
2025-03-04 14:07 ` Oleg Nesterov
@ 2025-03-04 14:58 ` Mateusz Guzik
2025-03-04 15:18 ` Oleg Nesterov
2025-03-04 15:44 ` pipes && EPOLLET, again Oleg Nesterov
1 sibling, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-04 14:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On Tue, Mar 4, 2025 at 3:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 03/04, Mateusz Guzik wrote:
> >
> > @@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >
> > if (!iov_iter_count(from))
> > break;
> > - }
> >
> > - if (!pipe_full(head, pipe->tail, pipe->max_usage))
> > continue;
> > + }
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>
thanks
> It seems that we can also remove the unnecessary signal_pending()
> check, but I need to recheck and we need to cleanup the poll_usage
> logic first.
>
> This will also remove the unnecessary wakeups when the writer is
> interrupted by a signal.
>
[snip]
There are many touch-ups to do here; I don't have an opinion about this diff.
I don't have compiled stats handy, but a few months back I asked some
people to query pipe writes with dtrace on FreeBSD. Everything is very
heavily skewed towards < 128 bytes in total, including tons of 1-byte
writes (and no, it's not just make). However, there are also quite a few
absolutely massive multipage ops -- north of 64K even. Doing
that kind of op in one go is way faster than restarting rep mov every
time interleaved with SMAP toggling, which is expensive in its own
right.
Thus I would argue someone(tm) should do it in Linux, but I don't have
immediate plans. Perhaps you would be happy to do it? :)
In the meantime, I would hope any refactoring in the area would make
the above easier (my patch does, I'm not claiming yours does not :P)
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
2025-03-04 14:58 ` Mateusz Guzik
@ 2025-03-04 15:18 ` Oleg Nesterov
0 siblings, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 15:18 UTC (permalink / raw)
To: Mateusz Guzik
Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On 03/04, Mateusz Guzik wrote:
>
> On Tue, Mar 4, 2025 at 3:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > It seems that we can also remove the unnecessary signal_pending()
> > check, but I need to recheck and we need to cleanup the poll_usage
> > logic first.
> >
> > This will also remove the unnecessary wakeups when the writer is
> > interrupted by a signal.
> >
>
> There are many touch-ups to do here; I don't have an opinion about this diff.
...
> Thus I would argue someone(tm) should do it in Linux, but I don't have
> immediate plans. Perhaps you would be happy to do it? :)
Probably. In any case this needs a separate patch, sorry for the confusion.
Oleg.
* pipes && EPOLLET, again
2025-03-04 14:07 ` Oleg Nesterov
2025-03-04 14:58 ` Mateusz Guzik
@ 2025-03-04 15:44 ` Oleg Nesterov
2025-03-04 18:12 ` Linus Torvalds
1 sibling, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 15:44 UTC (permalink / raw)
To: Linus Torvalds, Mateusz Guzik
Cc: brauner, mingo, peterz, rostedt, linux-kernel, linux-fsdevel
Linus,
On 03/04, Oleg Nesterov wrote:
>
> and we need to cleanup the poll_usage
> logic first.
We have already discussed this before, I'll probably do this later,
but let's forget it for now.
Don't we need the trivial one-liner below anyway?
I am not saying this is a bug, but please consider
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <assert.h>
static char buf[16 * 4096];
int main(void)
{
int pfd[2], efd;
struct epoll_event evt = { .events = EPOLLIN | EPOLLET };
pipe(pfd);
efd = epoll_create1(0);
epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &evt);
write(pfd[1], buf, 4096);
assert(epoll_wait(efd, &evt, 1, 0) == 1);
if (!fork()) {
write(pfd[1], buf, sizeof(buf));
assert(0);
}
sleep(1);
assert(epoll_wait(efd, &evt, 1, 0) == 1);
return 0;
}
the 2nd epoll_wait() fails, despite the fact that the child has already
written 15 * PAGE_SIZE bytes. This doesn't look consistent to me...
Oleg.
---
diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..8a32257cc74f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -554,7 +554,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
* become empty while we dropped the lock.
*/
mutex_unlock(&pipe->mutex);
- if (was_empty)
+ if (was_empty || pipe->poll_usage)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
* Re: pipes && EPOLLET, again
2025-03-04 15:44 ` pipes && EPOLLET, again Oleg Nesterov
@ 2025-03-04 18:12 ` Linus Torvalds
2025-03-04 19:32 ` Oleg Nesterov
0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2025-03-04 18:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Linus,
>
> On 03/04, Oleg Nesterov wrote:
> >
> > and we need to cleanup the poll_usage
> > logic first.
>
> We have already discussed this before, I'll probably do this later,
> but let's forget it for now.
>
> Don't we need the trivial one-liner below anyway?
See this email of mine:
https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/
and the last paragraph in particular.
The whole "poll_usage" thing is a kernel *hack* to deal with broken
user space that expected garbage semantics that aren't real, and were
never real.
They were a random implementation detail, and they were *wrong*.
But because we have a strict "we don't break user space", we
introduced that completely bogus hack to say "ok, we'll send these
completely wrong extraneous events despite the fact that nothing has
changed, because some broken user space program was written to expect
them".
> I am not saying this is a bug, but please consider
That program is buggy, and we're not adding new hacks for new bugs.
If you ask for an edge-triggered EPOLL event, you get an *edge*
triggered EPOLL event. And there is no edge - the pipe was already
readable, no edge anywhere in sight.
So to clarify: we added the hack to work around *existing* user space
bugs that happened to work before.
If anything, we might consider removing the crazy "poll_usage" hack in
the (probably futile) hope that user space has been fixed.
But we're not encouraging *new* bugs.
Linus
* Re: pipes && EPOLLET, again
2025-03-04 18:12 ` Linus Torvalds
@ 2025-03-04 19:32 ` Oleg Nesterov
2025-03-04 19:49 ` Linus Torvalds
0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 19:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On 03/04, Linus Torvalds wrote:
>
> On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Don't we need the trivial one-liner below anyway?
>
> See this email of mine:
>
> https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/
>
> and the last paragraph in particular.
>
> The whole "poll_usage" thing is a kernel *hack* to deal with broken
> user space that expected garbage semantics that aren't real, and were
> never real.
Yes, agreed. But we can make this hack more understandable. But as I said,
this is off-topic right now.
> introduced that completely bogus hack to say "ok, we'll send these
> completely wrong extraneous events despite the fact that nothing has
> changed, because some broken user space program was written to expect
> them".
Yes, but since we already have this hack:
> That program is buggy, and we're not adding new hacks for new bugs.
Yes, but see below...
> If you ask for an edge-triggered EPOLL event, you get an *edge*
> triggered EPOLL event. And there is no edge - the pipe was already
> readable, no edge anywhere in sight.
Yes, the pipe was already readable before fork, but this condition
was already "consumed" by the 1st epoll_wait(EPOLLET). Please see below.
> If anything, we might consider removing the crazy "poll_usage" hack in
> the (probably futile) hope that user space has been fixed.
This would be the best option ;) Until then:
I agree that my test case is "buggy", but afaics it is not buggier than
userspace programs which rely on the unconditional kill_fasync()'s in
pipe_read/pipe_write?
So, anon_pipe_write() does
if (was_empty)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
before wait_event(pipe->wr_wait), but before returning it does
if (was_empty || pipe->poll_usage)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
and this looks confusing to me.
If pipe_write() doesn't take poll_usage into account before wait_event(wr_wait),
then it doesn't need the kill_fasync() either?
So I won't argue, but why not make both cases more consistent?
Oleg.
* Re: pipes && EPOLLET, again
2025-03-04 19:32 ` Oleg Nesterov
@ 2025-03-04 19:49 ` Linus Torvalds
0 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2025-03-04 19:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel
On Tue, 4 Mar 2025 at 09:32, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I agree that my test case is "buggy", but afaics it is not buggier than
> userspace programs which rely on the unconditional kill_fasync()'s in
> pipe_read/pipe_write?
I'm not convinced any such users actually exist.
The reason kill_fasync() is unconditional is that it's cheap. The
normal situation is "nobody there", and we test that without any
locking.
So we've never bothered to make any changes to that path, and there's
never been any real reason to have any "was_empty" like conditionals.
Linus
* [PATCH 2/3] pipe: cache 2 pages instead of 1
2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
2025-03-04 8:46 ` [PATCH 0/3] some pipe + wait stuff Christian Brauner
3 siblings, 0 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
To: torvalds
Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel, Mateusz Guzik
User data is kept in a circular buffer backed by pages allocated as
needed. Only having space for one spare page still makes it common to
resort to allocation / freeing.
In my testing this decreases page allocs by 60% during a kernel build.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
fs/pipe.c | 60 ++++++++++++++++++++++++++-------------
include/linux/pipe_fs_i.h | 2 +-
2 files changed, 41 insertions(+), 21 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index d5238f6e0f08..f5a316d4da95 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -112,20 +112,40 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
pipe_lock(pipe2);
}
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+{
+ for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+ if (pipe->tmp_page[i]) {
+ struct page *page = pipe->tmp_page[i];
+ pipe->tmp_page[i] = NULL;
+ return page;
+ }
+ }
+
+ return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
+}
+
+static void anon_pipe_put_page(struct pipe_inode_info *pipe,
+ struct page *page)
+{
+ if (page_count(page) == 1) {
+ for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+ if (!pipe->tmp_page[i]) {
+ pipe->tmp_page[i] = page;
+ return;
+ }
+ }
+ }
+
+ put_page(page);
+}
+
static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
struct pipe_buffer *buf)
{
struct page *page = buf->page;
- /*
- * If nobody else uses this page, and we don't already have a
- * temporary page, let's keep track of it as a one-deep
- * allocation cache. (Otherwise just release our reference to it)
- */
- if (page_count(page) == 1 && !pipe->tmp_page)
- pipe->tmp_page = page;
- else
- put_page(page);
+ anon_pipe_put_page(pipe, page);
}
static bool anon_pipe_buf_try_steal(struct pipe_inode_info *pipe,
@@ -493,27 +513,25 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
unsigned int mask = pipe->ring_size - 1;
struct pipe_buffer *buf;
- struct page *page = pipe->tmp_page;
+ struct page *page;
int copied;
- if (!page) {
- page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
- if (unlikely(!page)) {
- ret = ret ? : -ENOMEM;
- break;
- }
- pipe->tmp_page = page;
+ page = anon_pipe_get_page(pipe);
+ if (unlikely(!page)) {
+ if (!ret)
+ ret = -ENOMEM;
+ break;
}
copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
+ anon_pipe_put_page(pipe, page);
if (!ret)
ret = -EFAULT;
break;
}
pipe->head = head + 1;
- pipe->tmp_page = NULL;
/* Insert it into the buffer array */
buf = &pipe->bufs[head & mask];
buf->page = page;
@@ -846,8 +864,10 @@ void free_pipe_info(struct pipe_inode_info *pipe)
if (pipe->watch_queue)
put_watch_queue(pipe->watch_queue);
#endif
- if (pipe->tmp_page)
- __free_page(pipe->tmp_page);
+ for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+ if (pipe->tmp_page[i])
+ __free_page(pipe->tmp_page[i]);
+ }
kfree(pipe->bufs);
kfree(pipe);
}
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..eb7994a1ff93 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,7 +72,7 @@ struct pipe_inode_info {
#ifdef CONFIG_WATCH_QUEUE
bool note_loss;
#endif
- struct page *tmp_page;
+ struct page *tmp_page[2];
struct fasync_struct *fasync_readers;
struct fasync_struct *fasync_writers;
struct pipe_buffer *bufs;
--
2.43.0
* [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
2025-03-03 23:04 ` [PATCH 2/3] pipe: cache 2 pages instead of 1 Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
2025-03-04 14:19 ` Peter Zijlstra
2025-03-04 8:46 ` [PATCH 0/3] some pipe + wait stuff Christian Brauner
3 siblings, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
To: torvalds
Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
linux-fsdevel, Mateusz Guzik
In the vast majority of cases the condition determining whether the thread
can proceed is true after the first wake up.
However, even in that case the thread ends up calling into
prepare_to_wait_event() again, suffering a spurious irq + lock trip.
Then it calls into finish_wait() to unlink itself.
Note that in the case of a pending signal the work done by
prepare_to_wait_event() gets ignored even without the change.
Pre-check the condition after waking up instead.
Stats gathered during a kernel build:
bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
{ @[probe] = count(); }'
@[kprobe:finish_wait]: 392483
@[kprobe:prepare_to_wait_event]: 778690
That is, calls to prepare_to_wait_event() almost double the calls to
finish_wait(). This evens out with the patch.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
One may worry about using "condition" twice. However, macros leading up
to this one already do it, so it should be fine.
Also one may wonder about fences -- to my understanding, going off CPU and
back on guarantees a full fence, so the now-avoided lock trip changes
nothing.
include/linux/wait.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2bdc8f47963b..965a19809c7e 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -316,6 +316,9 @@ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
} \
\
cmd; \
+ \
+ if (condition) \
+ break; \
} \
finish_wait(&wq_head, &__wq_entry); \
__out: __ret; \
--
2.43.0
* Re: [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
@ 2025-03-04 14:19 ` Peter Zijlstra
2025-03-04 15:25 ` Mateusz Guzik
0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-03-04 14:19 UTC (permalink / raw)
To: Mateusz Guzik
Cc: torvalds, oleg, brauner, mingo, rostedt, linux-kernel,
linux-fsdevel
On Tue, Mar 04, 2025 at 12:04:09AM +0100, Mateusz Guzik wrote:
> In the vast majority of cases the condition determining whether the thread
> can proceed is true after the first wake up.
>
> However, even in that case the thread ends up calling into
> prepare_to_wait_event() again, suffering a spurious irq + lock trip.
>
> Then it calls into finish_wait() to unlink itself.
>
> Note that in the case of a pending signal the work done by
> prepare_to_wait_event() gets ignored even without the change.
>
> Pre-check the condition after waking up instead.
>
> Stats gathered during a kernel build:
> bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
> { @[probe] = count(); }'
>
> @[kprobe:finish_wait]: 392483
> @[kprobe:prepare_to_wait_event]: 778690
>
> That is, calls to prepare_to_wait_event() almost double the calls to
> finish_wait(). This evens out with the patch.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> One may worry about using "condition" twice. However, macros leading up
> to this one already do it, so it should be fine.
>
> Also one may wonder about fences -- to my understanding going off and on
> CPU guarantees a full fence, so the now avoided lock trip changes
> nothing.
so it always helps if you provide actual numbers. Supposedly this makes
it go faster?
Also, how many bytes does it add to the image?
Anyway, no real objection, but it would be good to have better
substantiation etc.
> include/linux/wait.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 2bdc8f47963b..965a19809c7e 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -316,6 +316,9 @@ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
> } \
> \
> cmd; \
> + \
> + if (condition) \
> + break; \
> } \
> finish_wait(&wq_head, &__wq_entry); \
> __out: __ret; \
> --
> 2.43.0
>
* Re: [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
2025-03-04 14:19 ` Peter Zijlstra
@ 2025-03-04 15:25 ` Mateusz Guzik
0 siblings, 0 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-04 15:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: torvalds, oleg, brauner, mingo, rostedt, linux-kernel,
linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]
On Tue, Mar 4, 2025 at 3:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 04, 2025 at 12:04:09AM +0100, Mateusz Guzik wrote:
> > In the vast majority of cases the condition determining whether the thread
> > can proceed is true after the first wake up.
> >
> > However, even in that case the thread ends up calling into
> > prepare_to_wait_event() again, suffering a spurious irq + lock trip.
> >
> > Then it calls into finish_wait() to unlink itself.
> >
> > Note that in the case of a pending signal the work done by
> > prepare_to_wait_event() gets ignored even without the change.
> >
> > Pre-check the condition after waking up instead.
> >
> > Stats gathered during a kernel build:
> > bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
> > { @[probe] = count(); }'
> >
> > @[kprobe:finish_wait]: 392483
> > @[kprobe:prepare_to_wait_event]: 778690
> >
> > That is, calls to prepare_to_wait_event() almost double the calls to
> > finish_wait(). This evens out with the patch.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > One may worry about using "condition" twice. However, macros leading up
> > to this one already do it, so it should be fine.
> >
> > Also one may wonder about fences -- to my understanding going off and on
> > CPU guarantees a full fence, so the now avoided lock trip changes
> > nothing.
>
> so it always helps if you provide actual numbers. Supposedly this makes
> it go faster?
>
> Also, how much bytes does it add to the image?
>
> Anyway, no real objection, but it would be good to have better
> substantiation etc.
>
I did not bother benchmarking this per se, but I did demonstrate with
bpftrace above that this does avoid the irq and lock usage, which is
normally the desired outcome.
It is a fair point that in this case this is a tradeoff with i-cache footprint.
bloat-o-meter claims:
add/remove: 2/2 grow/shrink: 159/5 up/down: 3865/-250 (3615)
[..]
Total: Before=29946099, After=29949714, chg +0.01%
The biggest growth, by a large margin:
will_read_block.part - 183 +183
fuse_get_req 800 924 +124
load_module 9137 9255 +118
fuse_do_readfolio 415 528 +113
fuse_page_mkwrite 157 259 +102
On the other hand the shrinkage:
port_fops_read 504 503 -1
__ps2_command 1402 1401 -1
fcntl_setlk 961 947 -14
__pfx_fuse_wait_on_folio_writeback 16 - -16
load_compressed_image 4007 3974 -33
uprobe_notify_resume 3384 3323 -61
fuse_wait_on_folio_writeback 124 - -124
The entire thing is attached.
--
Mateusz Guzik <mjguzik gmail.com>
[-- Attachment #2: bloat.txt --]
[-- Type: text/plain, Size: 11097 bytes --]
add/remove: 2/2 grow/shrink: 159/5 up/down: 3865/-250 (3615)
Function old new delta
will_read_block.part - 183 +183
fuse_get_req 800 924 +124
load_module 9137 9255 +118
fuse_do_readfolio 415 528 +113
fuse_page_mkwrite 157 259 +102
wait_port_writable 545 642 +97
ksm_scan_thread 8189 8270 +81
sg_ioctl 2839 2918 +79
fuse_wait_on_folio_writeback.part 202 280 +78
ring_buffer_wait 431 500 +69
__bio_queue_enter 551 617 +66
drm_wait_vblank_ioctl 1692 1756 +64
blk_queue_enter 490 554 +64
add_transaction_credits 700 762 +62
khugepaged 2197 2256 +59
sg_open 1663 1721 +58
sg_read 1523 1579 +56
fuse_dev_do_read 2405 2459 +54
joydev_read 1106 1158 +52
btrfs_wait_block_group_cache_progress 370 420 +50
syslog_print 709 757 +48
usb_wait_anchor_empty_timeout 235 280 +45
mmu_interval_notifier_remove 412 456 +44
ec_guard 510 554 +44
kcompactd 961 1004 +43
wait_for_tpm_stat 669 711 +42
psi_rtpoll_worker 906 948 +42
anon_pipe_write 1581 1621 +40
ksgxd 357 394 +37
btrfs_commit_transaction 3598 3634 +36
scsi_block_when_processing_errors 200 234 +34
save_compressed_image 2695 2728 +33
virtio_gpu_init 2249 2281 +32
regmap_async_complete.part 413 444 +31
fuse_wait_on_page_writeback 185 216 +31
fanotify_handle_event 6056 6087 +31
crtc_crc_read 718 749 +31
acpi_ec_stop 418 448 +30
fuse_readahead 930 959 +29
pipe_wait_writable 234 262 +28
evdev_read 643 671 +28
ps2_drain 322 349 +27
__tpm_tis_request_locality 466 492 +26
throttle_direct_reclaim 626 651 +25
ps2_do_sendbyte 539 563 +24
drm_atomic_helper_wait_for_vblanks.part 538 562 +24
devkmsg_read 467 491 +24
set_termios.part 637 660 +23
fuse_wait_aborted 252 275 +23
btrfs_caching_ctl_wait_done 177 200 +23
fuse_file_cached_io_open 377 399 +22
btrfs_scrub_pause 233 255 +22
virtballoon_free_page_report 219 240 +21
locks_lock_inode_wait 357 378 +21
usbhid_wait_io 238 258 +20
serio_raw_read 521 541 +20
request_wait_answer 519 539 +20
btrfs_pause_balance 264 284 +20
blk_mq_freeze_queue_wait_timeout 213 233 +20
anon_pipe_read 970 990 +20
shpc_write_cmd 656 675 +19
mousedev_read 457 476 +19
btrfs_commit_transaction_async 253 272 +19
tell_host 226 244 +18
oom_killer_disable 281 299 +18
virtgpu_virtio_get_uuid 200 217 +17
pipe_wait_readable 228 245 +17
async_synchronize_cookie_domain 285 302 +17
jbd2_journal_destroy 768 784 +16
decompress_threadfn 323 339 +16
crc32_threadfn 317 333 +16
compress_threadfn 344 360 +16
__pfx_will_read_block.part - 16 +16
wait_current_trans 296 311 +15
pciehp_sysfs_enable_slot 403 418 +15
pciehp_sysfs_disable_slot 392 407 +15
module_patient_check_exists.isra 313 328 +15
btrfs_balance_delayed_items 436 450 +14
genl_unregister_family 480 493 +13
btrfs_rm_dev_replace_blocked 159 172 +13
pci_dpc_recovered 212 224 +12
jbd2_log_wait_commit 269 281 +12
fuse_sync_fs 569 581 +12
fuse_set_nowrite 212 224 +12
flush_scrub_stripes 615 627 +12
drm_vblank_work_flush_all 217 229 +12
drm_read 678 690 +12
blk_mq_freeze_queue_wait 150 162 +12
rmw_rbio 2673 2684 +11
pseudo_lock_measure_trigger 740 751 +11
percpu_ref_switch_to_atomic_sync 186 197 +11
flush_space 1395 1406 +11
btrfs_bio_counter_inc_blocked 256 267 +11
__usermodehelper_disable 308 319 +11
wait_scrub_stripe_io 138 148 +10
wait_for_device_probe 148 158 +10
usbdev_ioctl 5231 5241 +10
synchronize_rcu_expedited 459 469 +10
serial_core_unregister_port 739 749 +10
pci_doe_wait.constprop 193 203 +10
oom_reaper 973 983 +10
kernfs_drain 305 315 +10
kauditd_thread 677 687 +10
jbd2_journal_lock_updates 210 220 +10
jbd2_journal_load 891 901 +10
iosf_mbi_punit_acquire 184 194 +10
iosf_mbi_block_punit_i2c_access 818 828 +10
fsnotify_destroy_group 213 223 +10
expand_files 667 677 +10
exp_funnel_lock 503 513 +10
dm_wait_event 157 167 +10
cpuset_handle_hotplug 3241 3251 +10
btrfs_wait_on_delayed_iputs 160 170 +10
btrfs_scrub_cancel_dev 255 265 +10
btrfs_scrub_cancel 252 262 +10
btrfs_remount_begin 221 231 +10
btrfs_drew_write_lock 199 209 +10
__synchronize_irq 148 158 +10
__scrub_blocked_if_needed 185 195 +10
virtio_gpu_vram_mmap 484 493 +9
virtio_gpu_get_caps_ioctl 555 564 +9
try_flush_qgroup 275 284 +9
start_this_handle 1243 1252 +9
scrub_raid56_parity_stripe 1328 1337 +9
pcie_wait_cmd 568 577 +9
pci_wait_cfg 170 179 +9
ns_revision_read 394 403 +9
mmu_interval_read_begin 190 199 +9
drm_vblank_work_flush 195 204 +9
dquot_disable 1596 1605 +9
dm_kcopyd_client_destroy 353 362 +9
cpufreq_freq_transition_begin 333 342 +9
cppc_set_perf 885 894 +9
btrfs_cancel_balance 339 348 +9
wake_up_and_wait_for_irq_thread_ready 156 164 +8
squashfs_cache_get 799 807 +8
scrub_rbio_work_locked 2250 2258 +8
rdtgroup_pseudo_lock_create 1725 1733 +8
drm_wait_one_vblank 473 481 +8
close_ctree 1258 1266 +8
btrfs_write_dirty_block_groups 910 918 +8
btrfs_start_ordered_extent_nowriteback 379 387 +8
btrfs_cleanup_transaction.isra 1373 1381 +8
autofs_wait 1763 1771 +8
wait_for_random_bytes 235 242 +7
virtio_gpu_cursor_ping 647 654 +7
usb_poison_urb 176 183 +7
usb_kill_urb.part 149 156 +7
rcu_sync_enter 251 258 +7
io_sqd_handle_event 305 312 +7
btrfs_drew_read_lock 136 143 +7
__vt_event_wait.part 122 129 +7
__percpu_ref_switch_mode 458 465 +7
wb_wait_for_completion 130 136 +6
hib_wait_io 131 137 +6
do_coredump 6667 6673 +6
xfs_pwork_poll 186 191 +5
virtio_gpu_queue_ctrl_sgs 648 653 +5
tty_wait_until_sent 429 434 +5
submit_read_wait_bio_list 373 377 +4
wait_for_commit 401 402 +1
port_fops_read 504 503 -1
__ps2_command 1402 1401 -1
fcntl_setlk 961 947 -14
__pfx_fuse_wait_on_folio_writeback 16 - -16
load_compressed_image 4007 3974 -33
uprobe_notify_resume 3384 3323 -61
fuse_wait_on_folio_writeback 124 - -124
Total: Before=29946099, After=29949714, chg +0.01%
* Re: [PATCH 0/3] some pipe + wait stuff
2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
` (2 preceding siblings ...)
2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
@ 2025-03-04 8:46 ` Christian Brauner
3 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2025-03-04 8:46 UTC (permalink / raw)
To: Mateusz Guzik, peterz
Cc: Christian Brauner, oleg, mingo, rostedt, linux-kernel,
linux-fsdevel, torvalds
On Tue, 04 Mar 2025 00:04:06 +0100, Mateusz Guzik wrote:
> As a side effect of looking at the pipe hang I came up with 3 changes to
> consider for -next.
>
> The first one is a trivial clean-up which I won't mind if it merely gets
> folded into someone else's change for pipes.
>
> The second one reduces page alloc/free calls for the backing area (60%
> less during a kernel build in my testing). I already posted this, but
> the cc list was not proper.
>
> [...]
This looks sane to me.
Would be good to get an Ack from Peter on the wait change.
---
Applied to the vfs-6.15.pipe branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.pipe branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.pipe
[1/3] pipe: drop an always true check in anon_pipe_write()
https://git.kernel.org/vfs/vfs/c/a40cd5849dab
[2/3] pipe: cache 2 pages instead of 1
https://git.kernel.org/vfs/vfs/c/46af8e2406c2
[3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
https://git.kernel.org/vfs/vfs/c/84654c7f4730