public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] some pipe + wait stuff
@ 2025-03-03 23:04 Mateusz Guzik
  2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
  To: torvalds
  Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel, Mateusz Guzik

As a side effect of looking at the pipe hang I came up with 3 changes to
consider for -next.

The first one is a trivial cleanup which I won't mind if it merely gets
folded into someone else's change for pipes.

The second one reduces page alloc/free calls for the backing area (60%
fewer during a kernel build in my testing). I already posted this, but
the Cc list was not right.

The last one concerns the wait/wakeup mechanism and drops one lock trip
in the common case after waking up. That too was posted some days ago,
but nobody was biting. Perhaps you will be interested (but again, maybe
I got the wrong people from get_maintainer.pl).

Mateusz Guzik (3):
  pipe: drop an always true check in anon_pipe_write()
  pipe: cache 2 pages instead of 1
  wait: avoid spurious calls to prepare_to_wait_event() in
    ___wait_event()

 fs/pipe.c                 | 63 +++++++++++++++++++++++++--------------
 include/linux/pipe_fs_i.h |  2 +-
 include/linux/wait.h      |  3 ++
 3 files changed, 45 insertions(+), 23 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
  2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
  2025-03-04 14:07   ` Oleg Nesterov
  2025-03-03 23:04 ` [PATCH 2/3] pipe: cache 2 pages instead of 1 Mateusz Guzik
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
  To: torvalds
  Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel, Mateusz Guzik

The check operates on a stale value of 'head' and is always true, so
the loop is always re-entered.

Just continue unconditionally instead. No functional changes.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/pipe.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 19a7948ab234..d5238f6e0f08 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 
 			if (!iov_iter_count(from))
 				break;
-		}
 
-		if (!pipe_full(head, pipe->tail, pipe->max_usage))
 			continue;
+		}
 
 		/* Wait for buffer space to become available. */
 		if ((filp->f_flags & O_NONBLOCK) ||
-- 
2.43.0



* [PATCH 2/3] pipe: cache 2 pages instead of 1
  2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
  2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
  2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
  2025-03-04  8:46 ` [PATCH 0/3] some pipe + wait stuff Christian Brauner
  3 siblings, 0 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
  To: torvalds
  Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel, Mateusz Guzik

User data is kept in a circular buffer backed by pages allocated as
needed. With space for only one spare page, the code still frequently
has to resort to allocation/freeing.

In my testing this decreases page allocs by 60% during a kernel build.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/pipe.c                 | 60 ++++++++++++++++++++++++++-------------
 include/linux/pipe_fs_i.h |  2 +-
 2 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index d5238f6e0f08..f5a316d4da95 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -112,20 +112,40 @@ void pipe_double_lock(struct pipe_inode_info *pipe1,
 	pipe_lock(pipe2);
 }
 
+static struct page *anon_pipe_get_page(struct pipe_inode_info *pipe)
+{
+	for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i]) {
+			struct page *page = pipe->tmp_page[i];
+			pipe->tmp_page[i] = NULL;
+			return page;
+		}
+	}
+
+	return alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
+}
+
+static void anon_pipe_put_page(struct pipe_inode_info *pipe,
+			       struct page *page)
+{
+	if (page_count(page) == 1) {
+		for (int i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+			if (!pipe->tmp_page[i]) {
+				pipe->tmp_page[i] = page;
+				return;
+			}
+		}
+	}
+
+	put_page(page);
+}
+
 static void anon_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
 	struct page *page = buf->page;
 
-	/*
-	 * If nobody else uses this page, and we don't already have a
-	 * temporary page, let's keep track of it as a one-deep
-	 * allocation cache. (Otherwise just release our reference to it)
-	 */
-	if (page_count(page) == 1 && !pipe->tmp_page)
-		pipe->tmp_page = page;
-	else
-		put_page(page);
+	anon_pipe_put_page(pipe, page);
 }
 
 static bool anon_pipe_buf_try_steal(struct pipe_inode_info *pipe,
@@ -493,27 +513,25 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
 			unsigned int mask = pipe->ring_size - 1;
 			struct pipe_buffer *buf;
-			struct page *page = pipe->tmp_page;
+			struct page *page;
 			int copied;
 
-			if (!page) {
-				page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
-				if (unlikely(!page)) {
-					ret = ret ? : -ENOMEM;
-					break;
-				}
-				pipe->tmp_page = page;
+			page = anon_pipe_get_page(pipe);
+			if (unlikely(!page)) {
+				if (!ret)
+					ret = -ENOMEM;
+				break;
 			}
 
 			copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
 			if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
+				anon_pipe_put_page(pipe, page);
 				if (!ret)
 					ret = -EFAULT;
 				break;
 			}
 
 			pipe->head = head + 1;
-			pipe->tmp_page = NULL;
 			/* Insert it into the buffer array */
 			buf = &pipe->bufs[head & mask];
 			buf->page = page;
@@ -846,8 +864,10 @@ void free_pipe_info(struct pipe_inode_info *pipe)
 	if (pipe->watch_queue)
 		put_watch_queue(pipe->watch_queue);
 #endif
-	if (pipe->tmp_page)
-		__free_page(pipe->tmp_page);
+	for (i = 0; i < ARRAY_SIZE(pipe->tmp_page); i++) {
+		if (pipe->tmp_page[i])
+			__free_page(pipe->tmp_page[i]);
+	}
 	kfree(pipe->bufs);
 	kfree(pipe);
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..eb7994a1ff93 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,7 +72,7 @@ struct pipe_inode_info {
 #ifdef CONFIG_WATCH_QUEUE
 	bool note_loss;
 #endif
-	struct page *tmp_page;
+	struct page *tmp_page[2];
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
 	struct pipe_buffer *bufs;
-- 
2.43.0



* [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
  2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
  2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
  2025-03-03 23:04 ` [PATCH 2/3] pipe: cache 2 pages instead of 1 Mateusz Guzik
@ 2025-03-03 23:04 ` Mateusz Guzik
  2025-03-04 14:19   ` Peter Zijlstra
  2025-03-04  8:46 ` [PATCH 0/3] some pipe + wait stuff Christian Brauner
  3 siblings, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-03 23:04 UTC (permalink / raw)
  To: torvalds
  Cc: oleg, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel, Mateusz Guzik

In the vast majority of cases the condition determining whether the thread
can proceed is true after the first wake up.

However, even in that case the thread ends up calling into
prepare_to_wait_event() again, suffering a spurious irq + lock trip.

Then it calls into finish_wait() to unlink itself.

Note that in case of a pending signal the work done by
prepare_to_wait_event() gets ignored even without the change.

Pre-check the condition after waking up instead.

Stats gathered during a kernel build:
bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
		 { @[probe] = count(); }'

@[kprobe:finish_wait]: 392483
@[kprobe:prepare_to_wait_event]: 778690

That is, calls to prepare_to_wait_event() are almost double the calls to
finish_wait(). They even out with the patch.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

One may worry about using "condition" twice. However, macros leading up
to this one already do it, so it should be fine.

Also one may wonder about fences -- to my understanding going off and on
CPU guarantees a full fence, so the now avoided lock trip changes
nothing.

 include/linux/wait.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2bdc8f47963b..965a19809c7e 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -316,6 +316,9 @@ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
 		}								\
 										\
 		cmd;								\
+										\
+		if (condition)							\
+			break;							\
 	}									\
 	finish_wait(&wq_head, &__wq_entry);					\
 __out:	__ret;									\
-- 
2.43.0



* Re: [PATCH 0/3] some pipe + wait stuff
  2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
                   ` (2 preceding siblings ...)
  2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
@ 2025-03-04  8:46 ` Christian Brauner
  3 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2025-03-04  8:46 UTC (permalink / raw)
  To: Mateusz Guzik, peterz
  Cc: Christian Brauner, oleg, mingo, rostedt, linux-kernel,
	linux-fsdevel, torvalds

On Tue, 04 Mar 2025 00:04:06 +0100, Mateusz Guzik wrote:
> As a side effect of looking at the pipe hang I came up with 3 changes to
> consider for -next.
> 
> The first one is a trivial cleanup which I won't mind if it merely gets
> folded into someone else's change for pipes.
> 
> The second one reduces page alloc/free calls for the backing area (60%
> fewer during a kernel build in my testing). I already posted this, but
> the Cc list was not right.
> 
> [...]

This looks sane to me.
Would be good to get an Ack from Peter on the wait change.

---

Applied to the vfs-6.15.pipe branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.pipe branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.pipe

[1/3] pipe: drop an always true check in anon_pipe_write()
      https://git.kernel.org/vfs/vfs/c/a40cd5849dab
[2/3] pipe: cache 2 pages instead of 1
      https://git.kernel.org/vfs/vfs/c/46af8e2406c2
[3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
      https://git.kernel.org/vfs/vfs/c/84654c7f4730


* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
  2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
@ 2025-03-04 14:07   ` Oleg Nesterov
  2025-03-04 14:58     ` Mateusz Guzik
  2025-03-04 15:44     ` pipes && EPOLLET, again Oleg Nesterov
  0 siblings, 2 replies; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 14:07 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On 03/04, Mateusz Guzik wrote:
>
> @@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  
>  			if (!iov_iter_count(from))
>  				break;
> -		}
>  
> -		if (!pipe_full(head, pipe->tail, pipe->max_usage))
>  			continue;
> +		}

Reviewed-by: Oleg Nesterov <oleg@redhat.com>



It seems that we can also remove the unnecessary signal_pending()
check, but I need to recheck, and we need to clean up the poll_usage
logic first.

This will also remove the unnecessary wakeups when the writer is
interrupted by a signal.

diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..ed55a86ca03b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -541,12 +541,6 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 				ret = -EAGAIN;
 			break;
 		}
-		if (signal_pending(current)) {
-			if (!ret)
-				ret = -ERESTARTSYS;
-			break;
-		}
-
 		/*
 		 * We're going to release the pipe lock and wait for more
 		 * space. We wake up any readers if necessary, and then
@@ -554,10 +548,11 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		mutex_unlock(&pipe->mutex);
-		if (was_empty)
+		if (was_empty || pipe->poll_usage)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
+		if (wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)) < 0)
+			return ret ?: -ERESTARTSYS;
 		mutex_lock(&pipe->mutex);
 		was_empty = pipe_empty(pipe->head, pipe->tail);
 		wake_next_writer = true;



* Re: [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
  2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
@ 2025-03-04 14:19   ` Peter Zijlstra
  2025-03-04 15:25     ` Mateusz Guzik
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2025-03-04 14:19 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: torvalds, oleg, brauner, mingo, rostedt, linux-kernel,
	linux-fsdevel

On Tue, Mar 04, 2025 at 12:04:09AM +0100, Mateusz Guzik wrote:
> In the vast majority of cases the condition determining whether the thread
> can proceed is true after the first wake up.
> 
> However, even in that case the thread ends up calling into
> prepare_to_wait_event() again, suffering a spurious irq + lock trip.
> 
> Then it calls into finish_wait() to unlink itself.
> 
> Note that in case of a pending signal the work done by
> prepare_to_wait_event() gets ignored even without the change.
> 
> Pre-check the condition after waking up instead.
> 
> Stats gathered during a kernel build:
> bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
> 		 { @[probe] = count(); }'
> 
> @[kprobe:finish_wait]: 392483
> @[kprobe:prepare_to_wait_event]: 778690
> 
> That is, calls to prepare_to_wait_event() are almost double the calls to
> finish_wait(). They even out with the patch.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> 
> One may worry about using "condition" twice. However, macros leading up
> to this one already do it, so it should be fine.
> 
> Also one may wonder about fences -- to my understanding going off and on
> CPU guarantees a full fence, so the now avoided lock trip changes
> nothing.

It always helps if you provide actual numbers. Supposedly this makes
it go faster?

Also, how many bytes does it add to the image?

Anyway, no real objection, but it would be good to have better
substantiation etc.

>  include/linux/wait.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 2bdc8f47963b..965a19809c7e 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -316,6 +316,9 @@ extern void init_wait_entry(struct wait_queue_entry *wq_entry, int flags);
>  		}								\
>  										\
>  		cmd;								\
> +										\
> +		if (condition)							\
> +			break;							\
>  	}									\
>  	finish_wait(&wq_head, &__wq_entry);					\
>  __out:	__ret;									\
> -- 
> 2.43.0
> 


* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
  2025-03-04 14:07   ` Oleg Nesterov
@ 2025-03-04 14:58     ` Mateusz Guzik
  2025-03-04 15:18       ` Oleg Nesterov
  2025-03-04 15:44     ` pipes && EPOLLET, again Oleg Nesterov
  1 sibling, 1 reply; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-04 14:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On Tue, Mar 4, 2025 at 3:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 03/04, Mateusz Guzik wrote:
> >
> > @@ -529,10 +529,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >
> >                       if (!iov_iter_count(from))
> >                               break;
> > -             }
> >
> > -             if (!pipe_full(head, pipe->tail, pipe->max_usage))
> >                       continue;
> > +             }
>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
>

thanks

> It seems that we can also remove the unnecessary signal_pending()
> check, but I need to recheck, and we need to clean up the poll_usage
> logic first.
>
> This will also remove the unnecessary wakeups when the writer is
> interrupted by a signal.
>
[snip]

There are many touch-ups to do here; I don't have an opinion on this diff.

I don't have compiled stats handy, but a few months back I asked some
people to query pipe write sizes with dtrace on FreeBSD. Everything is
very heavily skewed towards < 128 bytes in total, including tons of
1-byte writes (and no, it's not just make). However, there are also
quite a few absolutely massive multipage ops -- north of 64K even.
Doing that kind of op in one go is way faster than restarting rep mov
every time, interleaved with SMAP toggling, which is expensive in its
own right.

Thus I would argue someone(tm) should do it in Linux, but I don't have
immediate plans. Perhaps you would be happy to do it? :)

In the meantime, I would hope any refactoring in the area would make
the above easier (my patch does, I'm not claiming yours does not :P)
-- 
Mateusz Guzik <mjguzik gmail.com>


* Re: [PATCH 1/3] pipe: drop an always true check in anon_pipe_write()
  2025-03-04 14:58     ` Mateusz Guzik
@ 2025-03-04 15:18       ` Oleg Nesterov
  0 siblings, 0 replies; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 15:18 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: torvalds, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On 03/04, Mateusz Guzik wrote:
>
> On Tue, Mar 4, 2025 at 3:08 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > It seems that we can also remove the unnecessary signal_pending()
> > check, but I need to recheck, and we need to clean up the poll_usage
> > logic first.
> >
> > This will also remove the unnecessary wakeups when the writer is
> > interrupted by a signal.
> >
>
> There are many touch-ups to do here; I don't have an opinion on this diff.

...

> Thus I would argue someone(tm) should do it in Linux, but I don't have
> immediate plans. Perhaps you would be happy to do it? :)

Probably. In any case this needs a separate patch; sorry for the confusion.

Oleg.



* Re: [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event()
  2025-03-04 14:19   ` Peter Zijlstra
@ 2025-03-04 15:25     ` Mateusz Guzik
  0 siblings, 0 replies; 14+ messages in thread
From: Mateusz Guzik @ 2025-03-04 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, oleg, brauner, mingo, rostedt, linux-kernel,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]

On Tue, Mar 4, 2025 at 3:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 04, 2025 at 12:04:09AM +0100, Mateusz Guzik wrote:
> > In the vast majority of cases the condition determining whether the thread
> > can proceed is true after the first wake up.
> >
> > However, even in that case the thread ends up calling into
> > prepare_to_wait_event() again, suffering a spurious irq + lock trip.
> >
> > Then it calls into finish_wait() to unlink itself.
> >
> > Note that in case of a pending signal the work done by
> > prepare_to_wait_event() gets ignored even without the change.
> >
> > Pre-check the condition after waking up instead.
> >
> > Stats gathered during a kernel build:
> > bpftrace -e 'kprobe:prepare_to_wait_event,kprobe:finish_wait \
> >                { @[probe] = count(); }'
> >
> > @[kprobe:finish_wait]: 392483
> > @[kprobe:prepare_to_wait_event]: 778690
> >
> > That is, calls to prepare_to_wait_event() are almost double the calls to
> > finish_wait(). They even out with the patch.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > One may worry about using "condition" twice. However, macros leading up
> > to this one already do it, so it should be fine.
> >
> > Also one may wonder about fences -- to my understanding going off and on
> > CPU guarantees a full fence, so the now avoided lock trip changes
> > nothing.
>
> so it always helps if you provide actual numbers. Supposedly this makes
> it go faster?
>
> Also, how much bytes does it add to the image?
>
> Anyway, no real objection, but it would be good to have better
> substantiation etc.
>

I did not bother benchmarking this per se, but I did demonstrate with
bpftrace above that this does avoid the irq and lock trips, which is
normally the desired outcome.

It is a fair point that in this case this trades off against i-cache
footprint.

bloat-o-meter claims:
add/remove: 2/2 grow/shrink: 159/5 up/down: 3865/-250 (3615)
[..]
Total: Before=29946099, After=29949714, chg +0.01%

The biggest growth, by a large margin:
will_read_block.part                           -     183    +183
fuse_get_req                                 800     924    +124
load_module                                 9137    9255    +118
fuse_do_readfolio                            415     528    +113
fuse_page_mkwrite                            157     259    +102

On the other hand the shrinkage:
port_fops_read                               504     503      -1
__ps2_command                               1402    1401      -1
fcntl_setlk                                  961     947     -14
__pfx_fuse_wait_on_folio_writeback            16       -     -16
load_compressed_image                       4007    3974     -33
uprobe_notify_resume                        3384    3323     -61
fuse_wait_on_folio_writeback                 124       -    -124

The entire thing is attached.


--
Mateusz Guzik <mjguzik gmail.com>

[-- Attachment #2: bloat.txt --]
[-- Type: text/plain, Size: 11097 bytes --]

add/remove: 2/2 grow/shrink: 159/5 up/down: 3865/-250 (3615)
Function                                     old     new   delta
will_read_block.part                           -     183    +183
fuse_get_req                                 800     924    +124
load_module                                 9137    9255    +118
fuse_do_readfolio                            415     528    +113
fuse_page_mkwrite                            157     259    +102
wait_port_writable                           545     642     +97
ksm_scan_thread                             8189    8270     +81
sg_ioctl                                    2839    2918     +79
fuse_wait_on_folio_writeback.part            202     280     +78
ring_buffer_wait                             431     500     +69
__bio_queue_enter                            551     617     +66
drm_wait_vblank_ioctl                       1692    1756     +64
blk_queue_enter                              490     554     +64
add_transaction_credits                      700     762     +62
khugepaged                                  2197    2256     +59
sg_open                                     1663    1721     +58
sg_read                                     1523    1579     +56
fuse_dev_do_read                            2405    2459     +54
joydev_read                                 1106    1158     +52
btrfs_wait_block_group_cache_progress        370     420     +50
syslog_print                                 709     757     +48
usb_wait_anchor_empty_timeout                235     280     +45
mmu_interval_notifier_remove                 412     456     +44
ec_guard                                     510     554     +44
kcompactd                                    961    1004     +43
wait_for_tpm_stat                            669     711     +42
psi_rtpoll_worker                            906     948     +42
anon_pipe_write                             1581    1621     +40
ksgxd                                        357     394     +37
btrfs_commit_transaction                    3598    3634     +36
scsi_block_when_processing_errors            200     234     +34
save_compressed_image                       2695    2728     +33
virtio_gpu_init                             2249    2281     +32
regmap_async_complete.part                   413     444     +31
fuse_wait_on_page_writeback                  185     216     +31
fanotify_handle_event                       6056    6087     +31
crtc_crc_read                                718     749     +31
acpi_ec_stop                                 418     448     +30
fuse_readahead                               930     959     +29
pipe_wait_writable                           234     262     +28
evdev_read                                   643     671     +28
ps2_drain                                    322     349     +27
__tpm_tis_request_locality                   466     492     +26
throttle_direct_reclaim                      626     651     +25
ps2_do_sendbyte                              539     563     +24
drm_atomic_helper_wait_for_vblanks.part      538     562     +24
devkmsg_read                                 467     491     +24
set_termios.part                             637     660     +23
fuse_wait_aborted                            252     275     +23
btrfs_caching_ctl_wait_done                  177     200     +23
fuse_file_cached_io_open                     377     399     +22
btrfs_scrub_pause                            233     255     +22
virtballoon_free_page_report                 219     240     +21
locks_lock_inode_wait                        357     378     +21
usbhid_wait_io                               238     258     +20
serio_raw_read                               521     541     +20
request_wait_answer                          519     539     +20
btrfs_pause_balance                          264     284     +20
blk_mq_freeze_queue_wait_timeout             213     233     +20
anon_pipe_read                               970     990     +20
shpc_write_cmd                               656     675     +19
mousedev_read                                457     476     +19
btrfs_commit_transaction_async               253     272     +19
tell_host                                    226     244     +18
oom_killer_disable                           281     299     +18
virtgpu_virtio_get_uuid                      200     217     +17
pipe_wait_readable                           228     245     +17
async_synchronize_cookie_domain              285     302     +17
jbd2_journal_destroy                         768     784     +16
decompress_threadfn                          323     339     +16
crc32_threadfn                               317     333     +16
compress_threadfn                            344     360     +16
__pfx_will_read_block.part                     -      16     +16
wait_current_trans                           296     311     +15
pciehp_sysfs_enable_slot                     403     418     +15
pciehp_sysfs_disable_slot                    392     407     +15
module_patient_check_exists.isra             313     328     +15
btrfs_balance_delayed_items                  436     450     +14
genl_unregister_family                       480     493     +13
btrfs_rm_dev_replace_blocked                 159     172     +13
pci_dpc_recovered                            212     224     +12
jbd2_log_wait_commit                         269     281     +12
fuse_sync_fs                                 569     581     +12
fuse_set_nowrite                             212     224     +12
flush_scrub_stripes                          615     627     +12
drm_vblank_work_flush_all                    217     229     +12
drm_read                                     678     690     +12
blk_mq_freeze_queue_wait                     150     162     +12
rmw_rbio                                    2673    2684     +11
pseudo_lock_measure_trigger                  740     751     +11
percpu_ref_switch_to_atomic_sync             186     197     +11
flush_space                                 1395    1406     +11
btrfs_bio_counter_inc_blocked                256     267     +11
__usermodehelper_disable                     308     319     +11
wait_scrub_stripe_io                         138     148     +10
wait_for_device_probe                        148     158     +10
usbdev_ioctl                                5231    5241     +10
synchronize_rcu_expedited                    459     469     +10
serial_core_unregister_port                  739     749     +10
pci_doe_wait.constprop                       193     203     +10
oom_reaper                                   973     983     +10
kernfs_drain                                 305     315     +10
kauditd_thread                               677     687     +10
jbd2_journal_lock_updates                    210     220     +10
jbd2_journal_load                            891     901     +10
iosf_mbi_punit_acquire                       184     194     +10
iosf_mbi_block_punit_i2c_access              818     828     +10
fsnotify_destroy_group                       213     223     +10
expand_files                                 667     677     +10
exp_funnel_lock                              503     513     +10
dm_wait_event                                157     167     +10
cpuset_handle_hotplug                       3241    3251     +10
btrfs_wait_on_delayed_iputs                  160     170     +10
btrfs_scrub_cancel_dev                       255     265     +10
btrfs_scrub_cancel                           252     262     +10
btrfs_remount_begin                          221     231     +10
btrfs_drew_write_lock                        199     209     +10
__synchronize_irq                            148     158     +10
__scrub_blocked_if_needed                    185     195     +10
virtio_gpu_vram_mmap                         484     493      +9
virtio_gpu_get_caps_ioctl                    555     564      +9
try_flush_qgroup                             275     284      +9
start_this_handle                           1243    1252      +9
scrub_raid56_parity_stripe                  1328    1337      +9
pcie_wait_cmd                                568     577      +9
pci_wait_cfg                                 170     179      +9
ns_revision_read                             394     403      +9
mmu_interval_read_begin                      190     199      +9
drm_vblank_work_flush                        195     204      +9
dquot_disable                               1596    1605      +9
dm_kcopyd_client_destroy                     353     362      +9
cpufreq_freq_transition_begin                333     342      +9
cppc_set_perf                                885     894      +9
btrfs_cancel_balance                         339     348      +9
wake_up_and_wait_for_irq_thread_ready        156     164      +8
squashfs_cache_get                           799     807      +8
scrub_rbio_work_locked                      2250    2258      +8
rdtgroup_pseudo_lock_create                 1725    1733      +8
drm_wait_one_vblank                          473     481      +8
close_ctree                                 1258    1266      +8
btrfs_write_dirty_block_groups               910     918      +8
btrfs_start_ordered_extent_nowriteback       379     387      +8
btrfs_cleanup_transaction.isra              1373    1381      +8
autofs_wait                                 1763    1771      +8
wait_for_random_bytes                        235     242      +7
virtio_gpu_cursor_ping                       647     654      +7
usb_poison_urb                               176     183      +7
usb_kill_urb.part                            149     156      +7
rcu_sync_enter                               251     258      +7
io_sqd_handle_event                          305     312      +7
btrfs_drew_read_lock                         136     143      +7
__vt_event_wait.part                         122     129      +7
__percpu_ref_switch_mode                     458     465      +7
wb_wait_for_completion                       130     136      +6
hib_wait_io                                  131     137      +6
do_coredump                                 6667    6673      +6
xfs_pwork_poll                               186     191      +5
virtio_gpu_queue_ctrl_sgs                    648     653      +5
tty_wait_until_sent                          429     434      +5
submit_read_wait_bio_list                    373     377      +4
wait_for_commit                              401     402      +1
port_fops_read                               504     503      -1
__ps2_command                               1402    1401      -1
fcntl_setlk                                  961     947     -14
__pfx_fuse_wait_on_folio_writeback            16       -     -16
load_compressed_image                       4007    3974     -33
uprobe_notify_resume                        3384    3323     -61
fuse_wait_on_folio_writeback                 124       -    -124
Total: Before=29946099, After=29949714, chg +0.01%

^ permalink raw reply	[flat|nested] 14+ messages in thread

* pipes && EPOLLET, again
  2025-03-04 14:07   ` Oleg Nesterov
  2025-03-04 14:58     ` Mateusz Guzik
@ 2025-03-04 15:44     ` Oleg Nesterov
  2025-03-04 18:12       ` Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 15:44 UTC (permalink / raw)
  To: Linus Torvalds, Mateusz Guzik
  Cc: brauner, mingo, peterz, rostedt, linux-kernel, linux-fsdevel

Linus,

On 03/04, Oleg Nesterov wrote:
>
> and we need to cleanup the poll_usage
> logic first.

We have already discussed this before, I'll probably do this later,
but let's forget it for now.

Don't we need the trivial one-liner below anyway?

I am not saying this is a bug, but please consider

	#include <stdio.h>
	#include <stdbool.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/epoll.h>
	#include <assert.h>

	static char buf[16 * 4096];

	int main(void)
	{
		int pfd[2], efd;
		struct epoll_event evt = { .events = EPOLLIN | EPOLLET };

		pipe(pfd);
		efd = epoll_create1(0);
		epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &evt);

		write(pfd[1], buf, 4096);
		assert(epoll_wait(efd, &evt, 1, 0) == 1);

		if (!fork()) {
			write(pfd[1], buf, sizeof(buf));
			assert(0);
		}

		sleep(1);
		assert(epoll_wait(efd, &evt, 1, 0) == 1);

		return 0;
	}

the 2nd epoll_wait() fails, despite the fact that the child has already
written 15 * PAGE_SIZE bytes. This doesn't look consistent to me...

Oleg.
---

diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..8a32257cc74f 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -554,7 +554,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		mutex_unlock(&pipe->mutex);
-		if (was_empty)
+		if (was_empty || pipe->poll_usage)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));



* Re: pipes && EPOLLET, again
  2025-03-04 15:44     ` pipes && EPOLLET, again Oleg Nesterov
@ 2025-03-04 18:12       ` Linus Torvalds
  2025-03-04 19:32         ` Oleg Nesterov
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2025-03-04 18:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Linus,
>
> On 03/04, Oleg Nesterov wrote:
> >
> > and we need to cleanup the poll_usage
> > logic first.
>
> We have already discussed this before, I'll probably do this later,
> but lets forget it for now.
>
> Don't we need the trivial one-liner below anyway?

See this email of mine:

  https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/

and the last paragraph in particular.

The whole "poll_usage" thing is a kernel *hack* to deal with broken
user space that expected garbage semantics that aren't real, and were
never real.

They were a random implementation detail, and they were *wrong*.

But because we have a strict "we don't break user space" rule, we
introduced that completely bogus hack to say "ok, we'll send these
completely wrong extraneous events despite the fact that nothing has
changed, because some broken user space program was written to expect
them".

> I am not saying this is a bug, but please consider

That program is buggy, and we're not adding new hacks for new bugs.

If you ask for an edge-triggered EPOLL event, you get an *edge*
triggered EPOLL event. And there is no edge - the pipe was already
readable, no edge anywhere in sight.

So to clarify: we added the hack to work around *existing* user space
bugs that happened to work before.

If anything, we might consider removing the crazy "poll_usage" hack in
the (probably futile) hope that user space has been fixed.

But we're not encouraging *new* bugs.

               Linus


* Re: pipes && EPOLLET, again
  2025-03-04 18:12       ` Linus Torvalds
@ 2025-03-04 19:32         ` Oleg Nesterov
  2025-03-04 19:49           ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Oleg Nesterov @ 2025-03-04 19:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On 03/04, Linus Torvalds wrote:
>
> On Tue, 4 Mar 2025 at 05:45, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Don't we need the trivial one-liner below anyway?
>
> See this email of mine:
>
>   https://lore.kernel.org/all/CAHk-=wiCRwRFi0kGwd_Uv+Xv4HOB-ivHyUp9it6CNSmrKT4gOA@mail.gmail.com/
>
> and the last paragraph in particular.
>
> The whole "poll_usage" thing is a kernel *hack* to deal with broken
> user space that expected garbage semantics that aren't real, and were
> never real.

Yes, agreed. We can make this hack more understandable, but as I said,
this is off-topic right now.

> introduced that completely bogus hack to say "ok, we'll send these
> completely wrong extraneous events despite the fact that nothing has
> changed, because some broken user space program was written to expect
> them".

Yes, but since we already have this hack:

> That program is buggy, and we're not adding new hacks for new bugs.

Yes, but see below...

> If you ask for an edge-triggered EPOLL event, you get an *edge*
> triggered EPOLL event. And there is no edge - the pipe was already
> readable, no edge anywhere in sight.

Yes, the pipe was already readable before fork, but this condition
was already "consumed" by the 1st epoll_wait(EPOLLET). Please see below.

> If anything, we might consider removing the crazy "poll_usage" hack in
> the (probably futile) hope that user space has been fixed.

This would be the best option ;) Until then:

I agree that my test case is "buggy", but afaics it is not buggier than
userspace programs which rely on the unconditional kill_fasync()'s in
pipe_read/pipe_write?

So, anon_pipe_write() does

	if (was_empty)
		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);

before wait_event(pipe->wr_wait), but before return it does

	if (was_empty || pipe->poll_usage)
		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
	kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);

and this looks confusing to me.

If pipe_write() doesn't take poll_usage into account before wait_event(wr_wait),
then by the same logic it doesn't need the kill_fasync() there either?

So I won't argue, but why not make both cases more consistent?

Oleg.



* Re: pipes && EPOLLET, again
  2025-03-04 19:32         ` Oleg Nesterov
@ 2025-03-04 19:49           ` Linus Torvalds
  0 siblings, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2025-03-04 19:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mateusz Guzik, brauner, mingo, peterz, rostedt, linux-kernel,
	linux-fsdevel

On Tue, 4 Mar 2025 at 09:32, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I agree that my test case is "buggy", but afaics it is not buggier than
> userspace programs which rely on the unconditional kill_fasync()'s in
> pipe_read/pipe_write?

I'm not convinced any such users actually exist.

The reason kill_fasync() is unconditional is that it's cheap. The
normal situation is "nobody there", and we test that without any
locking.

So we've never bothered to make any changes to that path, and there's
never been any real reason to have any "was_empty" like conditionals.

               Linus


end of thread, other threads:[~2025-03-04 19:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-03 23:04 [PATCH 0/3] some pipe + wait stuff Mateusz Guzik
2025-03-03 23:04 ` [PATCH 1/3] pipe: drop an always true check in anon_pipe_write() Mateusz Guzik
2025-03-04 14:07   ` Oleg Nesterov
2025-03-04 14:58     ` Mateusz Guzik
2025-03-04 15:18       ` Oleg Nesterov
2025-03-04 15:44     ` pipes && EPOLLET, again Oleg Nesterov
2025-03-04 18:12       ` Linus Torvalds
2025-03-04 19:32         ` Oleg Nesterov
2025-03-04 19:49           ` Linus Torvalds
2025-03-03 23:04 ` [PATCH 2/3] pipe: cache 2 pages instead of 1 Mateusz Guzik
2025-03-03 23:04 ` [PATCH 3/3] wait: avoid spurious calls to prepare_to_wait_event() in ___wait_event() Mateusz Guzik
2025-03-04 14:19   ` Peter Zijlstra
2025-03-04 15:25     ` Mateusz Guzik
2025-03-04  8:46 ` [PATCH 0/3] some pipe + wait stuff Christian Brauner
