* [PATCH] pipe_read: don't wake up the writer if the pipe is still full
@ 2025-01-02 14:07 Oleg Nesterov
2025-01-02 16:20 ` WangYuli
` (3 more replies)
0 siblings, 4 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-01-02 14:07 UTC (permalink / raw)
To: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells
Cc: WangYuli, linux-fsdevel, linux-kernel
wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
will check the pipe_writable() == !pipe_full() condition and sleep again.
Only wake the writer if we actually released a pipe buf, and the pipe was
full before we did so.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/pipe.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/pipe.c b/fs/pipe.c
index 12b22c2723b7..82fede0f2111 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
-	bool was_full, wake_next_reader = false;
+	bool wake_writer = false, wake_next_reader = false;
 	ssize_t ret;
 
 	/* Null read succeeds. */
@@ -264,14 +264,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	mutex_lock(&pipe->mutex);
 
 	/*
-	 * We only wake up writers if the pipe was full when we started
-	 * reading in order to avoid unnecessary wakeups.
+	 * We only wake up writers if the pipe was full when we started reading
+	 * and it is no longer full after reading to avoid unnecessary wakeups.
 	 *
 	 * But when we do wake up writers, we do so using a sync wakeup
 	 * (WF_SYNC), because we want them to get going and generate more
 	 * data for us.
 	 */
-	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
 	for (;;) {
 		/* Read ->head with a barrier vs post_one_notification() */
 		unsigned int head = smp_load_acquire(&pipe->head);
@@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 				buf->len = 0;
 			}
 
-			if (!buf->len)
+			if (!buf->len) {
+				wake_writer |= pipe_full(head, tail, pipe->max_usage);
 				tail = pipe_update_tail(pipe, buf, tail);
+			}
 			total_len -= chars;
 			if (!total_len)
 				break;	/* common path: read succeeded */
@@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	 * _very_ unlikely case that the pipe was full, but we got
 	 * no data.
 	 */
-	if (unlikely(was_full))
+	if (unlikely(wake_writer))
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 
@@ -390,15 +391,15 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
 			return -ERESTARTSYS;
 
-		mutex_lock(&pipe->mutex);
-		was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+		wake_writer = false;
 		wake_next_reader = true;
+		mutex_lock(&pipe->mutex);
 	}
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);
 
-	if (was_full)
+	if (wake_writer)
 		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
 	if (wake_next_reader)
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
--
2.25.1.362.g51ebf55
^ permalink raw reply related	[flat|nested] 109+ messages in thread

* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
@ 2025-01-02 16:20 ` WangYuli
  2025-01-02 16:46   ` Oleg Nesterov
  2025-01-04  8:42 ` Christian Brauner
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 109+ messages in thread
From: WangYuli @ 2025-01-02 16:20 UTC (permalink / raw)
To: Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells
Cc: linux-fsdevel, linux-kernel, yushengjin, zhangdandan, chenyichong

[-- Attachment #1.1.1.1: Type: text/plain, Size: 4388 bytes --]

[Adding some of my colleagues who were part of the original submission
to the CC list for their information.]

On 2025/1/2 22:07, Oleg Nesterov wrote:
> wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
> the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
> will check the pipe_writable() == !pipe_full() condition and sleep again.
>
> Only wake the writer if we actually released a pipe buf, and the pipe was
> full before we did so.

As Linus said, for fs/pipe, he "want any patches to be very clearly
documented," perhaps we should include a link to the original discussion
here.

Link: https://lore.kernel.org/all/75B06EE0B67747ED+20241225094202.597305-1-wangyuli@uniontech.com/
Link: https://lore.kernel.org/all/20241229135737.GA3293@redhat.com/

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Reported-by: WangYuli <wangyuli@uniontech.com>

I'm happy to provide more test results for this patch if it's not too late.
> ---
>  fs/pipe.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> [...]

Hmm..

Initially, the sole purpose of our original patch was to simply check if
there were any waiting processes in the process wait queue to avoid
unnecessary wake-ups, for both reads and writes.

And then, sincerely thank you all for taking the time to review it!

While your patch and ours share some little similarities, our primary
goals may vary slightly. Do you have any suggestions on how we could
better achieve our original objective?

Thanks,
-- 
WangYuli

[-- Attachment #1.1.1.2: Type: text/html, Size: 5812 bytes --]

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 645 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-02 16:20 ` WangYuli
@ 2025-01-02 16:46   ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-01-02 16:46 UTC (permalink / raw)
To: WangYuli
Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells,
	linux-fsdevel, linux-kernel, yushengjin, zhangdandan, chenyichong

On 01/03, WangYuli wrote:
>
> [Adding some of my colleagues who were part of the original submission to
> the CC list for their information.]

OK,

> perhaps we should include a link to the original discussion
>
> Link: https://lore.kernel.org/all/75B06EE0B67747ED+20241225094202.597305-1-wangyuli@uniontech.com/

...

> Reported-by: WangYuli <wangyuli@uniontech.com>

WangYuli, this patch has nothing to do with your original patch and the
discussion above.

> I'm happy to provide more test results for this patch if it's not too late.

Would be great, but I don't think this patch can make any difference
performance-wise in practice. Short reads are not that common, I guess.

> Hmm..
> Initially, the sole purpose of our original patch was to simply check if
> there were any waiting processes in the process wait queue to avoid
> unnecessary wake-ups, for both reads and writes.

Exactly. So once again, this patch is orthogonal to the possible
wq_has_sleeper() optimizations.

> Do you have any suggestions on how we could better
> achieve our original objective?

See wakeup_pipe_readers/writers() && pipe_poll() in

https://lore.kernel.org/all/20250102163320.GA17691@redhat.com/

Oleg.

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
  2025-01-02 16:20 ` WangYuli
@ 2025-01-04  8:42 ` Christian Brauner
  2025-01-31  9:49 ` K Prateek Nayak
  2025-02-24  9:26 ` Sapkal, Swapnil
  3 siblings, 0 replies; 109+ messages in thread
From: Christian Brauner @ 2025-01-04 8:42 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Brauner, WangYuli, linux-fsdevel, linux-kernel,
	Manfred Spraul, Linus Torvalds, David Howells

On Thu, 02 Jan 2025 15:07:15 +0100, Oleg Nesterov wrote:
> wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
> the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
> will check the pipe_writable() == !pipe_full() condition and sleep again.
>
> Only wake the writer if we actually released a pipe buf, and the pipe was
> full before we did so.
>
> [...]

Applied to the vfs-6.14.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.14.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.14.misc

[1/1] pipe_read: don't wake up the writer if the pipe is still full
      https://git.kernel.org/vfs/vfs/c/b004b4d254e7

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
  2025-01-02 16:20 ` WangYuli
  2025-01-04  8:42 ` Christian Brauner
@ 2025-01-31  9:49 ` K Prateek Nayak
  2025-01-31 13:23   ` Oleg Nesterov
  2025-01-31 20:06   ` Linus Torvalds
  2025-02-24  9:26 ` Sapkal, Swapnil
  3 siblings, 2 replies; 109+ messages in thread
From: K Prateek Nayak @ 2025-01-31 9:49 UTC (permalink / raw)
To: Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells
Cc: WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy,
	Swapnil Sapkal, Neeraj Upadhyay

Hello Oleg,

On 1/2/2025 7:37 PM, Oleg Nesterov wrote:
> wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
> the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
> will check the pipe_writable() == !pipe_full() condition and sleep again.
>
> Only wake the writer if we actually released a pipe buf, and the pipe was
> full before we did so.

I noticed a performance regression in perf bench sched messaging at
higher utilization (larger number of groups) with this patch on the
mainline kernel. For lower utilization, this patch yields good
improvements but once the system is oversubscribed, the tale flips.
Following are the results from my testing on mainline at commit
05dbaf8dd8bf ("Merge tag 'x86-urgent-2025-01-28' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") with and
without this patch:

    ==================================================================
    Test          : sched-messaging
    cmdline       : perf bench sched messaging -p -t -l 100000 -g <groups>
    Units         : Normalized time in seconds
    Interpretation: Lower is better
    Statistic     : AMean
    ==================================================================
    Case:          mainline[pct imp](CV)      revert[pct imp](CV)
     1-groups     1.00 [ -0.00](12.29)      1.26 [-25.91]( 2.71)
     2-groups     1.00 [ -0.00]( 3.64)      1.39 [-38.53]( 0.89)
     4-groups     1.00 [ -0.00]( 3.33)      1.41 [-41.42]( 1.21)
     8-groups     1.00 [ -0.00]( 2.90)      1.10 [ -9.89]( 0.95)
    16-groups     1.00 [ -0.00]( 1.46)      0.66 [ 34.46]( 1.59)

On my 3rd Generation EPYC system (2 x 64C/128T), I see that on
reverting the changes on the above mentioned commit, sched-messaging
sees a regression up until the 8 group case which contains 320 tasks,
however with 16 groups (640 tasks), the revert helps with performance.

Based on the trend in the performance, one can deduce that at lower
utilization, sched-messaging benefits from not traversing the wake up
path unnecessarily since wake_up_interruptible_sync_poll() acquires a
lock before checking if the wait queue is empty or not thus saving on
system time. However, at high utilization, there is likely a writer
waiting to write to the pipe by the time the wait queue is inspected.
Following are the perf profile comparing the mainline with the revert:

o 1-group (4.604s [mainline] vs 8.163s [revert])

  sudo ./perf record -C 0-7,64-127 -e ibs_op/cnt_ctl=1/ -- taskset -c 0-7,64-127 ./perf bench sched messaging -p -t -l 100000 -g 1

  (sched-messaging was pinned to 1 CCX and only that CCX was profiled
  using IBS to reduce noise)

  mainline:  Samples: 606K of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 205972485144
  revert:    Samples: 479K of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 200365591518

  Overhead  Command          Shared Object      Symbol (mainline)                   Overhead  Command          Shared Object      Symbol (revert)
     4.80%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_safe_ret                5.12%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_safe_ret
     4.10%  sched-messaging  [kernel.kallsyms]  [k] rep_movs_alternative               4.30%  sched-messaging  [kernel.kallsyms]  [k] rep_movs_alternative
     3.24%  sched-messaging  [kernel.kallsyms]  [k] osq_lock                           3.42%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_return_thunk
     3.23%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_return_thunk            3.31%  sched-messaging  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
     3.13%  sched-messaging  [kernel.kallsyms]  [k] syscall_exit_to_user_mode          2.71%  sched-messaging  [kernel.kallsyms]  [k] osq_lock
     2.44%  sched-messaging  [kernel.kallsyms]  [k] pipe_write                         2.64%  sched-messaging  [kernel.kallsyms]  [k] pipe_write
     2.38%  sched-messaging  [kernel.kallsyms]  [k] pipe_read                          2.34%  sched-messaging  [kernel.kallsyms]  [k] do_syscall_64
     2.23%  sched-messaging  [kernel.kallsyms]  [k] do_syscall_64                      2.33%  sched-messaging  [kernel.kallsyms]  [k] pipe_read
     2.19%  sched-messaging  [kernel.kallsyms]  [k] mutex_spin_on_owner                2.10%  sched-messaging  [kernel.kallsyms]  [k] fdget_pos
     2.05%  swapper          [kernel.kallsyms]  [k] native_sched_clock                 1.97%  sched-messaging  [kernel.kallsyms]  [k] vfs_write
     1.94%  sched-messaging  [kernel.kallsyms]  [k] fdget_pos                          1.93%  sched-messaging  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
     1.88%  sched-messaging  [kernel.kallsyms]  [k] vfs_read                           1.91%  sched-messaging  [kernel.kallsyms]  [k] vfs_read
  [...]

---

o 16-groups (11.895s [Mainline] vs 8.163s [revert])

  sudo ./perf record -a -e ibs_op/cnt_ctl=1/ -- ./perf bench sched messaging -p -t -l 100000 -g 1

  (Whole system was profiled since there are 640 tasks on a 256CPU setup)

  mainline:  Samples: 10M of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 3257434807546
  revert:    Samples: 6M of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 3115778240381

  Overhead  Command          Shared Object      Symbol (mainline)                   Overhead  Command          Shared Object      Symbol (revert)
     5.07%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_safe_ret                5.28%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_safe_ret
     4.24%  sched-messaging  [kernel.kallsyms]  [k] rep_movs_alternative               4.55%  sched-messaging  [kernel.kallsyms]  [k] rep_movs_alternative
     3.42%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_return_thunk            3.56%  sched-messaging  [kernel.kallsyms]  [k] srso_alias_return_thunk
     3.26%  sched-messaging  [kernel.kallsyms]  [k] syscall_exit_to_user_mode          3.44%  sched-messaging  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
     2.55%  sched-messaging  [kernel.kallsyms]  [k] pipe_write                         2.78%  sched-messaging  [kernel.kallsyms]  [k] pipe_write
     2.51%  sched-messaging  [kernel.kallsyms]  [k] osq_lock                           2.48%  sched-messaging  [kernel.kallsyms]  [k] do_syscall_64
     2.38%  sched-messaging  [kernel.kallsyms]  [k] pipe_read                          2.47%  sched-messaging  [kernel.kallsyms]  [k] pipe_read
     2.31%  sched-messaging  [kernel.kallsyms]  [k] do_syscall_64                      2.15%  sched-messaging  [kernel.kallsyms]  [k] fdget_pos
     2.11%  sched-messaging  [kernel.kallsyms]  [k] mutex_spin_on_owner                2.12%  sched-messaging  [kernel.kallsyms]  [k] vfs_write
     2.00%  sched-messaging  [kernel.kallsyms]  [k] fdget_pos                          2.03%  sched-messaging  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
  [...]

---

For 1-groups I see "osq_lock" turning slightly hotter on mainline
compared to the revert probably suggesting more optimistic spinning on
the "pipe->mutex". For the 16-client case, I see that
"native_queued_spin_lock_slowpath" jumps up with the revert.

Adding --call-graph when profiling completely alters the profile but in
case of the revert, I was able to see which paths lead to
"native_queued_spin_lock_slowpath" with 16-groups case:

  Overhead  Command          Shared Object      Symbol
  -  4.21%  sched-messaging  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
     - 2.77% native_queued_spin_lock_slowpath
        - 1.52% _raw_spin_lock_irqsave
           - 1.35% prepare_to_wait_event
              - 1.34% pipe_write
                   vfs_write
                   ksys_write
                   do_syscall_64
                   entry_SYSCALL_64
                   __GI___libc_write
                   write (inlined)
                   start_thread
        - 1.25% _raw_spin_lock
           - 1.25% raw_spin_rq_lock_nested
              - 0.95% __task_rq_lock
                 - try_to_wake_up
                    - 0.95% autoremove_wake_function
                         __wake_up_common
                         __wake_up_sync_key

---

>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/pipe.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 12b22c2723b7..82fede0f2111 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  	size_t total_len = iov_iter_count(to);
>  	struct file *filp = iocb->ki_filp;
>  	struct pipe_inode_info *pipe = filp->private_data;
> -	bool was_full, wake_next_reader = false;
> +	bool wake_writer = false, wake_next_reader = false;
>  	ssize_t ret;
>
>  	/* Null read succeeds. */
> @@ -264,14 +264,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  	mutex_lock(&pipe->mutex);
>
>  	/*
> -	 * We only wake up writers if the pipe was full when we started
> -	 * reading in order to avoid unnecessary wakeups.
> +	 * We only wake up writers if the pipe was full when we started reading
> +	 * and it is no longer full after reading to avoid unnecessary wakeups.
>  	 *
>  	 * But when we do wake up writers, we do so using a sync wakeup
>  	 * (WF_SYNC), because we want them to get going and generate more
>  	 * data for us.
>  	 */
> -	was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
>  	for (;;) {
>  		/* Read ->head with a barrier vs post_one_notification() */
>  		unsigned int head = smp_load_acquire(&pipe->head);
> @@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  			buf->len = 0;
>  		}
>
> -		if (!buf->len)
> +		if (!buf->len) {
> +			wake_writer |= pipe_full(head, tail, pipe->max_usage);
>  			tail = pipe_update_tail(pipe, buf, tail);
> +		}
>  		total_len -= chars;
>  		if (!total_len)
>  			break;	/* common path: read succeeded */
> @@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  	 * _very_ unlikely case that the pipe was full, but we got
>  	 * no data.
>  	 */
> -	if (unlikely(was_full))
> +	if (unlikely(wake_writer))
>  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>  	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
>
> @@ -390,15 +391,15 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  		if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
>  			return -ERESTARTSYS;
>
> -		mutex_lock(&pipe->mutex);
> -		was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
> +		wake_writer = false;
>  		wake_next_reader = true;
> +		mutex_lock(&pipe->mutex);
>  	}

Looking at the performance trend, I tried the following (possibly dumb)
experiment on top of mainline:

diff --git a/fs/pipe.c b/fs/pipe.c
index 82fede0f2111..43d827f99c55 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -395,6 +395,19 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		wake_next_reader = true;
 		mutex_lock(&pipe->mutex);
 	}
+
+	if (!wake_writer && !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		/*
+		 * Proactively wake up writers if the pipe is not full.
+		 * This smp_mb() pairs with another barrier in ___wait_event(),
+		 * see more details in comments of waitqueue_active().
+		 */
+		smp_mb();
+
+		if (waitqueue_active(&pipe->wr_wait))
+			wake_writer = true;
+	}
+
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
 	mutex_unlock(&pipe->mutex);

base-commit: 05dbaf8dd8bf537d4b4eb3115ab42a5fb40ff1f5
--

and I see that the performance at lower utilization is closer to the
mainline, whereas at higher utilization it is close to that with this
patch reverted:

==================================================================
Test          : sched-messaging
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:        mainline[pct imp](CV)    revert[pct imp](CV)     patched[pct imp](CV)
 1-groups     1.00 [ -0.00](12.29)    1.26 [-25.91]( 2.71)    0.96 [  4.05]( 1.61)
 2-groups     1.00 [ -0.00]( 3.64)    1.39 [-38.53]( 0.89)    1.05 [ -5.26]( 0.93)
 4-groups     1.00 [ -0.00]( 3.33)    1.41 [-41.42]( 1.21)    1.04 [ -4.18]( 1.38)
 8-groups     1.00 [ -0.00]( 2.90)    1.10 [ -9.89]( 0.95)    0.84 [ 16.07]( 1.55)
16-groups     1.00 [ -0.00]( 1.46)    0.66 [ 34.46]( 1.59)    0.50 [ 49.55]( 1.91)

The rationale was that at higher utilization, perhaps there is a delay in
the wakeup of writers from the time the tail was moved, but looking at all
the synchronization with "pipe->mutex", it is highly unlikely and I do not
have a good explanation for why this helps (or if it is even correct).

Following are some system-wide aggregates of schedstats on each kernel
running the 16-group variant, collected using perf sched stats [0]:

sudo ./perf sched stats record -- ./perf bench sched messaging -p -t -l 100000 -g 16

kernel                                                 :   mainline                revert                 patched
runtime                                                :    11.418s                7.207s                 6.278s
sched_yield() count                                    :          0                     0                      0
Legacy counter can be ignored                          :          0                     0                      0
schedule() called                                      :     402376                403424                 172432
schedule() left the processor idle                     :     144622 ( 35.94% )     142240 ( 35.26% )       56732 ( 32.90% )
try_to_wake_up() was called                            :     237032                241834                 101645
try_to_wake_up() was called to wake up the local cpu   :       1064 (  0.45% )      16656 (  6.89% )       12385 ( 12.18% )
total runtime by tasks on this processor (in jiffies)  : 9072083005            5516672721             5105984838
total waittime by tasks on this processor (in jiffies) : 4380309658 ( 48.28% ) 7304939649 ( 132.42% ) 6120940564 ( 119.88% )
total timeslices run on this cpu                       :     257644                261129                 115628

[0] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/

The trend seems to be higher local CPU wakeups, albeit with more wait
time, but that doesn't seem to hurt the progress of sched-messaging.

>  	if (pipe_empty(pipe->head, pipe->tail))
>  		wake_next_reader = false;
>  	mutex_unlock(&pipe->mutex);
>
> -	if (was_full)
> +	if (wake_writer)
>  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>  	if (wake_next_reader)
>  		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);

If you need any more information from my test setup, please do let me
know. All tests were run on a dual socket 3rd Generation EPYC system
(2 x 64C/128T) running in NPS1 mode with C2 disabled and boost enabled.

--
Thanks and Regards,
Prateek

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-31  9:49   ` K Prateek Nayak
@ 2025-01-31 13:23   ` Oleg Nesterov
  2025-01-31 20:06   ` Linus Torvalds
  1 sibling, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-01-31 13:23 UTC (permalink / raw)
To: K Prateek Nayak, Oliver Sang, Mateusz Guzik
Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells,
  WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy,
  Swapnil Sapkal, Neeraj Upadhyay

(Add Oliver and Mateusz)

On 01/31, K Prateek Nayak wrote:
>
> On 1/2/2025 7:37 PM, Oleg Nesterov wrote:
> > wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
> > the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
> > will check the pipe_writable() == !pipe_full() condition and sleep again.
> >
> > Only wake the writer if we actually released a pipe buf, and the pipe was
> > full before we did so.
>
> I noticed a performance regression in perf bench sched messaging at
> higher utilization (larger number of groups) with this patch on the
> mainline kernel. For lower utilization, this patch yields good
> improvements but once the system is oversubscribed, the tale flips.

Thanks a lot Prateek for your investigations.

I wasn't aware of tools/perf/bench/sched-messaging.c, but it seems to do
the same thing as hackbench. So this was already reported, plus other
"random" regressions and improvements caused by this patch. See

https://lore.kernel.org/all/202501201311.6d25a0b9-lkp@intel.com/

Yes, if the system is oversubscribed, then the early/unnecessary wakeup is
not necessarily bad, but I still can't fully understand why this patch
makes a noticeable difference in this case. I can't even understand why
(with or without this patch) the readers sleep on rd_wait MUCH more often
than the writers on wr_wait, maybe because pipe_write() is generally
slower than pipe_read() ...
I promised to (try to) investigate on the previous weekend, but I am a
lazy dog, sorry! I'll try to do it this weekend. Perhaps it would be
better to simply revert this patch...

As for the change you propose... At first glance it doesn't look right to
me, but this needs another discussion. At least it can be simplified,
afaics.

As for the waitqueue_active() check, it probably makes sense (before
wake_up) regardless, and this connects to another (confusing) discussion,
please see

https://lore.kernel.org/all/75B06EE0B67747ED+20241225094202.597305-1-wangyuli@uniontech.com/

Thanks!

Oleg.

> Following are the results from my testing on mainline at commit
> 05dbaf8dd8bf ("Merge tag 'x86-urgent-2025-01-28' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> with and without this patch:
>
> ==================================================================
> Test          : sched-messaging
> cmdline       : perf bench sched messaging -p -t -l 100000 -g <groups>
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:        mainline[pct imp](CV)    revert[pct imp](CV)
>  1-groups     1.00 [ -0.00](12.29)    1.26 [-25.91]( 2.71)
>  2-groups     1.00 [ -0.00]( 3.64)    1.39 [-38.53]( 0.89)
>  4-groups     1.00 [ -0.00]( 3.33)    1.41 [-41.42]( 1.21)
>  8-groups     1.00 [ -0.00]( 2.90)    1.10 [ -9.89]( 0.95)
> 16-groups     1.00 [ -0.00]( 1.46)    0.66 [ 34.46]( 1.59)
>
> On my 3rd Generation EPYC system (2 x 64C/128T), I see that on reverting
> the changes on the above mentioned commit, sched-messaging sees a
> regression up until the 8 group case which contains 320 tasks, however
> with 16 groups (640 tasks), the revert helps with performance.
> > Based on the trend in the performance, one can deduce that at lower > utilization, sched-messaging benefits from not traversing the wake up > path unnecessarily since wake_up_interruptible_sync_poll() acquires a > lock before checking if the wait queue is empty or not thus saving on > system time. However, at high utilization, there is likely a writer > waiting to write to the pipe by the time the wait queue is inspected. > > Following are the perf profile comparing the mainline with the revert: > > o 1-group (4.604s [mainline] vs 8.163s [revert]) > > sudo ./perf record -C 0-7,64-127 -e ibs_op/cnt_ctl=1/ -- taskset -c 0-7,64-127 ./perf bench sched messaging -p -t -l 100000 -g 1 > > (sched-messaging was pinned to 1 CCX and only that CCX was profiled > using IBS to reduce noise) > > mainline vs revert > > Samples: 606K of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 205972485144 Samples: 479K of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 200365591518 > Overhead Command Shared Object Symbol Overhead Command Shared Object Symbol > 4.80% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret 5.12% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret > 4.10% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative 4.30% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative > 3.24% sched-messaging [kernel.kallsyms] [k] osq_lock 3.42% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk > 3.23% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk 3.31% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode > 3.13% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode 2.71% sched-messaging [kernel.kallsyms] [k] osq_lock > 2.44% sched-messaging [kernel.kallsyms] [k] pipe_write 2.64% sched-messaging [kernel.kallsyms] [k] pipe_write > 2.38% sched-messaging [kernel.kallsyms] [k] pipe_read 2.34% sched-messaging [kernel.kallsyms] [k] do_syscall_64 > 2.23% sched-messaging [kernel.kallsyms] [k] do_syscall_64 2.33% 
sched-messaging [kernel.kallsyms] [k] pipe_read > 2.19% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner 2.10% sched-messaging [kernel.kallsyms] [k] fdget_pos > 2.05% swapper [kernel.kallsyms] [k] native_sched_clock 1.97% sched-messaging [kernel.kallsyms] [k] vfs_write > 1.94% sched-messaging [kernel.kallsyms] [k] fdget_pos 1.93% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack > 1.88% sched-messaging [kernel.kallsyms] [k] vfs_read 1.91% sched-messaging [kernel.kallsyms] [k] vfs_read > 1.87% swapper [kernel.kallsyms] [k] psi_group_change 1.89% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner > 1.85% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack 1.78% sched-messaging [kernel.kallsyms] [k] current_time > 1.83% sched-messaging [kernel.kallsyms] [k] vfs_write 1.77% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission > 1.68% sched-messaging [kernel.kallsyms] [k] current_time 1.72% sched-messaging [kernel.kallsyms] [k] aa_file_perm > 1.67% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission 1.66% sched-messaging [kernel.kallsyms] [k] rw_verify_area > 1.64% sched-messaging [kernel.kallsyms] [k] aa_file_perm 1.59% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe > 1.56% sched-messaging [kernel.kallsyms] [k] rw_verify_area 1.38% sched-messaging [kernel.kallsyms] [k] _copy_from_iter > 1.50% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 1.38% sched-messaging [kernel.kallsyms] [k] ktime_get_coarse_real_ts64_mg > 1.36% sched-messaging [kernel.kallsyms] [k] ktime_get_coarse_real_ts64_mg 1.37% sched-messaging [kernel.kallsyms] [k] native_sched_clock > 1.33% sched-messaging [kernel.kallsyms] [k] native_sched_clock 1.36% swapper [kernel.kallsyms] [k] native_sched_clock > 1.29% sched-messaging libc.so.6 [.] read 1.34% sched-messaging libc.so.6 [.] 
__GI___libc_write > 1.29% sched-messaging [kernel.kallsyms] [k] _copy_from_iter 1.30% sched-messaging [kernel.kallsyms] [k] _copy_to_iter > 1.28% sched-messaging [kernel.kallsyms] [k] _copy_to_iter 1.29% sched-messaging libc.so.6 [.] read > 1.20% sched-messaging libc.so.6 [.] __GI___libc_write 1.23% swapper [kernel.kallsyms] [k] psi_group_change > 1.19% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0 1.10% sched-messaging [kernel.kallsyms] [k] psi_group_change > 1.07% swapper [kernel.kallsyms] [k] srso_alias_safe_ret 1.06% sched-messaging [kernel.kallsyms] [k] atime_needs_update > 1.04% sched-messaging [kernel.kallsyms] [k] atime_needs_update 1.00% sched-messaging [kernel.kallsyms] [k] security_file_permission > 0.98% sched-messaging [kernel.kallsyms] [k] security_file_permission 0.97% sched-messaging [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0 > 0.97% sched-messaging [kernel.kallsyms] [k] psi_group_change 0.94% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter > 0.96% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter 0.93% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret > 0.88% sched-messaging [kernel.kallsyms] [k] ksys_read 0.91% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter > 0.87% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret 0.90% sched-messaging [kernel.kallsyms] [k] ksys_write > 0.86% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter 0.89% sched-messaging [kernel.kallsyms] [k] ksys_read > 0.85% sched-messaging [kernel.kallsyms] [k] ksys_write 0.82% sched-messaging [kernel.kallsyms] [k] fput > 0.79% sched-messaging [kernel.kallsyms] [k] fsnotify_pre_content 0.82% sched-messaging [kernel.kallsyms] [k] fsnotify_pre_content > 0.77% sched-messaging [kernel.kallsyms] [k] fput 0.78% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0 > 0.71% sched-messaging [kernel.kallsyms] [k] mutex_lock 0.78% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath 
> 0.71% sched-messaging [kernel.kallsyms] [k] __rcu_read_unlock 0.75% sched-messaging [kernel.kallsyms] [k] __rcu_read_unlock > 0.71% swapper [kernel.kallsyms] [k] srso_alias_return_thunk 0.73% sched-messaging [kernel.kallsyms] [k] mutex_lock > 0.68% sched-messaging [kernel.kallsyms] [k] x64_sys_call 0.70% sched-messaging [kernel.kallsyms] [k] x64_sys_call > 0.68% sched-messaging [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0.69% swapper [kernel.kallsyms] [k] srso_alias_safe_ret > 0.65% swapper [kernel.kallsyms] [k] menu_select 0.59% sched-messaging [kernel.kallsyms] [k] __rcu_read_lock > 0.57% sched-messaging [kernel.kallsyms] [k] __schedule 0.59% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time.part.0 > 0.56% sched-messaging [kernel.kallsyms] [k] page_copy_sane 0.57% sched-messaging [kernel.kallsyms] [k] page_copy_sane > 0.54% sched-messaging [kernel.kallsyms] [k] __rcu_read_lock 0.52% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair > 0.53% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time.part.0 0.51% sched-messaging libc.so.6 [.] __GI___pthread_enable_asynccancel > 0.48% sched-messaging libc.so.6 [.] 
__GI___pthread_enable_asynccancel 0.50% sched-messaging [kernel.kallsyms] [k] fpregs_assert_state_consistent > 0.48% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 0.50% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 > 0.48% swapper [kernel.kallsyms] [k] save_fpregs_to_fpstate 0.49% sched-messaging [kernel.kallsyms] [k] mutex_unlock > 0.48% sched-messaging [kernel.kallsyms] [k] mutex_unlock 0.49% sched-messaging [kernel.kallsyms] [k] __schedule > 0.47% sched-messaging [kernel.kallsyms] [k] dequeue_entity 0.48% sched-messaging [kernel.kallsyms] [k] update_load_avg > 0.46% sched-messaging [kernel.kallsyms] [k] fpregs_assert_state_consistent 0.48% sched-messaging [kernel.kallsyms] [k] _raw_spin_lock_irqsave > 0.46% sched-messaging [kernel.kallsyms] [k] update_load_avg 0.48% sched-messaging [kernel.kallsyms] [k] cpu_util > 0.46% sched-messaging [kernel.kallsyms] [k] __update_load_avg_se 0.47% swapper [kernel.kallsyms] [k] srso_alias_return_thunk > 0.45% sched-messaging [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.46% sched-messaging [kernel.kallsyms] [k] __update_load_avg_se > 0.37% swapper [kernel.kallsyms] [k] __schedule 0.45% sched-messaging [kernel.kallsyms] [k] __update_load_avg_cfs_rq > 0.36% swapper [kernel.kallsyms] [k] enqueue_entity 0.45% swapper [kernel.kallsyms] [k] menu_select > 0.35% sched-messaging perf [.] sender 0.38% sched-messaging perf [.] sender > 0.34% sched-messaging [kernel.kallsyms] [k] file_update_time 0.37% sched-messaging [kernel.kallsyms] [k] file_update_time > 0.34% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter 0.36% sched-messaging [kernel.kallsyms] [k] _find_next_and_bit > 0.33% sched-messaging perf [.] 
receiver 0.34% sched-messaging [kernel.kallsyms] [k] dequeue_entity > 0.32% sched-messaging [kernel.kallsyms] [k] __cond_resched 0.33% sched-messaging [kernel.kallsyms] [k] update_curr > --- > > o 16-groups (11.895s [Mainline] vs 8.163s [revert]) > > sudo ./perf record -a -e ibs_op/cnt_ctl=1/ -- ./perf bench sched messaging -p -t -l 100000 -g 1 > > (Whole system was profiled since there are 640 tasks on a 256CPU > setup) > > mainline vs revert > > Samples: 10M of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 3257434807546 Samples: 6M of event 'ibs_op/cnt_ctl=1/', Event count (approx.): 3115778240381 > Overhead Command Shared Object Symbol Overhead Command Shared Object Symbol > 5.07% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret 5.28% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret > 4.24% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative 4.55% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative > 3.42% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk 3.56% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk > 3.26% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode 3.44% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode > 2.55% sched-messaging [kernel.kallsyms] [k] pipe_write 2.78% sched-messaging [kernel.kallsyms] [k] pipe_write > 2.51% sched-messaging [kernel.kallsyms] [k] osq_lock 2.48% sched-messaging [kernel.kallsyms] [k] do_syscall_64 > 2.38% sched-messaging [kernel.kallsyms] [k] pipe_read 2.47% sched-messaging [kernel.kallsyms] [k] pipe_read > 2.31% sched-messaging [kernel.kallsyms] [k] do_syscall_64 2.15% sched-messaging [kernel.kallsyms] [k] fdget_pos > 2.11% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner 2.12% sched-messaging [kernel.kallsyms] [k] vfs_write > 2.00% sched-messaging [kernel.kallsyms] [k] fdget_pos 2.03% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack > 1.93% sched-messaging [kernel.kallsyms] [k] vfs_write 1.97% 
sched-messaging [kernel.kallsyms] [k] vfs_read > 1.90% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack 1.92% sched-messaging [kernel.kallsyms] [k] native_sched_clock > 1.88% sched-messaging [kernel.kallsyms] [k] vfs_read 1.87% sched-messaging [kernel.kallsyms] [k] psi_group_change > 1.77% sched-messaging [kernel.kallsyms] [k] native_sched_clock 1.87% sched-messaging [kernel.kallsyms] [k] current_time > 1.74% sched-messaging [kernel.kallsyms] [k] current_time 1.83% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission > 1.70% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission 1.79% sched-messaging [kernel.kallsyms] [k] aa_file_perm > 1.67% sched-messaging [kernel.kallsyms] [k] aa_file_perm 1.73% sched-messaging [kernel.kallsyms] [k] rw_verify_area > 1.61% sched-messaging [kernel.kallsyms] [k] rw_verify_area 1.66% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe > 1.60% sched-messaging [kernel.kallsyms] [k] psi_group_change 1.48% sched-messaging [kernel.kallsyms] [k] ktime_get_coarse_real_ts64_mg > 1.56% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 1.46% sched-messaging [kernel.kallsyms] [k] _copy_from_iter > 1.38% sched-messaging [kernel.kallsyms] [k] ktime_get_coarse_real_ts64_mg 1.39% sched-messaging libc.so.6 [.] __GI___libc_write > 1.37% sched-messaging [kernel.kallsyms] [k] _copy_from_iter 1.39% sched-messaging [kernel.kallsyms] [k] _copy_to_iter > 1.31% sched-messaging [kernel.kallsyms] [k] _copy_to_iter 1.37% sched-messaging libc.so.6 [.] read > 1.31% sched-messaging libc.so.6 [.] read 1.10% sched-messaging [kernel.kallsyms] [k] atime_needs_update > 1.28% sched-messaging libc.so.6 [.] 
__GI___libc_write 1.07% swapper [kernel.kallsyms] [k] native_sched_clock > 1.23% swapper [kernel.kallsyms] [k] native_sched_clock 1.05% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > 1.04% sched-messaging [kernel.kallsyms] [k] atime_needs_update 1.05% sched-messaging [kernel.kallsyms] [k] security_file_permission > 0.99% sched-messaging [kernel.kallsyms] [k] security_file_permission 1.00% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter > 0.99% swapper [kernel.kallsyms] [k] psi_group_change 0.97% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter > 0.96% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter 0.97% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret > 0.91% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter 0.96% sched-messaging [kernel.kallsyms] [k] ksys_write > 0.90% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret 0.95% sched-messaging [kernel.kallsyms] [k] ksys_read > 0.90% sched-messaging [kernel.kallsyms] [k] ksys_read 0.85% sched-messaging [kernel.kallsyms] [k] fsnotify_pre_content > 0.90% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0 0.84% sched-messaging [kernel.kallsyms] [k] fput > 0.88% sched-messaging [kernel.kallsyms] [k] ksys_write 0.82% swapper [kernel.kallsyms] [k] psi_group_change > 0.80% sched-messaging [kernel.kallsyms] [k] fput 0.80% sched-messaging [kernel.kallsyms] [k] __rcu_read_unlock > 0.80% sched-messaging [kernel.kallsyms] [k] fsnotify_pre_content 0.76% sched-messaging [kernel.kallsyms] [k] x64_sys_call > 0.74% sched-messaging [kernel.kallsyms] [k] mutex_lock 0.76% sched-messaging [kernel.kallsyms] [k] mutex_lock > 0.73% sched-messaging [kernel.kallsyms] [k] __rcu_read_unlock 0.69% sched-messaging [kernel.kallsyms] [k] __schedule > 0.70% sched-messaging [kernel.kallsyms] [k] x64_sys_call 0.67% sched-messaging [kernel.kallsyms] [k] osq_lock > 0.69% swapper [kernel.kallsyms] [k] srso_alias_safe_ret 0.64% swapper [kernel.kallsyms] [k] 
srso_alias_safe_ret > 0.63% sched-messaging [kernel.kallsyms] [k] __schedule 0.62% sched-messaging [kernel.kallsyms] [k] __rcu_read_lock > 0.62% sched-messaging [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0.61% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time.part.0 > 0.57% sched-messaging [kernel.kallsyms] [k] page_copy_sane 0.61% sched-messaging [kernel.kallsyms] [k] page_copy_sane > 0.57% sched-messaging [kernel.kallsyms] [k] __rcu_read_lock 0.59% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair > 0.56% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time.part.0 0.54% sched-messaging [kernel.kallsyms] [k] _raw_spin_lock_irqsave > 0.52% sched-messaging [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0 0.53% sched-messaging libc.so.6 [.] __GI___pthread_enable_asynccancel > 0.49% sched-messaging [kernel.kallsyms] [k] restore_fpregs_from_fpstate 0.53% sched-messaging [kernel.kallsyms] [k] update_load_avg > 0.49% sched-messaging libc.so.6 [.] __GI___pthread_enable_asynccancel 0.52% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 > 0.49% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 0.52% sched-messaging [kernel.kallsyms] [k] fpregs_assert_state_consistent > 0.49% sched-messaging [kernel.kallsyms] [k] __update_load_avg_se 0.52% sched-messaging [kernel.kallsyms] [k] __update_load_avg_se > 0.49% sched-messaging [kernel.kallsyms] [k] mutex_unlock 0.52% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner > 0.48% sched-messaging [kernel.kallsyms] [k] update_load_avg 0.51% sched-messaging [kernel.kallsyms] [k] mutex_unlock > 0.47% sched-messaging [kernel.kallsyms] [k] fpregs_assert_state_consistent 0.47% sched-messaging [kernel.kallsyms] [k] __update_load_avg_cfs_rq > 0.46% swapper [kernel.kallsyms] [k] srso_alias_return_thunk 0.43% swapper [kernel.kallsyms] [k] srso_alias_return_thunk > 0.46% sched-messaging [kernel.kallsyms] [k] __update_load_avg_cfs_rq 0.41% sched-messaging [kernel.kallsyms] [k] 
__mutex_lock.constprop.0 > 0.43% swapper [kernel.kallsyms] [k] menu_select 0.41% sched-messaging [kernel.kallsyms] [k] dequeue_entity > 0.39% sched-messaging [kernel.kallsyms] [k] dequeue_entity 0.40% sched-messaging [kernel.kallsyms] [k] update_curr > 0.39% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath 0.39% sched-messaging perf [.] sender > 0.38% sched-messaging [kernel.kallsyms] [k] update_curr 0.39% sched-messaging [kernel.kallsyms] [k] file_update_time > 0.37% sched-messaging perf [.] sender 0.37% sched-messaging [kernel.kallsyms] [k] psi_task_switch > 0.35% sched-messaging [kernel.kallsyms] [k] file_update_time 0.37% swapper [kernel.kallsyms] [k] menu_select > 0.35% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair 0.34% sched-messaging perf [.] receiver > 0.34% sched-messaging [kernel.kallsyms] [k] psi_task_switch 0.32% sched-messaging [kernel.kallsyms] [k] __calc_delta.constprop.0 > --- > > For 1-groups I see "osq_lock" turning slightly hotter on mainline > compared to the revert probably suggesting more optimistic spinning > on the "pipe->mutex". > > For the 16-client case, I see that "native_queued_spin_lock_slowpath" > jumps up with the revert. 
> > Adding --call-graph when profiling completely alters the profile but > in case of the revert, I was able to see which paths lead to > "native_queued_spin_lock_slowpath" with 16-groups case: > > > Overhead Command Shared Object Symbol > - 4.21% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath > - 2.77% native_queued_spin_lock_slowpath > - 1.52% _raw_spin_lock_irqsave > - 1.35% prepare_to_wait_event > - 1.34% pipe_write > vfs_write > ksys_write > do_syscall_64 > entry_SYSCALL_64 > __GI___libc_write > write (inlined) > start_thread > - 1.25% _raw_spin_lock > - 1.25% raw_spin_rq_lock_nested > - 0.95% __task_rq_lock > - try_to_wake_up > - 0.95% autoremove_wake_function > __wake_up_common > __wake_up_sync_key > --- > > > > >Signed-off-by: Oleg Nesterov <oleg@redhat.com> > >--- > > fs/pipe.c | 19 ++++++++++--------- > > 1 file changed, 10 insertions(+), 9 deletions(-) > > > >diff --git a/fs/pipe.c b/fs/pipe.c > >index 12b22c2723b7..82fede0f2111 100644 > >--- a/fs/pipe.c > >+++ b/fs/pipe.c > >@@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > size_t total_len = iov_iter_count(to); > > struct file *filp = iocb->ki_filp; > > struct pipe_inode_info *pipe = filp->private_data; > >- bool was_full, wake_next_reader = false; > >+ bool wake_writer = false, wake_next_reader = false; > > ssize_t ret; > > /* Null read succeeds. */ > >@@ -264,14 +264,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > mutex_lock(&pipe->mutex); > > /* > >- * We only wake up writers if the pipe was full when we started > >- * reading in order to avoid unnecessary wakeups. > >+ * We only wake up writers if the pipe was full when we started reading > >+ * and it is no longer full after reading to avoid unnecessary wakeups. > > * > > * But when we do wake up writers, we do so using a sync wakeup > > * (WF_SYNC), because we want them to get going and generate more > > * data for us. 
> > */ > >- was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > > for (;;) { > > /* Read ->head with a barrier vs post_one_notification() */ > > unsigned int head = smp_load_acquire(&pipe->head); > >@@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > buf->len = 0; > > } > >- if (!buf->len) > >+ if (!buf->len) { > >+ wake_writer |= pipe_full(head, tail, pipe->max_usage); > > tail = pipe_update_tail(pipe, buf, tail); > >+ } > > total_len -= chars; > > if (!total_len) > > break; /* common path: read succeeded */ > >@@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > * _very_ unlikely case that the pipe was full, but we got > > * no data. > > */ > >- if (unlikely(was_full)) > >+ if (unlikely(wake_writer)) > > wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > >@@ -390,15 +391,15 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > > if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > > return -ERESTARTSYS; > >- mutex_lock(&pipe->mutex); > >- was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > >+ wake_writer = false; > > wake_next_reader = true; > >+ mutex_lock(&pipe->mutex); > > } > > Looking at the performance trend, I tried the following (possibly dumb) > experiment on top of mainline: > > diff --git a/fs/pipe.c b/fs/pipe.c > index 82fede0f2111..43d827f99c55 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -395,6 +395,19 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > wake_next_reader = true; > mutex_lock(&pipe->mutex); > } > + > + if (!wake_writer && !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) { > + /* > + * Proactively wake up writers if the pipe is not full. > + * This smp_mb() pairs with another barrier in ___wait_event(), > + * see more details in comments of waitqueue_active(). 
> + */ > + smp_mb(); > + > + if (waitqueue_active(&pipe->wr_wait)) > + wake_writer = true; > + } > + > if (pipe_empty(pipe->head, pipe->tail)) > wake_next_reader = false; > mutex_unlock(&pipe->mutex); > > base-commit: 05dbaf8dd8bf537d4b4eb3115ab42a5fb40ff1f5 > -- > > and I see that the perfomance at lower utilization is closer to the > mainline whereas at higher utlization, it is close to that with this > patch reverted: > > ================================================================== > Test : sched-messaging > Units : Normalized time in seconds > Interpretation: Lower is better > Statistic : AMean > ================================================================== > Case: mainline[pct imp](CV) revert[pct imp](CV) patched[pct imp](CV) > 1-groups 1.00 [ -0.00](12.29) 1.26 [-25.91]( 2.71) 0.96 [ 4.05]( 1.61) > 2-groups 1.00 [ -0.00]( 3.64) 1.39 [-38.53]( 0.89) 1.05 [ -5.26]( 0.93) > 4-groups 1.00 [ -0.00]( 3.33) 1.41 [-41.42]( 1.21) 1.04 [ -4.18]( 1.38) > 8-groups 1.00 [ -0.00]( 2.90) 1.10 [ -9.89]( 0.95) 0.84 [ 16.07]( 1.55) > 16-groups 1.00 [ -0.00]( 1.46) 0.66 [ 34.46]( 1.59) 0.50 [ 49.55]( 1.91) > > The rationale was at higher utilization, perhaps there is a delay > in wakeup of writers from the time tail was moved but looking at all the > synchronization with "pipe->mutex", it is highly unlikely and I do not > have a good explanation for why this helps (or if it is even correct) > > Following are some system-wide aggregates of schedstats on each > kernel running the 16-group variant collected using perf sched > stats [0]: > > sudo ./perf sched stats report #cord -- ./perf bench sched messaging -p -t -l 100000 -g 16 > > kernel : mainline revert patched > runtime : 11.418s 7.207s 6.278s > sched_yield() count : 0 0 0 > Legacy counter can be ignored : 0 0 0 > schedule() called : 402376 403424 172432 > schedule() left the processor idle : 144622 ( 35.94% ) 142240 ( 35.26% ) 56732 ( 32.90% ) > try_to_wake_up() was called : 237032 241834 101645 > 
> try_to_wake_up() was called to wake up the local cpu : 1064 ( 0.45% )       16656 ( 6.89% )       12385 ( 12.18% )
> total runtime by tasks on this processor (in jiffies)  : 9072083005          5516672721            5105984838
> total waittime by tasks on this processor (in jiffies) : 4380309658 ( 48.28% ) 7304939649 ( 132.42% ) 6120940564 ( 119.88% )
> total timeslices run on this cpu                     : 257644                261129                115628
>
> [0] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/
>
> The trend seems to be higher local CPU wakeups albeit with more wait time
> but that doesn't seem to hurt the progress of sched-messaging.
>
> >  	if (pipe_empty(pipe->head, pipe->tail))
> >  		wake_next_reader = false;
> >  	mutex_unlock(&pipe->mutex);
> >-	if (was_full)
> >+	if (wake_writer)
> >  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
> >  	if (wake_next_reader)
> >  		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
>
> If you need any more information from my test setup, please do let me
> know. All tests were run on a dual socket 3rd Generation EPYC system
> (2 x 64C/128T) running in NPS1 mode with C2 disabled and boost enabled.
>
> --
> Thanks and Regards,
> Prateek

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-01-31 9:49 ` K Prateek Nayak 2025-01-31 13:23 ` Oleg Nesterov @ 2025-01-31 20:06 ` Linus Torvalds 2025-02-02 17:01 ` Oleg Nesterov 1 sibling, 1 reply; 109+ messages in thread From: Linus Torvalds @ 2025-01-31 20:06 UTC (permalink / raw) To: K Prateek Nayak Cc: Oleg Nesterov, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal, Neeraj Upadhyay On Fri, 31 Jan 2025 at 01:50, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > On my 3rd Generation EPYC system (2 x 64C/128T), I see that on reverting > the changes on the above mentioned commit, sched-messaging sees a > regression up until the 8 group case which contains 320 tasks, however > with 16 groups (640 tasks), the revert helps with performance. I suspect that the extra wakeups just end up perturbing timing, and then you just randomly get better performance on that particular test-case and machine. I'm not sure this is worth worrying about, unless there's a real load somewhere that shows this regression. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-31 20:06 ` Linus Torvalds
@ 2025-02-02 17:01 ` Oleg Nesterov
  2025-02-02 18:39   ` Linus Torvalds
  2025-02-03  9:05   ` K Prateek Nayak
  0 siblings, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-02 17:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: K Prateek Nayak, Manfred Spraul, Christian Brauner, David Howells,
	WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy,
	Swapnil Sapkal, Neeraj Upadhyay

On 01/31, Linus Torvalds wrote:
>
> On Fri, 31 Jan 2025 at 01:50, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >
> > On my 3rd Generation EPYC system (2 x 64C/128T), I see that on reverting
> > the changes on the above mentioned commit, sched-messaging sees a
> > regression up until the 8 group case which contains 320 tasks, however
> > with 16 groups (640 tasks), the revert helps with performance.
>
> I suspect that the extra wakeups just end up perturbing timing, and
> then you just randomly get better performance on that particular
> test-case and machine.
>
> I'm not sure this is worth worrying about, unless there's a real load
> somewhere that shows this regression.

Well yes, but the problem is that people seem to believe that hackbench
is the "real" workload, even in the "overloaded" case...

And if we do care about performance... Could you look at the trivial patch
at the end? I don't think {a,c,m}time make any sense in the !fifo case, but
as you explained before they are visible to fstat() so we probably shouldn't
remove file_accessed/file_update_time unconditionally.

This patch does help if I change hackbench to use pipe2(O_NOATIME) instead
of pipe().
And in fact it helps even in the simplest case:

	#define _GNU_SOURCE
	#include <assert.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <sys/time.h>
	#include <unistd.h>

	static char buf[17 * 4096];

	static struct timeval TW, TR;

	int wr(int fd, int size)
	{
		int c, r;
		struct timeval t0, t1;

		gettimeofday(&t0, NULL);
		for (c = 0; (r = write(fd, buf, size)) > 0; c += r);
		gettimeofday(&t1, NULL);
		timeradd(&TW, &t1, &TW);
		timersub(&TW, &t0, &TW);

		return c;
	}

	int rd(int fd, int size)
	{
		int c, r;
		struct timeval t0, t1;

		gettimeofday(&t0, NULL);
		for (c = 0; (r = read(fd, buf, size)) > 0; c += r);
		gettimeofday(&t1, NULL);
		timeradd(&TR, &t1, &TR);
		timersub(&TR, &t0, &TR);

		return c;
	}

	int main(int argc, const char *argv[])
	{
		int fd[2], nb = 1, noat, loop, size;

		assert(argc == 4);
		noat = atoi(argv[1]) ? O_NOATIME : 0;
		loop = atoi(argv[2]);
		size = atoi(argv[3]);

		assert(pipe2(fd, noat) == 0);
		assert(ioctl(fd[0], FIONBIO, &nb) == 0);
		assert(ioctl(fd[1], FIONBIO, &nb) == 0);

		assert(size <= sizeof(buf));
		while (loop--)
			assert(wr(fd[1], size) == rd(fd[0], size));

		printf("TW = %lu.%03lu\n", TW.tv_sec, TW.tv_usec/1000);
		printf("TR = %lu.%03lu\n", TR.tv_sec, TR.tv_usec/1000);

		return 0;
	}

Now,

	/# for i in 1 2 3; do /host/tmp/test 0 10000 100; done
	TW = 7.692
	TR = 5.704
	TW = 7.930
	TR = 5.858
	TW = 7.685
	TR = 5.697
	/#
	/# for i in 1 2 3; do /host/tmp/test 1 10000 100; done
	TW = 6.432
	TR = 4.533
	TW = 6.612
	TR = 4.638
	TW = 6.409
	TR = 4.523

Oleg.
--- diff --git a/fs/pipe.c b/fs/pipe.c index a3f5fd7256e9..14b2c0f8b616 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1122,6 +1122,9 @@ int create_pipe_files(struct file **res, int flags) } } + if (flags & O_NOATIME) + inode->i_flags |= S_NOCMTIME; + f = alloc_file_pseudo(inode, pipe_mnt, "", O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)), &pipefifo_fops); @@ -1134,7 +1137,7 @@ int create_pipe_files(struct file **res, int flags) f->private_data = inode->i_pipe; f->f_pipe = 0; - res[0] = alloc_file_clone(f, O_RDONLY | (flags & O_NONBLOCK), + res[0] = alloc_file_clone(f, O_RDONLY | (flags & (O_NONBLOCK | O_NOATIME)), &pipefifo_fops); if (IS_ERR(res[0])) { put_pipe_info(inode, inode->i_pipe); @@ -1154,7 +1157,7 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags) int error; int fdw, fdr; - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOTIFICATION_PIPE)) + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOATIME | O_NOTIFICATION_PIPE)) return -EINVAL; error = create_pipe_files(files, flags); ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-02 17:01 ` Oleg Nesterov @ 2025-02-02 18:39 ` Linus Torvalds 2025-02-02 19:32 ` Oleg Nesterov 2025-02-04 11:17 ` Christian Brauner 2025-02-03 9:05 ` K Prateek Nayak 1 sibling, 2 replies; 109+ messages in thread From: Linus Torvalds @ 2025-02-02 18:39 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal, Neeraj Upadhyay On Sun, 2 Feb 2025 at 09:02, Oleg Nesterov <oleg@redhat.com> wrote: > > And if we do care about performance... Could you look at the trivial patch > at the end? I don't think {a,c,m}time make any sense in the !fifo case, but > as you explained before they are visible to fstat() so we probably shouldn't > remove file_accessed/file_update_time unconditionally. I dislike that patch because if we actually want to do this, I don't think you are going far enough. Yeah, you may stop updating the time, but you still do that sb_start_write_trylock(), and you still call out to file_update_time(), and it's all fairly expensive. So the short-circuiting happens too late, and it happens using a flag that is non-standard and only with a system call that almost nobody actually uses (ie 'pipe2()' rather than the normal 'pipe()'). Put another way: if we really care about this, we should just be a lot more aggressive. Yes, the time is visible in fstat(). Yes, we've done this forever. But if the time update is such a big thing, let's go all in, and just see if anybody really notices? For example, for tty's, a few years ago we intentionally started doing time updates only every few seconds, because it was leaking keyboard timing information (see tty_update_time()). Nobody ever complained. So I'd actually favor a "let's just remove time updates entirely for unnamed pipes", and see if anybody notices. Simpler and more straightforward. 
And yes, maybe somebody *does* notice, and we'll have to revisit. IOW, if you care about this, I'd *much* rather try to do the big and simple approach _first_. Not do something small and timid that nobody will actually ever use and that complicates the code. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-02 18:39 ` Linus Torvalds @ 2025-02-02 19:32 ` Oleg Nesterov 2025-02-04 11:17 ` Christian Brauner 1 sibling, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-02-02 19:32 UTC (permalink / raw) To: Linus Torvalds Cc: K Prateek Nayak, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal, Neeraj Upadhyay On 02/02, Linus Torvalds wrote: > > On Sun, 2 Feb 2025 at 09:02, Oleg Nesterov <oleg@redhat.com> wrote: > > > > And if we do care about performance... Could you look at the trivial patch > > at the end? I don't think {a,c,m}time make any sense in the !fifo case, but > > as you explained before they are visible to fstat() so we probably shouldn't > > remove file_accessed/file_update_time unconditionally. > > I dislike that patch because if we actually want to do this, I don't > think you are going far enough. ... Oh yes, yes, I agree, and for the same reasons, including the unnecessary sb_start_write_trylock() even if it is likely very cheap. Plus it doesn't look consistent in that "f_flags & O_NOATIME" can be changed by fcntl() but "i_flags & S_NOCMTIME" can't be changed. Not to mention that this "feature" will probably be never used. In case it was not clear, I just tried to measure how much file_accessed/file_update_time hurt performance-wise. It turns out - a lot. And the ugly O_NOATIME knob simplifies the before/after testing. However, yes I was worried about fstat(). But, > So I'd actually favor a "let's just remove time updates entirely for > unnamed pipes", and see if anybody notices. Simpler and more > straightforward. OK, agreed. Will send the patch. Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-02 18:39 ` Linus Torvalds 2025-02-02 19:32 ` Oleg Nesterov @ 2025-02-04 11:17 ` Christian Brauner 1 sibling, 0 replies; 109+ messages in thread From: Christian Brauner @ 2025-02-04 11:17 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, K Prateek Nayak, Manfred Spraul, David Howells, WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal, Neeraj Upadhyay On Sun, Feb 02, 2025 at 10:39:16AM -0800, Linus Torvalds wrote: > On Sun, 2 Feb 2025 at 09:02, Oleg Nesterov <oleg@redhat.com> wrote: > > > > And if we do care about performance... Could you look at the trivial patch > > at the end? I don't think {a,c,m}time make any sense in the !fifo case, but > > as you explained before they are visible to fstat() so we probably shouldn't > > remove file_accessed/file_update_time unconditionally. > > I dislike that patch because if we actually want to do this, I don't > think you are going far enough. > > Yeah, you may stop updating the time, but you still do that > sb_start_write_trylock(), and you still call out to > file_update_time(), and it's all fairly expensive. > > So the short-circuiting happens too late, and it happens using a flag > that is non-standard and only with a system call that almost nobody > actually uses (ie 'pipe2()' rather than the normal 'pipe()'). > > Put another way: if we really care about this, we should just be a lot > more aggressive. > > Yes, the time is visible in fstat(). Yes, we've done this forever. But > if the time update is such a big thing, let's go all in, and just see > if anybody really notices? > > For example, for tty's, a few years ago we intentionally started doing > time updates only every few seconds, because it was leaking keyboard > timing information (see tty_update_time()). Nobody ever complained. > > So I'd actually favor a "let's just remove time updates entirely for > unnamed pipes", and see if anybody notices. 
Simpler and more > straightforward. > > And yes, maybe somebody *does* notice, and we'll have to revisit. > > IOW, if you care about this, I'd *much* rather try to do the big and > simple approach _first_. Not do something small and timid that nobody > will actually ever use and that complicates the code. Agreed. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-02 17:01 ` Oleg Nesterov 2025-02-02 18:39 ` Linus Torvalds @ 2025-02-03 9:05 ` K Prateek Nayak 2025-02-04 13:49 ` Oleg Nesterov 1 sibling, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-02-03 9:05 UTC (permalink / raw) To: Oleg Nesterov, Linus Torvalds Cc: Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal, Neeraj Upadhyay Hello Oleg, Thank you for pointing me to the regression reports and the relevant upstream discussions on the parallel thread. On 2/2/2025 10:31 PM, Oleg Nesterov wrote: > On 01/31, Linus Torvalds wrote: >> >> On Fri, 31 Jan 2025 at 01:50, K Prateek Nayak <kprateek.nayak@amd.com> wrote: >>> >>> On my 3rd Generation EPYC system (2 x 64C/128T), I see that on reverting >>> the changes on the above mentioned commit, sched-messaging sees a >>> regression up until the 8 group case which contains 320 tasks, however >>> with 16 groups (640 tasks), the revert helps with performance. >> >> I suspect that the extra wakeups just end up perturbing timing, and >> then you just randomly get better performance on that particular >> test-case and machine. >> >> I'm not sure this is worth worrying about, unless there's a real load >> somewhere that shows this regression. > > Well yes, but the problem is that people seem to believe that hackbench > is the "real" workload, even in the "overloaded" case... > > And if we do care about performance... Could you look at the trivial patch > at the end? I don't think {a,c,m}time make any sense in the !fifo case, but > as you explained before they are visible to fstat() so we probably shouldn't > remove file_accessed/file_update_time unconditionally. > > This patch does help if I change hackbench to uses pipe2(O_NOATIME) instead > of pipe(). 
And in fact it helps even in the simplest case: > > static char buf[17 * 4096]; > > static struct timeval TW, TR; > > int wr(int fd, int size) > { > int c, r; > struct timeval t0, t1; > > gettimeofday(&t0, NULL); > for (c = 0; (r = write(fd, buf, size)) > 0; c += r); > gettimeofday(&t1, NULL); > timeradd(&TW, &t1, &TW); > timersub(&TW, &t0, &TW); > > return c; > } > > int rd(int fd, int size) > { > int c, r; > struct timeval t0, t1; > > gettimeofday(&t0, NULL); > for (c = 0; (r = read(fd, buf, size)) > 0; c += r); > gettimeofday(&t1, NULL); > timeradd(&TR, &t1, &TR); > timersub(&TR, &t0, &TR); > > return c; > } > > int main(int argc, const char *argv[]) > { > int fd[2], nb = 1, noat, loop, size; > > assert(argc == 4); > noat = atoi(argv[1]) ? O_NOATIME : 0; > loop = atoi(argv[2]); > size = atoi(argv[3]); > > assert(pipe2(fd, noat) == 0); > assert(ioctl(fd[0], FIONBIO, &nb) == 0); > assert(ioctl(fd[1], FIONBIO, &nb) == 0); > > assert(size <= sizeof(buf)); > while (loop--) > assert(wr(fd[1], size) == rd(fd[0], size)); > > printf("TW = %lu.%03lu\n", TW.tv_sec, TW.tv_usec/1000); > printf("TR = %lu.%03lu\n", TR.tv_sec, TR.tv_usec/1000); > > return 0; > } > > > Now, > > /# for i in 1 2 3; do /host/tmp/test 0 10000 100; done > TW = 7.692 > TR = 5.704 > TW = 7.930 > TR = 5.858 > TW = 7.685 > TR = 5.697 > /# > /# for i in 1 2 3; do /host/tmp/test 1 10000 100; done > TW = 6.432 > TR = 4.533 > TW = 6.612 > TR = 4.638 > TW = 6.409 > TR = 4.523 > > Oleg. > --- > With the below patch on mainline, I see more improvements for a modified version of sched-messaging (sched-messaging is same as hackbench as you noted on the parallel thread) that uses pipe2(O_NOATIME) The original regression is still noticeable despite the improvements but if folks believe this is a corner case with the original changes exhibited by sched-messaging, I'll just continue further testing with the new baseline. 
That said, following are the results for the below patch: ================================================================== Test : sched-messaging cmdline : perf bench sched messaging -p -t -l 100000 -g <groups> Units : Normalized time in seconds Interpretation: Lower is better Statistic : AMean ================================================================== Case: mainline[pct imp](CV) patched[pct imp](CV) 1-groups 1.00 [ -0.00](12.29) 0.94 [ 6.76]( 12.59) 2-groups 1.00 [ -0.00]( 3.64) 0.95 [ 5.16]( 5.99) 4-groups 1.00 [ -0.00]( 3.33) 0.99 [ 1.03]( 1.89) 8-groups 1.00 [ -0.00]( 2.90) 1.00 [ 0.16]( 1.23) 16-groups 1.00 [ -0.00]( 1.46) 0.98 [ 2.01]( 0.98) Please feel free to add: Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> I'll give your generalized optimization a spin too when it comes out. Meanwhile, I'll go run a bunch of benchmarks to see if the original change has affected any other workload in my test bed. Thank you for looking into this. -- Thanks and Regards, Prateek > diff --git a/fs/pipe.c b/fs/pipe.c > index a3f5fd7256e9..14b2c0f8b616 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -1122,6 +1122,9 @@ int create_pipe_files(struct file **res, int flags) > } > } > > + if (flags & O_NOATIME) > + inode->i_flags |= S_NOCMTIME; > + > f = alloc_file_pseudo(inode, pipe_mnt, "", > O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)), > &pipefifo_fops); > @@ -1134,7 +1137,7 @@ int create_pipe_files(struct file **res, int flags) > f->private_data = inode->i_pipe; > f->f_pipe = 0; > > - res[0] = alloc_file_clone(f, O_RDONLY | (flags & O_NONBLOCK), > + res[0] = alloc_file_clone(f, O_RDONLY | (flags & (O_NONBLOCK | O_NOATIME)), > &pipefifo_fops); > if (IS_ERR(res[0])) { > put_pipe_info(inode, inode->i_pipe); > @@ -1154,7 +1157,7 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags) > int error; > int fdw, fdr; > > - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOTIFICATION_PIPE)) > + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_NOATIME 
| O_NOTIFICATION_PIPE)) > return -EINVAL; > > error = create_pipe_files(files, flags); > ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-03  9:05 ` K Prateek Nayak
@ 2025-02-04 13:49   ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-04 13:49 UTC (permalink / raw)
To: K Prateek Nayak, Linus Torvalds
Cc: Manfred Spraul, Christian Brauner, David Howells, WangYuli,
	linux-fsdevel, linux-kernel, Gautham R. Shenoy, Swapnil Sapkal,
	Neeraj Upadhyay

On 02/03, K Prateek Nayak wrote:
>
> With the below patch on mainline, I see more improvements for a
> modified version of sched-messaging (sched-messaging is same as
> hackbench as you noted on the parallel thread) that uses
> pipe2(O_NOATIME)

Thanks,

> The original regression is still noticeable despite the improvements
> but if folks believe this is a corner case with the original changes
> exhibited by sched-messaging, I'll just continue further testing with
> the new baseline.

I still don't know if we should worry or not... But if we want to try to
improve the wake_writer logic, then I think it makes sense to clean up
this code first.

IMO the (untested) patch below makes sense regardless, I am going to send
it after I grep fs/splice.c a bit more.

a194dfe6e6f6f ("pipe: Rearrange sequence in pipe_write() to preallocate
slot") changed pipe_write() to increment pipe->head in advance. IIUC to
avoid the race with the post_one_notification()-like code which can add
another buffer under pipe->rd_wait.lock without pipe->mutex.

This is no longer necessary after c73be61cede ("pipe: Add general
notification queue support"), pipe_write() checks pipe_has_watch_queue()
and returns -EXDEV at the start. And can't help in any case, pipe_write()
no longer takes this spinlock.

Change pipe_write() to call copy_page_from_iter() first and do nothing if
it fails. This way pipe_write() can't add a zero-sized buffer and we can
simplify pipe_read() which currently has to handle this very unlikely case.

Oleg.
diff --git a/fs/pipe.c b/fs/pipe.c index baaa8c0817f1..0816070a5e7a 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -312,6 +312,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) size_t written; int error; + WARN_ON_ONCE(chars == 0); if (chars > total_len) { if (buf->flags & PIPE_BUF_FLAG_WHOLE) { if (ret == 0) @@ -365,29 +366,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) break; } mutex_unlock(&pipe->mutex); - /* * We only get here if we didn't actually read anything. * - * However, we could have seen (and removed) a zero-sized - * pipe buffer, and might have made space in the buffers - * that way. - * - * You can't make zero-sized pipe buffers by doing an empty - * write (not even in packet mode), but they can happen if - * the writer gets an EFAULT when trying to fill a buffer - * that already got allocated and inserted in the buffer - * array. - * - * So we still need to wake up any pending writers in the - * _very_ unlikely case that the pipe was full, but we got - * no data. - */ - if (unlikely(wake_writer)) - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - - /* * But because we didn't read anything, at this point we can * just return directly with -ERESTARTSYS if we're interrupted, * since we've done any required wakeups and there's no need @@ -396,7 +377,6 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; - wake_writer = false; wake_next_reader = true; mutex_lock(&pipe->mutex); } @@ -524,31 +504,25 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) pipe->tmp_page = page; } - /* Allocate a slot in the ring in advance and attach an - * empty buffer. If we fault or otherwise fail to use - * it, either the reader will consume it or it'll still - * be there for the next write. 
- */ - pipe->head = head + 1; + copied = copy_page_from_iter(page, 0, PAGE_SIZE, from); + if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) { + if (!ret) + ret = -EFAULT; + break; + } + pipe->head = head + 1; + pipe->tmp_page = NULL; /* Insert it into the buffer array */ buf = &pipe->bufs[head & mask]; buf->page = page; buf->ops = &anon_pipe_buf_ops; buf->offset = 0; - buf->len = 0; if (is_packetized(filp)) buf->flags = PIPE_BUF_FLAG_PACKET; else buf->flags = PIPE_BUF_FLAG_CAN_MERGE; - pipe->tmp_page = NULL; - copied = copy_page_from_iter(page, 0, PAGE_SIZE, from); - if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) { - if (!ret) - ret = -EFAULT; - break; - } ret += copied; buf->len = copied; ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
  ` (2 preceding siblings ...)
  2025-01-31  9:49 ` K Prateek Nayak
@ 2025-02-24  9:26 ` Sapkal, Swapnil
  2025-02-24 14:24   ` Oleg Nesterov
  2025-02-27 12:50   ` Oleg Nesterov
  3 siblings, 2 replies; 109+ messages in thread
From: Sapkal, Swapnil @ 2025-02-24  9:26 UTC (permalink / raw)
To: Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells
Cc: WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy,
	Gautham Ranjal, Neeraj.Upadhyay

Hello Oleg,

On 1/2/2025 7:37 PM, Oleg Nesterov wrote:
> wake_up(pipe->wr_wait) makes no sense if pipe_full() is still true after
> the reading, the writer sleeping in wait_event(wr_wait, pipe_writable())
> will check the pipe_writable() == !pipe_full() condition and sleep again.
>
> Only wake the writer if we actually released a pipe buf, and the pipe was
> full before we did so.
>

We saw a hang in hackbench in our weekly regression testing on the
mainline kernel. The bisect pointed to this commit. This patch avoids the
unnecessary writer wakeup but I think there may be a subtle race due to
which the writer is never woken up in certain cases.

On a zen5 system with 2 sockets with 192C/384T each, I ran hackbench with
16 groups or 32 groups. In 1 out of 20 runs, the race condition is
occurring where the writer is not getting woken up and the benchmark
hangs. I tried reverting this commit and it again started working fine.

I also tried with
https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/. After
applying this patch, the frequency of the hang is reduced to 1 in 100
times, but the hang still exists.
Whenever I compare the case where was_full would have been set but wake_writer was not set, I see the following pattern: ret = 100 (Read was successful) pipe_full() = 1 total_len = 0 buf->len != 0 total_len is computed using iov_iter_count() while the buf->len is the length of the buffer corresponding to tail(pipe->bufs[tail & mask].len). Looking at pipe_write(), there seems to be a case where the writer can make progress when (chars && !was_empty) which only looks at iov_iter_count(). Could it be the case that there is still room in the buffer but we are not waking up the writer? > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > fs/pipe.c | 19 ++++++++++--------- > 1 file changed, 10 insertions(+), 9 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 12b22c2723b7..82fede0f2111 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -253,7 +253,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > size_t total_len = iov_iter_count(to); > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > - bool was_full, wake_next_reader = false; > + bool wake_writer = false, wake_next_reader = false; > ssize_t ret; > > /* Null read succeeds. */ > @@ -264,14 +264,13 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > mutex_lock(&pipe->mutex); > > /* > - * We only wake up writers if the pipe was full when we started > - * reading in order to avoid unnecessary wakeups. > + * We only wake up writers if the pipe was full when we started reading > + * and it is no longer full after reading to avoid unnecessary wakeups. > * > * But when we do wake up writers, we do so using a sync wakeup > * (WF_SYNC), because we want them to get going and generate more > * data for us. 
> */ > - was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > for (;;) { > /* Read ->head with a barrier vs post_one_notification() */ > unsigned int head = smp_load_acquire(&pipe->head); > @@ -340,8 +339,10 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > buf->len = 0; > } > > - if (!buf->len) > + if (!buf->len) { > + wake_writer |= pipe_full(head, tail, pipe->max_usage); > tail = pipe_update_tail(pipe, buf, tail); > + } > total_len -= chars; > if (!total_len) > break; /* common path: read succeeded */ > @@ -377,7 +378,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > * _very_ unlikely case that the pipe was full, but we got > * no data. > */ > - if (unlikely(was_full)) > + if (unlikely(wake_writer)) > wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > > @@ -390,15 +391,15 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > return -ERESTARTSYS; > > - mutex_lock(&pipe->mutex); > - was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > + wake_writer = false; > wake_next_reader = true; > + mutex_lock(&pipe->mutex); > } > if (pipe_empty(pipe->head, pipe->tail)) > wake_next_reader = false; > mutex_unlock(&pipe->mutex); > > - if (was_full) > + if (wake_writer) > wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > if (wake_next_reader) > wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); -- Thanks and Regards, Swapnil ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-24 9:26 ` Sapkal, Swapnil @ 2025-02-24 14:24 ` Oleg Nesterov 2025-02-24 18:36 ` Linus Torvalds ` (2 more replies) 2025-02-27 12:50 ` Oleg Nesterov 1 sibling, 3 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-02-24 14:24 UTC (permalink / raw) To: Sapkal, Swapnil Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay Hi Sapkal, On 02/24, Sapkal, Swapnil wrote: > > We saw hang in hackbench in our weekly regression testing on mainline > kernel. The bisect pointed to this commit. OMG. This patch caused a lot of "hackbench performance degradation" reports, but hang?? Just in case, did you use https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c ? OK, I gave up ;) I'll send the revert patch tomorrow (can't do this today) even if I still don't see how this patch can be wrong. > Whenever I compare the case where was_full would have been set but > wake_writer was not set, I see the following pattern: > > ret = 100 (Read was successful) > pipe_full() = 1 > total_len = 0 > buf->len != 0 > > total_len is computed using iov_iter_count() while the buf->len is the > length of the buffer corresponding to tail(pipe->bufs[tail & mask].len). > Looking at pipe_write(), there seems to be a case where the writer can make > progress when (chars && !was_empty) which only looks at iov_iter_count(). > Could it be the case that there is still room in the buffer but we are not > waking up the writer? I don't think so, but perhaps I am totally confused. If the writer sleeps on pipe->wr_wait, it has already tried to write into the pipe->bufs[head - 1] buffer before the sleep. Yes, the reader can read from that buffer, but this won't make it more "writable" for this particular writer, "PAGE_SIZE - buf->offset + buf->len" won't be changed. 
I even wrote the test-case, let me quote my old email below.

Thanks,

Oleg.

--------------------------------------------------------------------------------
Meanwhile I wrote a stupid test-case below. Without the patch

	State:	S (sleeping)
	voluntary_ctxt_switches:	74
	nonvoluntary_ctxt_switches:	5
	State:	S (sleeping)
	voluntary_ctxt_switches:	4169
	nonvoluntary_ctxt_switches:	5
	finally release the buffer
	wrote next char!

With the patch

	State:	S (sleeping)
	voluntary_ctxt_switches:	74
	nonvoluntary_ctxt_switches:	3
	State:	S (sleeping)
	voluntary_ctxt_switches:	74
	nonvoluntary_ctxt_switches:	3
	finally release the buffer
	wrote next char!

As you can see, without this patch pipe_read() wakes the writer up 4095
times for no reason, the writer burns a bit of CPU and blocks again after
wakeup until the last read(fd[0], &c, 1).

Oleg.

-------------------------------------------------------------------------------
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
	int fd[2], nb, cnt;
	char cmd[1024], c;

	assert(pipe(fd) == 0);

	nb = 1;
	assert(ioctl(fd[1], FIONBIO, &nb) == 0);
	while (write(fd[1], &c, 1) == 1);
	assert(errno == EAGAIN);
	nb = 0;
	assert(ioctl(fd[1], FIONBIO, &nb) == 0);

	// The pipe is full, the next write() will block.

	sprintf(cmd, "grep -e State -e ctxt_switches /proc/%d/status", getpid());
	if (!fork()) {
		// wait until the parent sleeps in pipe_write()
		usleep(10000);
		system(cmd);

		// trigger 4095 unnecessary wakeups
		for (cnt = 0; cnt < 4095; ++cnt) {
			assert(read(fd[0], &c, 1) == 1);
			usleep(1000);
		}
		system(cmd);

		// this should actually wake the writer
		printf("finally release the buffer\n");
		assert(read(fd[0], &c, 1) == 1);

		return 0;
	}

	assert(write(fd[1], &c, 1) == 1);
	printf("wrote next char!\n");

	return 0;
}

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-24 14:24 ` Oleg Nesterov @ 2025-02-24 18:36 ` Linus Torvalds 2025-02-25 14:26 ` Oleg Nesterov 2025-02-25 11:57 ` Oleg Nesterov 2025-02-26 13:18 ` Mateusz Guzik 2 siblings, 1 reply; 109+ messages in thread From: Linus Torvalds @ 2025-02-24 18:36 UTC (permalink / raw) To: Oleg Nesterov Cc: Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On Mon, 24 Feb 2025 at 06:25, Oleg Nesterov <oleg@redhat.com> wrote: > > OK, I gave up ;) I'll send the revert patch tomorrow (can't do this today) > even if I still don't see how this patch can be wrong. Let's think about this a bit before reverting. Because I think I see at least one possible issue.. With that commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full"), the rule for waking writers is pretty simple: we only wake a writer if we update the tail pointer (so that we made a new slot available) _and_ the pipe was full before we did that. And we do so while holding the pipe mutex, so we're guaranteed to be serialized with writers that are testing whether they can write (using the same pipe_full() logic). Finally - we delay the actual wakeup until we actually sleep or are done with the read(), and we don't hold the mutex at that point any more, but we have updated the tail pointer and released the mutex, so the writer is guaranteed to have either seen the updates, or will see our wakeup. All pretty simple and seems fool-proof, and the reader side logic would seem solid. But I think I see a potential problem. Because there's an additional subtlety: the pipe wakeup code not only wakes writers up only if it has freed an entry, it also does an EXCLUSIVE wakeup. Which means that the reader will only wake up *one* writer on the wait queue. 
And the *WRITER* side then will wake up any others when it has written,
but *that* logic is

 (a) wake up the next writer only if we were on the wait-queue (and could
     thus have been the sole recipient of a wakeup)

 (b) wake up the next writer only if the pipe isn't full

which also seems entirely sane. We must wake the next writer if we've
"used up" the wakeup, but only when it makes sense.

However, I see at least one case where this exclusive wakeup seems broken:

		/*
		 * But because we didn't read anything, at this point we can
		 * just return directly with -ERESTARTSYS if we're interrupted,
		 * since we've done any required wakeups and there's no need
		 * to mark anything accessed. And we've dropped the lock.
		 */
		if (wait_event_interruptible_exclusive(pipe->rd_wait,
					pipe_readable(pipe)) < 0)
			return -ERESTARTSYS;

and I'm wondering if the issue is that the *readers* got stuck, because
that "return -ERESTARTSYS" path now basically will by-pass the logic to
wake up the next exclusive waiter.

Because that "return -ERESTARTSYS" is *after* the reader has been on the
rd_wait queue - and possibly gotten the only wakeup that any of the
readers will ever get - and now it returns without waking up any other
reader.

So then the pipe stays full, because no readers are reading, even though
there's potentially tons of them.

And maybe the "we had tons of extra write wakeups" meant that this was a
pre-existing bug, but it was basically hidden by all the extra writers
being woken up, and in turn waking up the readers that got missed.

I dunno. This feels wrong.

And looking at the hackbench code, I don't see how it could actually be a
problem on *that* load, because I don't see any signals that could cause
that ERESTARTSYS case to happen, and if it did, the actual system call
restart should get it all going again.
So I think that early return is actually buggy, and I think that comment is wrong (because "we didn't read anything" doesn't mean that we might not need to finish up), but I don't see how this could really cause the reported problems. But maybe somebody sees some other subtle issue here. The writer side does *not* have that early return case. It also does that wait_event_interruptible_exclusive() thing, but it will always end up doing the "wake_next_writer" logic if it got to that point. The bug would have made more sense on the writer side. But I basically do wonder if there's some bad interaction with the whole "exclusive wait" logic and the "we now only wake up one single time". The fact that I found *one* thing that smells bad to me makes me think maybe there's another that I didn't see. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-24 18:36     ` Linus Torvalds
@ 2025-02-25 14:26       ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-25 14:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells,
	WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak,
	Shenoy, Gautham Ranjal, Neeraj.Upadhyay

On 02/24, Linus Torvalds wrote:
>
> However, I see at least one case where this exclusive wakeup seems broken:
>
>		/*
>		 * But because we didn't read anything, at this point we can
>		 * just return directly with -ERESTARTSYS if we're interrupted,
>		 * since we've done any required wakeups and there's no need
>		 * to mark anything accessed. And we've dropped the lock.
>		 */
>		if (wait_event_interruptible_exclusive(pipe->rd_wait,
>					pipe_readable(pipe)) < 0)
>			return -ERESTARTSYS;
>
> and I'm wondering if the issue is that the *readers* got stuck,
> because that "return -ERESTARTSYS" path now basically will by-pass the
> logic to wake up the next exclusive waiter.

I think this is fine... let's denote this reader as R.

> Because that "return -ERESTARTSYS" is *after* the reader has been on
> the rd_wait queue - and possibly gotten the only wakeup that any of
> the readers will ever get - and now it returns without waking up any
> other reader.

I think this can't happen. ___wait_event() does

	init_wait_entry(&__wq_entry, exclusive ? WQ_FLAG_EXCLUSIVE : 0);	\
	for (;;) {								\
		long __int = prepare_to_wait_event(&wq_head, &__wq_entry, state);\
										\
		if (condition)							\
			break;							\
										\
		if (___wait_is_interruptible(state) && __int) {			\
			__ret = __int;						\
			goto __out;						\
		}								\
										\
		cmd;								\
	}									\

and in this case condition == pipe_readable(pipe), cmd == schedule().

Suppose that R got that only wakeup, and wake_up() races with some signal
so that signal_pending(R) is true.
In this case prepare_to_wait_event() will return -ERESTARTSYS, but
___wait_event() won't return this error code, it will check
pipe_readable() and return 0.

After that R will restart the main loop with wake_next_reader = true,
and whatever it does, it should do wake_up(pipe->rd_wait) before
returning.

Note also that prepare_to_wait_event() removes the waiter from the
wait_queue_head->head list, so another wake_up() can't pick this task.

Can ___wait_event() miss the pipe_readable() event in this case? No,
both wake_up() and prepare_to_wait_event() take the same wq_head->lock.

What if pipe_readable() is actually false? Say, a spurious wakeup, or
pipe_write() does wake_up(rd_wait) when another reader has already made
the pipe_readable() condition false? This case looks "obviously fine"
too.

So I am still confused. I will wait for a reply from Sapkal, then I'll
try to make a debugging patch.

Oleg.

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-24 14:24 ` Oleg Nesterov 2025-02-24 18:36 ` Linus Torvalds @ 2025-02-25 11:57 ` Oleg Nesterov 2025-02-26 5:55 ` Sapkal, Swapnil 2025-03-03 13:00 ` Alexey Gladkov 2025-02-26 13:18 ` Mateusz Guzik 2 siblings, 2 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-02-25 11:57 UTC (permalink / raw) To: Sapkal, Swapnil Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On 02/24, Oleg Nesterov wrote: > > Just in case, did you use > > https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c > > ? Or did you use another version? Exactly what parameters did you use? If possible, please reproduce the hang again. How many threads/processes sleeping in pipe_read() or pipe_write() do you see? (you can look at /proc/$pid/stack). Please pick one sleeping writer, and do $ strace -p pidof_that_write this should wake this writer up. If a missed wakeup is the only problem, hackbench should continue. The more info you can provide the better ;) Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-25 11:57 ` Oleg Nesterov @ 2025-02-26 5:55 ` Sapkal, Swapnil 2025-02-26 11:38 ` Oleg Nesterov 2025-03-03 13:00 ` Alexey Gladkov 1 sibling, 1 reply; 109+ messages in thread From: Sapkal, Swapnil @ 2025-02-26 5:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay Hi Oleg, On 2/25/2025 5:27 PM, Oleg Nesterov wrote: > On 02/24, Oleg Nesterov wrote: >> >> Just in case, did you use >> >> https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c >> >> ? > > Or did you use another version? > I am running hackbench using lkp-tests which downloads hackbench source from same rt-tests with version 2.8. https://github.com/intel/lkp-tests.git https://www.kernel.org/pub/linux/utils/rt-tests/rt-tests-2.8.tar.gz > Exactly what parameters did you use? > Exact command with parameters is /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 100 > If possible, please reproduce the hang again. How many threads/processes > sleeping in pipe_read() or pipe_write() do you see? (you can look at > /proc/$pid/stack). > In the latest hang, I saw 37 threads sleeping out of which 20 were sleeping in pipe_read() and 17 in pipe_write(). 
Main hackbench thread (which spawns the readers and writers) has the
following stack trace:

	[<0>] futex_wait_queue+0x6e/0x90
	[<0>] __futex_wait+0x143/0x1c0
	[<0>] futex_wait+0x69/0x110
	[<0>] do_futex+0x147/0x1d0
	[<0>] __x64_sys_futex+0x7c/0x1e0
	[<0>] x64_sys_call+0x207a/0x2140
	[<0>] do_syscall_64+0x6f/0x110
	[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The readers have the following pipe_read stack trace:

	[<0>] pipe_read+0x338/0x460
	[<0>] vfs_read+0x308/0x350
	[<0>] ksys_read+0xcc/0xe0
	[<0>] __x64_sys_read+0x1d/0x30
	[<0>] x64_sys_call+0x1b89/0x2140
	[<0>] do_syscall_64+0x6f/0x110
	[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

The writers have the following pipe_write stack trace:

	[<0>] pipe_write+0x370/0x630
	[<0>] vfs_write+0x378/0x420
	[<0>] ksys_write+0xcc/0xe0
	[<0>] __x64_sys_write+0x1d/0x30
	[<0>] x64_sys_call+0x16b3/0x2140
	[<0>] do_syscall_64+0x6f/0x110
	[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

> Please pick one sleeping writer, and do
>
> $ strace -p pidof_that_write
>
> this should wake this writer up. If a missed wakeup is the only problem,
> hackbench should continue.
>

I tried waking one of the writers and the benchmark progressed and
completed successfully.

> The more info you can provide the better ;)
>
> Oleg.
>

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-26 5:55 ` Sapkal, Swapnil @ 2025-02-26 11:38 ` Oleg Nesterov 2025-02-26 17:56 ` Sapkal, Swapnil 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-02-26 11:38 UTC (permalink / raw) To: Sapkal, Swapnil Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay Thanks Sapkal! I'll try to think. Meanwhile, On 02/26, Sapkal, Swapnil wrote: > > Exact command with parameters is > > /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 100 Can you reproduce with "--process" rather than "--threads" ? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-26 11:38             ` Oleg Nesterov
@ 2025-02-26 17:56               ` Sapkal, Swapnil
  2025-02-26 18:12                 ` Oleg Nesterov
  0 siblings, 1 reply; 109+ messages in thread
From: Sapkal, Swapnil @ 2025-02-26 17:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells,
	WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak,
	Shenoy, Gautham Ranjal, Neeraj.Upadhyay, mjguzik

Hi Oleg,

On 2/26/2025 5:08 PM, Oleg Nesterov wrote:
> Thanks Sapkal!
>
> I'll try to think. Meanwhile,
>
> On 02/26, Sapkal, Swapnil wrote:
>>
>> Exact command with parameters is
>>
>> /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 100
>
> Can you reproduce with "--process" rather than "--threads" ?
>

I was able to reproduce the issue with processes also. Total 33 processes
were sleeping, out of which 20 were readers and 13 were writers.

The stack trace for the main hackbench process is as follows:

	[<0>] do_wait+0xb5/0x110
	[<0>] kernel_wait4+0xb2/0x150
	[<0>] __do_sys_wait4+0x89/0xa0
	[<0>] __x64_sys_wait4+0x20/0x30
	[<0>] x64_sys_call+0x1bf7/0x2140
	[<0>] do_syscall_64+0x6f/0x110
	[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

I am trying to reproduce the issue with the suggestions from Mateusz.

> Oleg.
>

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-26 17:56 ` Sapkal, Swapnil @ 2025-02-26 18:12 ` Oleg Nesterov 0 siblings, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-02-26 18:12 UTC (permalink / raw) To: Sapkal, Swapnil Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, mjguzik On 02/26, Sapkal, Swapnil wrote: > > >Can you reproduce with "--process" rather than "--threads" ? > > > I was able to reproduce the issue with processes also. Total 33 processes > were sleeping out of which 20 were readers and 13 were writers. Thanks a lot. I am wondering what makes your machine (or .config?) special ;) A lot of people and robots tested this patch with these options. > The stack trace for main hackbench process is as follows: > > [<0>] do_wait+0xb5/0x110 Yes, this is clear, the main thread/process is not interesting. > I am trying to reproduce the issue with suggestions by Mateusz. Great. So far I still have no clue. Most probably I will ask you to do more testing after that, perhaps with some debugging patches. Thanks again for your help, Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-25 11:57 ` Oleg Nesterov 2025-02-26 5:55 ` Sapkal, Swapnil @ 2025-03-03 13:00 ` Alexey Gladkov 2025-03-03 15:46 ` K Prateek Nayak 1 sibling, 1 reply; 109+ messages in thread From: Alexey Gladkov @ 2025-03-03 13:00 UTC (permalink / raw) To: Oleg Nesterov Cc: Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On Tue, Feb 25, 2025 at 12:57:37PM +0100, Oleg Nesterov wrote: > On 02/24, Oleg Nesterov wrote: > > > > Just in case, did you use > > > > https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c > > > > ? > > Or did you use another version? > > Exactly what parameters did you use? > > If possible, please reproduce the hang again. How many threads/processes > sleeping in pipe_read() or pipe_write() do you see? (you can look at > /proc/$pid/stack). > > Please pick one sleeping writer, and do > > $ strace -p pidof_that_write > > this should wake this writer up. If a missed wakeup is the only problem, > hackbench should continue. > > The more info you can provide the better ;) I was also able to reproduce the hackbench hang with the parameters mentioned earlier (threads and processes) on the kernel from master. -- Rgrds, legion ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 13:00 ` Alexey Gladkov @ 2025-03-03 15:46 ` K Prateek Nayak 2025-03-03 17:18 ` Alexey Gladkov 0 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-03 15:46 UTC (permalink / raw) To: Alexey Gladkov, Oleg Nesterov Cc: Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay Hello Legion, On 3/3/2025 6:30 PM, Alexey Gladkov wrote: > On Tue, Feb 25, 2025 at 12:57:37PM +0100, Oleg Nesterov wrote: >> On 02/24, Oleg Nesterov wrote: >>> >>> Just in case, did you use >>> >>> https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c >>> >>> ? >> >> Or did you use another version? >> >> Exactly what parameters did you use? >> >> If possible, please reproduce the hang again. How many threads/processes >> sleeping in pipe_read() or pipe_write() do you see? (you can look at >> /proc/$pid/stack). >> >> Please pick one sleeping writer, and do >> >> $ strace -p pidof_that_write >> >> this should wake this writer up. If a missed wakeup is the only problem, >> hackbench should continue. >> >> The more info you can provide the better ;) > > I was also able to reproduce the hackbench hang with the parameters > mentioned earlier (threads and processes) on the kernel from master. > Thank you for reporting your observations! 
If you are able to reproduce it reliably, could you please give the
below diff posted by Swapnil from the parallel thread [1] a try:

diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
 	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	unsigned int head, tail;
+
+	tail = READ_ONCE(pipe->tail);
+	/*
+	 * Since the unsigned arithmetic in this lockless preemptible context
+	 * relies on the fact that the tail can never be ahead of head, read
+	 * the head after the tail to ensure we've not missed any updates to
+	 * the head. Reordering the reads can cause wraparounds and give the
+	 * illusion that the pipe is full.
+	 */
+	smp_rmb();
+	head = READ_ONCE(pipe->head);
 
 	return !pipe_full(head, tail, max_usage) ||
 	       !READ_ONCE(pipe->readers);
---

We've been running hackbench for a while now with the above diff and we
haven't run into a hang yet. Sorry for the troubles and thank you again.

[1] https://lore.kernel.org/all/03a1f4af-47e0-459d-b2bf-9f65536fc2ab@amd.com/

--
Thanks and Regards,
Prateek

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 15:46 ` K Prateek Nayak @ 2025-03-03 17:18 ` Alexey Gladkov 0 siblings, 0 replies; 109+ messages in thread From: Alexey Gladkov @ 2025-03-03 17:18 UTC (permalink / raw) To: K Prateek Nayak Cc: Oleg Nesterov, Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On Mon, Mar 03, 2025 at 09:16:08PM +0530, K Prateek Nayak wrote: > Hello Legion, > > On 3/3/2025 6:30 PM, Alexey Gladkov wrote: > > On Tue, Feb 25, 2025 at 12:57:37PM +0100, Oleg Nesterov wrote: > >> On 02/24, Oleg Nesterov wrote: > >>> > >>> Just in case, did you use > >>> > >>> https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/tree/src/hackbench/hackbench.c > >>> > >>> ? > >> > >> Or did you use another version? > >> > >> Exactly what parameters did you use? > >> > >> If possible, please reproduce the hang again. How many threads/processes > >> sleeping in pipe_read() or pipe_write() do you see? (you can look at > >> /proc/$pid/stack). > >> > >> Please pick one sleeping writer, and do > >> > >> $ strace -p pidof_that_write > >> > >> this should wake this writer up. If a missed wakeup is the only problem, > >> hackbench should continue. > >> > >> The more info you can provide the better ;) > > > > I was also able to reproduce the hackbench hang with the parameters > > mentioned earlier (threads and processes) on the kernel from master. > > > > Thank you for reporting your observations! 
>
> If you are able to reproduce it reliably, could you please give the
> below diff posted by Swapnil from the parallel thread [1] a try:
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index ce1af7592780..a1931c817822 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
> +	unsigned int head, tail;
> +
> +	tail = READ_ONCE(pipe->tail);
> +	/*
> +	 * Since the unsigned arithmetic in this lockless preemptible context
> +	 * relies on the fact that the tail can never be ahead of head, read
> +	 * the head after the tail to ensure we've not missed any updates to
> +	 * the head. Reordering the reads can cause wraparounds and give the
> +	 * illusion that the pipe is full.
> +	 */
> +	smp_rmb();
> +	head = READ_ONCE(pipe->head);
>
>  	return !pipe_full(head, tail, max_usage) ||
>  	       !READ_ONCE(pipe->readers);
> ---
>
> We've been running hackbench for a while now with the above diff and we
> haven't run into a hang yet. Sorry for the troubles and thank you again.

No problem at all.

Along with the patch above, I tried reproducing the problem for about 40
minutes and no hangs were found. Before that, hackbench would hang within
5 minutes or so.

--
Rgrds, legion

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-24 14:24 ` Oleg Nesterov 2025-02-24 18:36 ` Linus Torvalds 2025-02-25 11:57 ` Oleg Nesterov @ 2025-02-26 13:18 ` Mateusz Guzik 2025-02-26 13:21 ` Mateusz Guzik 2025-02-27 16:18 ` Sapkal, Swapnil 2 siblings, 2 replies; 109+ messages in thread From: Mateusz Guzik @ 2025-02-26 13:18 UTC (permalink / raw) To: Oleg Nesterov Cc: Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On Mon, Feb 24, 2025 at 03:24:32PM +0100, Oleg Nesterov wrote: > On 02/24, Sapkal, Swapnil wrote: > > Whenever I compare the case where was_full would have been set but > > wake_writer was not set, I see the following pattern: > > > > ret = 100 (Read was successful) > > pipe_full() = 1 > > total_len = 0 > > buf->len != 0 > > > > total_len is computed using iov_iter_count() while the buf->len is the > > length of the buffer corresponding to tail(pipe->bufs[tail & mask].len). > > Looking at pipe_write(), there seems to be a case where the writer can make > > progress when (chars && !was_empty) which only looks at iov_iter_count(). > > Could it be the case that there is still room in the buffer but we are not > > waking up the writer? > > I don't think so, but perhaps I am totally confused. > > If the writer sleeps on pipe->wr_wait, it has already tried to write into > the pipe->bufs[head - 1] buffer before the sleep. > > Yes, the reader can read from that buffer, but this won't make it more "writable" > for this particular writer, "PAGE_SIZE - buf->offset + buf->len" won't be changed. While I think the now-removed wakeup was indeed hiding a bug, I also think the write thing pointed out above is a fair point (orthogonal though). The initial call to pipe_write allows for appending to an existing page. 
However, should the pipe be full, the loop which follows it insists on allocating a new one and waits for a slot, even if ultimately *there is* space now. The hackbench invocation used here passes around 100 bytes. Both readers and writers do rounds over pipes issuing 100 byte-sized ops. Suppose the pipe does not have space to hold the extra 100 bytes. The writer goes to sleep and waits for the tail to move. A reader shows up, reads 100 bytes (now there is space!) but since the current buf was not depleted it does not mess with the tail. The bench spawns tons of threads, ensuring there is a lot of competition for the cpu time. The reader might get just enough time to largely deplete the pipe to a point where there is only one buf in there with space in it. Should pipe_write() be invoked now it would succeed appending to a page. But if the writer was already asleep, it is going to insist on allocating a new page. As for the bug, I don't see anything obvious myself. However, I think there are 2 avenues which warrant checking. Sapkal, if you have time, can you please boot up the kernel which is more likely to run into the problem and then run hackbench as follows: 1. with 1 fd instead of 20: /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100 2. with a size which divides 4096 evenly (e.g., 128): /usr/bin/hackbench -g 1 -f 20 --threads --pipe -l 100000 -s 128 ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-26 13:18 ` Mateusz Guzik @ 2025-02-26 13:21 ` Mateusz Guzik 2025-02-26 17:16 ` Oleg Nesterov 2025-02-27 16:18 ` Sapkal, Swapnil 1 sibling, 1 reply; 109+ messages in thread From: Mateusz Guzik @ 2025-02-26 13:21 UTC (permalink / raw) To: Oleg Nesterov Cc: Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On Wed, Feb 26, 2025 at 2:19 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Mon, Feb 24, 2025 at 03:24:32PM +0100, Oleg Nesterov wrote: > > On 02/24, Sapkal, Swapnil wrote: > > > Whenever I compare the case where was_full would have been set but > > > wake_writer was not set, I see the following pattern: > > > > > > ret = 100 (Read was successful) > > > pipe_full() = 1 > > > total_len = 0 > > > buf->len != 0 > > > > > > total_len is computed using iov_iter_count() while the buf->len is the > > > length of the buffer corresponding to tail(pipe->bufs[tail & mask].len). > > > Looking at pipe_write(), there seems to be a case where the writer can make > > > progress when (chars && !was_empty) which only looks at iov_iter_count(). > > > Could it be the case that there is still room in the buffer but we are not > > > waking up the writer? > > > > I don't think so, but perhaps I am totally confused. > > > > If the writer sleeps on pipe->wr_wait, it has already tried to write into > > the pipe->bufs[head - 1] buffer before the sleep. > > > > Yes, the reader can read from that buffer, but this won't make it more "writable" > > for this particular writer, "PAGE_SIZE - buf->offset + buf->len" won't be changed. > > While I think the now-removed wakeup was indeed hiding a bug, I also > think the write thing pointed out above is a fair point (orthogonal > though). > > The initial call to pipe_write allows for appending to an existing page. 
>
> However, should the pipe be full, the loop which follows it insists on
> allocating a new one and waits for a slot, even if ultimately *there is*
> space now.
>
> The hackbench invocation used here passes around 100 bytes.
>
> Both readers and writers do rounds over pipes issuing 100 byte-sized
> ops.
>
> Suppose the pipe does not have space to hold the extra 100 bytes. The
> writer goes to sleep and waits for the tail to move. A reader shows up,
> reads 100 bytes (now there is space!) but since the current buf was not
> depleted it does not mess with the tail.
>
> The bench spawns tons of threads, ensuring there is a lot of competition
> for the cpu time. The reader might get just enough time to largely
> deplete the pipe to a point where there is only one buf in there with
> space in it. Should pipe_write() be invoked now it would succeed
> appending to a page. But if the writer was already asleep, it is going
> to insist on allocating a new page.

Now that I sent the e-mail, I realized the page would have unread data
after some offset, so there is no room to *append* to it, unless one
wants to memmove everything back. Please ignore this bit :P

However, the suggestion below stands:

> As for the bug, I don't see anything obvious myself.
>
> However, I think there are 2 avenues which warrant checking.
>
> Sapkal, if you have time, can you please boot up the kernel which is
> more likely to run into the problem and then run hackbench as follows:
>
> 1. with 1 fd instead of 20:
>
> /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100
>
> 2. with a size which divides 4096 evenly (e.g., 128):
>
> /usr/bin/hackbench -g 1 -f 20 --threads --pipe -l 100000 -s 128

--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-02-26 13:21 ` Mateusz Guzik @ 2025-02-26 17:16 ` Oleg Nesterov 0 siblings, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-02-26 17:16 UTC (permalink / raw) To: Mateusz Guzik Cc: Sapkal, Swapnil, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay On 02/26, Mateusz Guzik wrote: > > On Wed, Feb 26, 2025 at 2:19 PM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > Now that I sent the e-mail, I realized the page would have unread data > after some offset, so there is no room to *append* to it, unless one > wants to memmove everythiing back. Yes, but... even "memmove everything back" won't help if pipe->ring_size > 1 (PIPE_DEF_BUFFERS == 16 by default). > However, the suggestion below stands: Agreed, any additional info can help. Oleg. > > As for the bug, I don't see anything obvious myself. > > > > However, I think there are 2 avenues which warrant checking. > > > > Sapkal, if you have time, can you please boot up the kernel which is > > more likely to run into the problem and then run hackbench as follows: > > > > 1. with 1 fd instead of 20: > > > > /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100 > > > > 2. with a size which divides 4096 evenly (e.g., 128): > > > > /usr/bin/hackbench -g 1 -f 20 --threads --pipe -l 100000 -s 128 > > > > -- > Mateusz Guzik <mjguzik gmail.com> > ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-02-26 13:18 ` Mateusz Guzik
2025-02-26 13:21 ` Mateusz Guzik
@ 2025-02-27 16:18 ` Sapkal, Swapnil
2025-02-27 16:34 ` Mateusz Guzik
2025-02-27 21:12 ` Oleg Nesterov
1 sibling, 2 replies; 109+ messages in thread
From: Sapkal, Swapnil @ 2025-02-27 16:18 UTC (permalink / raw)
To: Mateusz Guzik, Oleg Nesterov
Cc: Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

Hi Mateusz,

On 2/26/2025 6:48 PM, Mateusz Guzik wrote:
> On Mon, Feb 24, 2025 at 03:24:32PM +0100, Oleg Nesterov wrote:
>> On 02/24, Sapkal, Swapnil wrote:
>>> Whenever I compare the case where was_full would have been set but
>>> wake_writer was not set, I see the following pattern:
>>>
>>> ret = 100 (Read was successful)
>>> pipe_full() = 1
>>> total_len = 0
>>> buf->len != 0
>>>
>>> total_len is computed using iov_iter_count() while buf->len is the
>>> length of the buffer corresponding to the tail (pipe->bufs[tail & mask].len).
>>> Looking at pipe_write(), there seems to be a case where the writer can make
>>> progress when (chars && !was_empty), which only looks at iov_iter_count().
>>> Could it be the case that there is still room in the buffer but we are not
>>> waking up the writer?
>>
>> I don't think so, but perhaps I am totally confused.
>>
>> If the writer sleeps on pipe->wr_wait, it has already tried to write into
>> the pipe->bufs[head - 1] buffer before the sleep.
>>
>> Yes, the reader can read from that buffer, but this won't make it more
>> "writable" for this particular writer: "PAGE_SIZE - (buf->offset + buf->len)"
>> won't be changed.
>
> While I think the now-removed wakeup was indeed hiding a bug, I also
> think the write behaviour pointed out above is a fair point (orthogonal
> though).
>
> The initial call to pipe_write allows for appending to an existing page.
>
> However, should the pipe be full, the loop which follows it insists on
> allocating a new one and waits for a slot, even if ultimately *there is*
> space now.
>
> The hackbench invocation used here passes around 100 bytes.
>
> Both readers and writers do rounds over pipes issuing 100-byte-sized
> ops.
>
> Suppose the pipe does not have space to hold the extra 100 bytes. The
> writer goes to sleep and waits for the tail to move. A reader shows up,
> reads 100 bytes (now there is space!) but since the current buf was not
> depleted it does not mess with the tail.
>
> The bench spawns tons of threads, ensuring there is a lot of competition
> for the cpu time. The reader might get just enough time to largely
> deplete the pipe to a point where there is only one buf in there with
> space in it. Should pipe_write() be invoked now, it would succeed in
> appending to a page. But if the writer was already asleep, it is going
> to insist on allocating a new page.
>
> As for the bug, I don't see anything obvious myself.
>
> However, I think there are 2 avenues which warrant checking.
>
> Sapkal, if you have time, can you please boot up the kernel which is
> more likely to run into the problem and then run hackbench as follows:
>

I tried reproducing the issue with both the scenarios mentioned below.

> 1. with 1 fd instead of 20:
>
> /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100
>

With this I was not able to reproduce the issue. I tried almost 5000
iterations.

> 2. with a size which divides 4096 evenly (e.g., 128):
>
> /usr/bin/hackbench -g 1 -f 20 --threads --pipe -l 100000 -s 128

I was not able to reproduce the issue with 1 group, but I thought you
wanted to change only the message size to 128 bytes. When I retain the
number of groups at 16 and change the message size to 128, it took me
around 150 iterations to reproduce this issue (with 100 bytes it was 20
iterations). The exact command was

/usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 128

I will try to sprinkle some trace_printk's in the code where the state
of the pipe changes. I will report here if I find something.

--
Thanks and Regards,
Swapnil

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-02-27 16:18 ` Sapkal, Swapnil
@ 2025-02-27 16:34 ` Mateusz Guzik
2025-02-27 21:12 ` Oleg Nesterov
1 sibling, 0 replies; 109+ messages in thread
From: Mateusz Guzik @ 2025-02-27 16:34 UTC (permalink / raw)
To: Sapkal, Swapnil
Cc: Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

On Thu, Feb 27, 2025 at 5:20 PM Sapkal, Swapnil <swapnil.sapkal@amd.com> wrote:
> I tried reproducing the issue with both the scenarios mentioned below.
>
> > 1. with 1 fd instead of 20:
> >
> > /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100
>
> With this I was not able to reproduce the issue. I tried almost 5000
> iterations.
>

Ok, noted.

> > 2. with a size which divides 4096 evenly (e.g., 128):
> >
> > /usr/bin/hackbench -g 1 -f 20 --threads --pipe -l 100000 -s 128
>
> I was not able to reproduce the issue with 1 group. But I thought you
> wanted to change only the message size to 128 bytes.

Yes indeed, thanks for catching the problem.

> When I retain the number of groups at 16 and change the message size to
> 128, it took me around 150 iterations to reproduce this issue (with 100
> bytes it was 20 iterations). The exact command was
>
> /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 128
>
> I will try to sprinkle some trace_printk's in the code where the state
> of the pipe changes. I will report here if I find something.
>

Thanks.

So to be clear, this is Oleg's bug; I am only looking on from the side
out of curiosity about what's up. As it usually goes with these, after
the dust settles I very much expect the fix will be roughly a one-liner. :)

--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-02-27 16:18 ` Sapkal, Swapnil
2025-02-27 16:34 ` Mateusz Guzik
@ 2025-02-27 21:12 ` Oleg Nesterov
2025-02-28 5:58 ` Sapkal, Swapnil
1 sibling, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-27 21:12 UTC (permalink / raw)
To: Sapkal, Swapnil
Cc: Mateusz Guzik, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

Sapkal, first of all, thanks again!

On 02/27, Sapkal, Swapnil wrote:
>
> > 1. with 1 fd instead of 20:
> >
> > /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100
>
> With this I was not able to reproduce the issue. I tried almost 5000
> iterations.

OK,

> > 2. with a size which divides 4096 evenly (e.g., 128):
...
> When I retain the number of
> groups to 16 and change the message size to 128, it took me around 150
> iterations to reproduce this issue (with 100 bytes it was 20 iterations).
> The exact command was
>
> /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 128

Ah, good. This is good ;)

> I will try to sprinkle some trace_printk's in the code where the state of
> the pipe changes. I will report here if I find something.

Great! but...

Sapkal, I was going to finish (and test! ;) the patch below tomorrow, after
you test the previous debugging patch I sent in this thread. But since you
are going to change the kernel...

For the moment, please forget about that (as Mateusz pointed out, buggy)
patch. Could you apply the patch below and reproduce the problem?

If yes, please do prctl(666) after the hang and send us the output from
dmesg, between "DUMP START" and "DUMP END". You can just do

$ perl -e 'syscall 157,666'

to call prctl(666) and trigger the dump.

Oleg.
---

diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..566c75a0ff81 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -376,6 +376,8 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	}
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
+	if (ret > 0)
+		pipe->r_cnt++;
 	mutex_unlock(&pipe->mutex);
 
 	if (wake_writer)
@@ -565,6 +567,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 out:
 	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 		wake_next_writer = false;
+	if (ret > 0)
+		pipe->w_cnt++;
 	mutex_unlock(&pipe->mutex);
 
 	/*
@@ -695,6 +699,42 @@ pipe_poll(struct file *filp, poll_table *wait)
 	return mask;
 }
 
+static DEFINE_MUTEX(PI_MUTEX);
+static LIST_HEAD(PI_LIST);
+
+void pi_dump(void);
+void pi_dump(void)
+{
+	struct pipe_inode_info *pipe;
+
+	pr_crit("---------- DUMP START ----------\n");
+	mutex_lock(&PI_MUTEX);
+	list_for_each_entry(pipe, &PI_LIST, pi_list) {
+		unsigned head, tail;
+
+		mutex_lock(&pipe->mutex);
+		head = pipe->head;
+		tail = pipe->tail;
+		pr_crit("E=%d F=%d; W=%d R=%d\n",
+			pipe_empty(head, tail), pipe_full(head, tail, pipe->max_usage),
+			pipe->w_cnt, pipe->r_cnt);
+
+// INCOMPLETE
+pr_crit("RD=%d WR=%d\n", waitqueue_active(&pipe->rd_wait), waitqueue_active(&pipe->wr_wait));
+
+		for (; tail < head; tail++) {
+			struct pipe_buffer *buf = pipe_buf(pipe, tail);
+			WARN_ON(buf->ops != &anon_pipe_buf_ops);
+			pr_crit("buf: o=%d l=%d\n", buf->offset, buf->len);
+		}
+		pr_crit("\n");
+
+		mutex_unlock(&pipe->mutex);
+	}
+	mutex_unlock(&PI_MUTEX);
+	pr_crit("---------- DUMP END ------------\n");
+}
+
 static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
 {
 	int kill = 0;
@@ -706,8 +746,14 @@ static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
 	}
 	spin_unlock(&inode->i_lock);
 
-	if (kill)
+	if (kill) {
+		if (!list_empty(&pipe->pi_list)) {
+			mutex_lock(&PI_MUTEX);
+			list_del_init(&pipe->pi_list);
+			mutex_unlock(&PI_MUTEX);
+		}
 		free_pipe_info(pipe);
+	}
 }
 
 static int
@@ -790,6 +836,13 @@ struct pipe_inode_info *alloc_pipe_info(void)
 	if (pipe == NULL)
 		goto out_free_uid;
 
+	INIT_LIST_HEAD(&pipe->pi_list);
+	if (!strcmp(current->comm, "hackbench")) {
+		mutex_lock(&PI_MUTEX);
+		list_add_tail(&pipe->pi_list, &PI_LIST);
+		mutex_unlock(&PI_MUTEX);
+	}
+
 	if (pipe_bufs * PAGE_SIZE > max_size && !capable(CAP_SYS_RESOURCE))
 		pipe_bufs = max_size >> PAGE_SHIFT;
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..48d9bf5171dc 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -80,6 +80,9 @@ struct pipe_inode_info {
 #ifdef CONFIG_WATCH_QUEUE
 	struct watch_queue *watch_queue;
 #endif
+
+	struct list_head pi_list;
+	unsigned w_cnt, r_cnt;
 };
 
 /*
diff --git a/kernel/sys.c b/kernel/sys.c
index 4efca8a97d62..a85e34861b2e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2483,6 +2483,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case 666: {
+		extern void pi_dump(void);
+		pi_dump();
+		break;
+	}
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-02-27 21:12 ` Oleg Nesterov
@ 2025-02-28 5:58 ` Sapkal, Swapnil
2025-02-28 14:30 ` Oleg Nesterov
0 siblings, 1 reply; 109+ messages in thread
From: Sapkal, Swapnil @ 2025-02-28 5:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Mateusz Guzik, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

[-- Attachment #1: Type: text/plain, Size: 5245 bytes --]

Hi Oleg,

On 2/28/2025 2:42 AM, Oleg Nesterov wrote:
> Sapkal, first of all, thanks again!
>
> On 02/27, Sapkal, Swapnil wrote:
>>
>>> 1. with 1 fd instead of 20:
>>>
>>> /usr/bin/hackbench -g 16 -f 1 --threads --pipe -l 100000 -s 100
>>
>> With this I was not able to reproduce the issue. I tried almost 5000
>> iterations.
>
> OK,
>
>>> 2. with a size which divides 4096 evenly (e.g., 128):
> ...
>> When I retain the number of
>> groups to 16 and change the message size to 128, it took me around 150
>> iterations to reproduce this issue (with 100 bytes it was 20 iterations).
>> The exact command was
>>
>> /usr/bin/hackbench -g 16 -f 20 --threads --pipe -l 100000 -s 128
>
> Ah, good. This is good ;)
>
>> I will try to sprinkle some trace_printk's in the code where the state of
>> the pipe changes. I will report here if I find something.
>
> Great! but...
>
> Sapkal, I was going to finish (and test! ;) the patch below tomorrow, after
> you test the previous debugging patch I sent in this thread. But since you
> are going to change the kernel...
>
> For the moment, please forget about that (as Mateusz pointed out, buggy)
> patch. Could you apply the patch below and reproduce the problem?
>

Yes, I was able to reproduce the problem with the below patch.

> If yes, please do prctl(666) after the hang and send us the output from
> dmesg, between "DUMP START" and "DUMP END". You can just do
>
> $ perl -e 'syscall 157,666'
>
> to call prctl(666) and trigger the dump.
>

I found a case in the dump where the pipe is empty, yet both the reader
and the writer are waiting on it:

[ 1397.829761] E=1 F=0; W=1719147 R=1719147
[ 1397.837843] RD=1 WR=1

The full dump is attached below.

> Oleg.
> ---
>
> [... Oleg's debugging patch quoted in full, snipped; see it upthread ...]

--
Thanks and Regards,
Swapnil

[-- Attachment #2: dump --]
[-- Type: text/plain, Size: 22644 bytes --]

[ 1394.383241] ---------- DUMP START ----------
[ 1394.388211] E=1 F=0; W=640 R=640
[ 1394.392001] RD=0 WR=0
[ 1394.396300] E=0 F=0; W=1 R=0
[ 1394.399625] RD=0 WR=0
[ 1394.402219] buf: o=0 l=1
[ 1394.406824] E=1 F=0; W=2000000 R=2000000
[ 1394.411322] RD=0 WR=0

[... several hundred identical "E=1 F=0; W=2000000 R=2000000" /
"RD=0 WR=0" entries (empty pipes with no waiters) elided ...]

[ 1397.810140] E=1 F=0; W=1719158 R=1719158
[ 1397.818226] RD=1 WR=0
[ 1397.829761] E=1 F=0; W=1719147 R=1719147
[ 1397.837843] RD=1 WR=1
[ 1397.849382] E=1 F=0; W=1719147 R=1719147
[ 1397.857465] RD=1 WR=0

[... further "E=1 F=0; W=1719147 R=1719147" / "RD=1 WR=0" entries for the
stuck group, followed by more "W=2000000 R=2000000" entries; the capture
is truncated here ...]
[ 1400.529517] RD=0 WR=0 [ 1400.541063] E=1 F=0; W=2000000 R=2000000 [ 1400.549149] RD=0 WR=0 [ 1400.560688] ---------- DUMP END ------------ ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-28  5:58           ` Sapkal, Swapnil
@ 2025-02-28 14:30             ` Oleg Nesterov
  2025-02-28 16:33               ` Oleg Nesterov
  0 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-28 14:30 UTC (permalink / raw)
  To: Sapkal, Swapnil
  Cc: Mateusz Guzik, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells, WangYuli, linux-fsdevel, linux-kernel,
	K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

On 02/28, Sapkal, Swapnil wrote:
>
> Yes, I was able to reproduce the problem with the below patch.
...
> I found a case in the dump where the pipe is empty still both reader and
> writer are waiting on it.
>
> [ 1397.829761] E=1 F=0; W=1719147 R=1719147
> [ 1397.837843] RD=1 WR=1

Thanks! and I see no more "WR=1" in the full dump. This means that all
live writes hang on the same pipe.

So maybe the trivial program below can too reproduce the problem on your
machine?? Say, with GROUPS=16 and WRITERS=20 ... or maybe even with
GROUPS=1 and WRITERS=320 ...

Oleg.
-------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>

static int GROUPS, WRITERS;
static volatile int ALIVE[1024];

void *group(void *arg)
{
	int fd[2], n, id = (long)arg;
	char buf[100];

	assert(pipe(fd) == 0);

	for (n = 0; n < WRITERS; ++n) {
		int pid = fork();
		assert(pid >= 0);
		if (pid)
			continue;

		close(fd[0]);
		for (;;)
			assert(write(fd[1], buf, sizeof(buf)) == sizeof(buf));
	}

	for (;;) {
		assert(read(fd[0], buf, sizeof(buf)) == sizeof(buf));
		ALIVE[id] = 1;
	}
}

int main(int argc, const char *argv[])
{
	pthread_t pt;
	int n;

	assert(argc == 3);
	GROUPS = atoi(argv[1]);
	WRITERS = atoi(argv[2]);
	assert(GROUPS <= 1024);

	for (n = 0; n < GROUPS; ++n)
		assert(pthread_create(&pt, NULL, group, (void*)(long)n) == 0);

	for (;;) {
		sleep(1);
		for (n = 0; n < GROUPS; ++n) {
			if (ALIVE[n] == 0)
				printf("!!! thread %d stuck?\n", n);
			ALIVE[n] = 0;
		}
	}
}

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-28 14:30             ` Oleg Nesterov
@ 2025-02-28 16:33               ` Oleg Nesterov
  2025-03-03  9:46                 ` Sapkal, Swapnil
  0 siblings, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-28 16:33 UTC (permalink / raw)
  To: Sapkal, Swapnil
  Cc: Mateusz Guzik, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells, WangYuli, linux-fsdevel, linux-kernel,
	K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

And... I know, I know you already hate me ;) but if you have time, could
you check if this patch (with or without the previous debugging patch)
makes any difference? Just to be sure.

Oleg.
---
diff --git a/fs/pipe.c b/fs/pipe.c
index 4336b8cccf84..524b8845523e 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -445,7 +445,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		return 0;
 
 	mutex_lock(&pipe->mutex);
-
+again:
 	if (!pipe->readers) {
 		send_sig(SIGPIPE, current, 0);
 		ret = -EPIPE;
@@ -467,20 +467,24 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;
+		int xxx;
 
 		if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
 		    offset + chars <= PAGE_SIZE) {
-			ret = pipe_buf_confirm(pipe, buf);
-			if (ret)
+			xxx = pipe_buf_confirm(pipe, buf);
+			if (xxx) {
+				if (!ret) ret = xxx;
 				goto out;
+			}
 
-			ret = copy_page_from_iter(buf->page, offset, chars, from);
-			if (unlikely(ret < chars)) {
-				ret = -EFAULT;
+			xxx = copy_page_from_iter(buf->page, offset, chars, from);
+			if (unlikely(xxx < chars)) {
+				if (!ret) ret = -EFAULT;
 				goto out;
 			}
 
-			buf->len += ret;
+			ret += xxx;
+			buf->len += xxx;
 			if (!iov_iter_count(from))
 				goto out;
 		}
@@ -567,6 +571,7 @@ atomic_inc(&WR_SLEEP);
 		mutex_lock(&pipe->mutex);
 		was_empty = pipe_empty(pipe->head, pipe->tail);
 		wake_next_writer = true;
+		goto again;
 	}
 out:
 	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-28 16:33               ` Oleg Nesterov
@ 2025-03-03  9:46                 ` Sapkal, Swapnil
  2025-03-03 14:37                   ` Mateusz Guzik
                                     ` (2 more replies)
  0 siblings, 3 replies; 109+ messages in thread
From: Sapkal, Swapnil @ 2025-03-03 9:46 UTC (permalink / raw)
  To: Oleg Nesterov, K Prateek Nayak
  Cc: Mateusz Guzik, Manfred Spraul, Linus Torvalds, Christian Brauner,
	David Howells, WangYuli, linux-fsdevel, linux-kernel,
	Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan

[-- Attachment #1: Type: text/plain, Size: 3944 bytes --]

Hi Oleg,

On 2/28/2025 10:03 PM, Oleg Nesterov wrote:
> And... I know, I know you already hate me ;)
>
Not at all :)

> but if you have time, could you check if this patch (with or without the
> previous debugging patch) makes any difference? Just to be sure.
>
Sure, I will give this a try.

But in the meanwhile me and Prateek tried some of the experiments in the
weekend. We were able to reproduce this issue on a third generation EPYC
system as well as on an Intel Emerald Rapids (2 X INTEL(R) XEON(R)
PLATINUM 8592+).

We tried heavy hammered tracing approach over the weekend on top of your
debug patch. I have attached the debug patch below. With tracing we found
the following case for pipe_writable():

  hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1

Here,

  head = 37
  tail = 38
  max_usage = 16
  pipe_full() returns 1.

Between reading of head and later the tail, the tail seems to have moved
ahead of the head leading to wraparound. Applying the following changes I
have not yet run into a hang on the original machine where I first saw it:

diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
 	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	unsigned int head, tail;
+
+	tail = READ_ONCE(pipe->tail);
+	/*
+	 * Since the unsigned arithmetic in this lockless preemptible context
+	 * relies on the fact that the tail can never be ahead of head, read
+	 * the head after the tail to ensure we've not missed any updates to
+	 * the head. Reordering the reads can cause wraparounds and give the
+	 * illusion that the pipe is full.
+	 */
+	smp_rmb();
+	head = READ_ONCE(pipe->head);
 
 	return !pipe_full(head, tail, max_usage) ||
 		!READ_ONCE(pipe->readers);
---

smp_rmb() on x86 is a nop and even without the barrier we were not able to
reproduce the hang even after 10000 iterations.

If you think this is a genuine bug fix, I will send a patch for this.

Thanks to Prateek who was actively involved in this debug.

--
Thanks and Regards,
Swapnil

> Oleg.
> ---
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4336b8cccf84..524b8845523e 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -445,7 +445,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		return 0;
>  
>  	mutex_lock(&pipe->mutex);
> -
> +again:
>  	if (!pipe->readers) {
>  		send_sig(SIGPIPE, current, 0);
>  		ret = -EPIPE;
> @@ -467,20 +467,24 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		unsigned int mask = pipe->ring_size - 1;
>  		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
>  		int offset = buf->offset + buf->len;
> +		int xxx;
>  
>  		if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
>  		    offset + chars <= PAGE_SIZE) {
> -			ret = pipe_buf_confirm(pipe, buf);
> -			if (ret)
> +			xxx = pipe_buf_confirm(pipe, buf);
> +			if (xxx) {
> +				if (!ret) ret = xxx;
>  				goto out;
> +			}
>  
> -			ret = copy_page_from_iter(buf->page, offset, chars, from);
> -			if (unlikely(ret < chars)) {
> -				ret = -EFAULT;
> +			xxx = copy_page_from_iter(buf->page, offset, chars, from);
> +			if (unlikely(xxx < chars)) {
> +				if (!ret) ret = -EFAULT;
>  				goto out;
>  			}
>  
> -			buf->len += ret;
> +			ret += xxx;
> +			buf->len += xxx;
>  			if (!iov_iter_count(from))
>  				goto out;
>  		}
> @@ -567,6 +571,7 @@ atomic_inc(&WR_SLEEP);
>  		mutex_lock(&pipe->mutex);
>  		was_empty = pipe_empty(pipe->head, pipe->tail);
>  		wake_next_writer = true;
> +		goto again;
>  	}
>  out:
>  	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
>

[-- Attachment #2: debug.diff --]
[-- Type: text/plain, Size: 8448 bytes --]

diff --git a/fs/pipe.c b/fs/pipe.c
index 82fede0f2111..a0b737a8b8f9 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -217,6 +217,20 @@ static inline bool pipe_readable(const struct pipe_inode_info *pipe)
 	return !pipe_empty(head, tail) || !writers;
 }
 
+/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
+static inline bool pipe_readable_sleep_check(const struct pipe_inode_info *pipe)
+{
+	unsigned int head = READ_ONCE(pipe->head);
+	unsigned int tail = READ_ONCE(pipe->tail);
+	unsigned int writers = READ_ONCE(pipe->writers);
+	bool empty = pipe_empty(head, tail);
+	bool ret = !empty || !writers;
+
+	trace_printk("%p: %d: %u %u: %d\n", (void*)pipe, ret, head, tail, empty);
+
+	return ret;
+}
+
 static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 					    struct pipe_buffer *buf,
 					    unsigned int tail)
@@ -243,6 +257,7 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 	 * Without a watch_queue, we can simply increment the tail
 	 * without the spinlock - the mutex is enough.
 	 */
+	trace_printk("%p: t: %u -> %u\n", (void*)pipe, pipe->tail, pipe->tail + 1);
 	pipe->tail = ++tail;
 	return tail;
 }
@@ -388,7 +403,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		 * since we've done any required wakeups and there's no need
 		 * to mark anything accessed. And we've dropped the lock.
 		 */
-		if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
+		if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable_sleep_check(pipe)) < 0)
 			return -ERESTARTSYS;
 
 		wake_writer = false;
@@ -397,6 +412,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 	}
 	if (pipe_empty(pipe->head, pipe->tail))
 		wake_next_reader = false;
+	if (ret > 0)
+		pipe->r_cnt++;
 	mutex_unlock(&pipe->mutex);
 
 	if (wake_writer)
@@ -425,6 +442,19 @@ static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 		!READ_ONCE(pipe->readers);
 }
 
+/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
+static inline bool pipe_writable_sleep_check(const struct pipe_inode_info *pipe)
+{
+	unsigned int head = READ_ONCE(pipe->head);
+	unsigned int tail = READ_ONCE(pipe->tail);
+	unsigned int max_usage = READ_ONCE(pipe->max_usage);
+	bool full = pipe_full(head, tail, max_usage);
+	bool ret = !full || !READ_ONCE(pipe->readers);
+
+	trace_printk("%p: %d: %u %u %u: %d\n", (void*)pipe, ret, head, tail, max_usage, full);
+	return ret;
+}
+
 static ssize_t
 pipe_write(struct kiocb *iocb, struct iov_iter *from)
 {
@@ -490,6 +520,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			}
 
 			buf->len += ret;
+			trace_printk("%p: m: %u\n", (void*)pipe, head);
 			if (!iov_iter_count(from))
 				goto out;
 		}
@@ -525,6 +556,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			 * be there for the next write.
 			 */
 			pipe->head = head + 1;
+			trace_printk("%p: h: %u -> %u\n", (void*)pipe, head, head + 1);
 
 			/* Insert it into the buffer array */
 			buf = &pipe->bufs[head & mask];
@@ -577,7 +609,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		if (was_empty)
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
-		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
+		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable_sleep_check(pipe));
 		mutex_lock(&pipe->mutex);
 		was_empty = pipe_empty(pipe->head, pipe->tail);
 		wake_next_writer = true;
@@ -585,6 +617,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 out:
 	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 		wake_next_writer = false;
+	if (ret > 0)
+		pipe->w_cnt++;
 	mutex_unlock(&pipe->mutex);
 
 	/*
@@ -705,6 +739,50 @@ pipe_poll(struct file *filp, poll_table *wait)
 	return mask;
 }
 
+static DEFINE_MUTEX(PI_MUTEX);
+static LIST_HEAD(PI_LIST);
+
+void pi_dump(void);
+void pi_dump(void)
+{
+	struct pipe_inode_info *pipe;
+
+	pr_crit("---------- DUMP START ----------\n");
+	mutex_lock(&PI_MUTEX);
+	list_for_each_entry(pipe, &PI_LIST, pi_list) {
+		unsigned head, tail;
+
+		mutex_lock(&pipe->mutex);
+		head = pipe->head;
+		tail = pipe->tail;
+		pr_crit("inode: %p\n", (void*)pipe);
+		pr_crit("E=%d F=%d; W=%d R=%d\n",
+			pipe_empty(head, tail), pipe_full(head, tail, pipe->max_usage),
+			pipe->w_cnt, pipe->r_cnt);
+
+// INCOMPLETE
+pr_crit("RD=%d WR=%d\n", waitqueue_active(&pipe->rd_wait), waitqueue_active(&pipe->wr_wait));
+
+		if (pipe_empty(head, tail) && waitqueue_active(&pipe->rd_wait) && waitqueue_active(&pipe->wr_wait)) {
+			pr_crit("RD waiters:\n");
+			__wait_queue_traverse_print_tasks(&pipe->rd_wait);
+			pr_crit("WR waiters:\n");
+			__wait_queue_traverse_print_tasks(&pipe->wr_wait);
+		}
+
+		for (; tail < head; tail++) {
+			struct pipe_buffer *buf = pipe_buf(pipe, tail);
+			WARN_ON(buf->ops != &anon_pipe_buf_ops);
+			pr_crit("buf: o=%d l=%d\n", buf->offset, buf->len);
+		}
+		pr_crit("\n");
+
+		mutex_unlock(&pipe->mutex);
+	}
+	mutex_unlock(&PI_MUTEX);
+	pr_crit("---------- DUMP END ------------\n");
+}
+
 static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
 {
 	int kill = 0;
@@ -716,8 +794,14 @@ static void put_pipe_info(struct inode *inode, struct pipe_inode_info *pipe)
 	}
 	spin_unlock(&inode->i_lock);
 
-	if (kill)
+	if (kill) {
+		if (!list_empty(&pipe->pi_list)) {
+			mutex_lock(&PI_MUTEX);
+			list_del_init(&pipe->pi_list);
+			mutex_unlock(&PI_MUTEX);
+		}
 		free_pipe_info(pipe);
+	}
 }
 
 static int
@@ -800,6 +884,13 @@ struct pipe_inode_info *alloc_pipe_info(void)
 	if (pipe == NULL)
 		goto out_free_uid;
 
+	INIT_LIST_HEAD(&pipe->pi_list);
+	if (!strcmp(current->comm, "hackbench")) {
+		mutex_lock(&PI_MUTEX);
+		list_add_tail(&pipe->pi_list, &PI_LIST);
+		mutex_unlock(&PI_MUTEX);
+	}
+
 	if (pipe_bufs * PAGE_SIZE > max_size && !capable(CAP_SYS_RESOURCE))
 		pipe_bufs = max_size >> PAGE_SHIFT;
 
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..48d9bf5171dc 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -80,6 +80,9 @@ struct pipe_inode_info {
 #ifdef CONFIG_WATCH_QUEUE
 	struct watch_queue *watch_queue;
 #endif
+
+	struct list_head pi_list;
+	unsigned w_cnt, r_cnt;
 };
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 6d90ad974408..2c37517f6a05 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -215,6 +215,7 @@ void __wake_up_locked_sync_key(struct wait_queue_head *wq_head, unsigned int mod
 void __wake_up_locked(struct wait_queue_head *wq_head, unsigned int mode, int nr);
 void __wake_up_sync(struct wait_queue_head *wq_head, unsigned int mode);
 void __wake_up_pollfree(struct wait_queue_head *wq_head);
+void __wait_queue_traverse_print_tasks(struct wait_queue_head *wq_head);
 
 #define wake_up(x)			__wake_up(x, TASK_NORMAL, 1, NULL)
 #define wake_up_nr(x, nr)		__wake_up(x, TASK_NORMAL, nr, NULL)
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..8f33da87a219 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -174,6 +174,29 @@ void __wake_up_sync_key(struct wait_queue_head *wq_head, unsigned int mode,
 }
 EXPORT_SYMBOL_GPL(__wake_up_sync_key);
 
+void __wait_queue_traverse_print_tasks(struct wait_queue_head *wq_head)
+{
+	wait_queue_entry_t *curr, *next;
+	unsigned long flags;
+
+	if (unlikely(!wq_head))
+		return;
+
+	spin_lock_irqsave(&wq_head->lock, flags);
+	curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);
+
+	if (&curr->entry == &wq_head->head)
+		return;
+
+	list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
+		struct task_struct *tsk = (struct task_struct *)curr->private;
+
+		pr_crit("%d(%s)\n", tsk->pid, tsk->comm);
+	}
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(__wait_queue_traverse_print_tasks);
+
 /**
  * __wake_up_locked_sync_key - wake up a thread blocked on a locked waitqueue.
  * @wq_head: the waitqueue
diff --git a/kernel/sys.c b/kernel/sys.c
index c4c701c6f0b4..676e623d491d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2477,6 +2477,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 	error = 0;
 	switch (option) {
+	case 666: {
+		extern void pi_dump(void);
+		pi_dump();
+		break;
+	}
 	case PR_SET_PDEATHSIG:
 		if (!valid_signal(arg2)) {
 			error = -EINVAL;

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-03-03  9:46                 ` Sapkal, Swapnil
@ 2025-03-03 14:37                   ` Mateusz Guzik
  2025-03-03 14:51                     ` Mateusz Guzik
  2025-03-03 16:49                   ` Oleg Nesterov
  2025-03-04  5:06                   ` Hillf Danton
  2 siblings, 1 reply; 109+ messages in thread
From: Mateusz Guzik @ 2025-03-03 14:37 UTC (permalink / raw)
  To: Sapkal, Swapnil
  Cc: Oleg Nesterov, K Prateek Nayak, Manfred Spraul, Linus Torvalds,
	Christian Brauner, David Howells, WangYuli, linux-fsdevel,
	linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay,
	Ananth.narayan

On Mon, Mar 3, 2025 at 10:46 AM Sapkal, Swapnil <swapnil.sapkal@amd.com> wrote:
> But in the meanwhile me and Prateek tried some of the experiments in the weekend.
> We were able to reproduce this issue on a third generation EPYC system as well as
> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
>
> We tried heavy hammered tracing approach over the weekend on top of your debug patch.
> I have attached the debug patch below. With tracing we found the following case for
> pipe_writable():
>
> hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
>
> Here,
>
> head = 37
> tail = 38
> max_usage = 16
> pipe_full() returns 1.
>

AFAICT the benchmark has one reader per fd, but multiple writers.

Maybe I'm misunderstanding something, but for such a case I think this
is expected as a possible transient condition and while not ideal, it
should not lead to the bug at hand.

Suppose there is only one reader and one writer and a wakeup-worthy
condition showed up. Then both sides perform wakeups *after* dropping
the pipe mutex, meaning their state is published before whoever they
intend to wake up gets on CPU. At the same time any new arrivals which
did not sleep start with taking the mutex.

Suppose there are two or more writers (one of which is blocked) and
still one reader and the pipe transitions to no longer full. Before
the woken up writer reaches pipe_writable() the pipe could have
transitioned to any state an arbitrary number of times, but someone
had to observe the correct state. In particular it is legitimate for a
non-sleeping writer to sneak in and fill in the pipe and the reader to
have time to get back empty it again etc.

Or to put it differently, if the patch does correct the bug, it needs
to explain how everyone ends up blocked. Per the above, there always
should be at least one writer and one reader who make progress -- this
somehow breaks (hence the bug), but I don't believe the memory
ordering explains it.

Consequently I think the patch happens to hide the bug, just like the
now removed spurious wakeup used to do.

> Between reading of head and later the tail, the tail seems to have moved ahead of the
> head leading to wraparound. Applying the following changes I have not yet run into a
> hang on the original machine where I first saw it:
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index ce1af7592780..a1931c817822 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
> +	unsigned int head, tail;
> +
> +	tail = READ_ONCE(pipe->tail);
> +	/*
> +	 * Since the unsigned arithmetic in this lockless preemptible context
> +	 * relies on the fact that the tail can never be ahead of head, read
> +	 * the head after the tail to ensure we've not missed any updates to
> +	 * the head. Reordering the reads can cause wraparounds and give the
> +	 * illusion that the pipe is full.
> +	 */
> +	smp_rmb();
> +	head = READ_ONCE(pipe->head);
>
>  	return !pipe_full(head, tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);
> ---
>
> smp_rmb() on x86 is a nop and even without the barrier we were not able to
> reproduce the hang even after 10000 iterations.
>
> If you think this is a genuine bug fix, I will send a patch for this.
>

--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-03-03 14:37                   ` Mateusz Guzik
@ 2025-03-03 14:51                     ` Mateusz Guzik
  2025-03-03 15:31                       ` K Prateek Nayak
  0 siblings, 1 reply; 109+ messages in thread
From: Mateusz Guzik @ 2025-03-03 14:51 UTC (permalink / raw)
  To: Sapkal, Swapnil
  Cc: Oleg Nesterov, K Prateek Nayak, Manfred Spraul, Linus Torvalds,
	Christian Brauner, David Howells, WangYuli, linux-fsdevel,
	linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay,
	Ananth.narayan

On Mon, Mar 3, 2025 at 3:37 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Mar 3, 2025 at 10:46 AM Sapkal, Swapnil <swapnil.sapkal@amd.com> wrote:
> > But in the meanwhile me and Prateek tried some of the experiments in the weekend.
> > We were able to reproduce this issue on a third generation EPYC system as well as
> > on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).
> >
> > We tried heavy hammered tracing approach over the weekend on top of your debug patch.
> > I have attached the debug patch below. With tracing we found the following case for
> > pipe_writable():
> >
> > hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
> >
> > Here,
> >
> > head = 37
> > tail = 38
> > max_usage = 16
> > pipe_full() returns 1.
> >
>
> AFAICT the benchmark has one reader per fd, but multiple writers.
>
> Maybe I'm misunderstanding something, but for such a case I think this
> is expected as a possible transient condition and while not ideal, it
> should not lead to the bug at hand.
>
> Suppose there is only one reader and one writer and a wakeup-worthy
> condition showed up. Then both sides perform wakeups *after* dropping
> the pipe mutex, meaning their state is published before whoever they
> intend to wake up gets on CPU. At the same time any new arrivals which
> did not sleep start with taking the mutex.
>
> Suppose there are two or more writers (one of which is blocked) and
> still one reader and the pipe transitions to no longer full. Before
> the woken up writer reaches pipe_writable() the pipe could have
> transitioned to any state an arbitrary number of times, but someone
> had to observe the correct state. In particular it is legitimate for a
> non-sleeping writer to sneak in and fill in the pipe and the reader to
> have time to get back empty it again etc.
>
> Or to put it differently, if the patch does correct the bug, it needs
> to explain how everyone ends up blocked. Per the above, there always
> should be at least one writer and one reader who make progress -- this
> somehow breaks (hence the bug), but I don't believe the memory
> ordering explains it.
>
> Consequently I think the patch happens to hide the bug, just like the
> now removed spurious wakeup used to do.
>

Now that I wrote the above, I had an epiphany and indeed there may be
something to it. :)

Suppose the pipe is full, the reader drains one buffer and issues a
wakeup on its way out. There is still several bytes stored to read.

Suppose the woken up writer is still trying to get on CPU.

On subsequent calls the reader keeps messing with the tail, *exposing*
the possibility of the pipe_writable check being racy even if there is
only one reader and one writer.

I'm gonna grab lunch and chew on it, but I think you guys are on the
right track. But some more fences may be needed.

> > Between reading of head and later the tail, the tail seems to have moved ahead of the
> > head leading to wraparound. Applying the following changes I have not yet run into a
> > hang on the original machine where I first saw it:
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index ce1af7592780..a1931c817822 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
> >  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
> >  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
> >  {
> > -	unsigned int head = READ_ONCE(pipe->head);
> > -	unsigned int tail = READ_ONCE(pipe->tail);
> >  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
> > +	unsigned int head, tail;
> > +
> > +	tail = READ_ONCE(pipe->tail);
> > +	/*
> > +	 * Since the unsigned arithmetic in this lockless preemptible context
> > +	 * relies on the fact that the tail can never be ahead of head, read
> > +	 * the head after the tail to ensure we've not missed any updates to
> > +	 * the head. Reordering the reads can cause wraparounds and give the
> > +	 * illusion that the pipe is full.
> > +	 */
> > +	smp_rmb();
> > +	head = READ_ONCE(pipe->head);
> >
> >  	return !pipe_full(head, tail, max_usage) ||
> >  		!READ_ONCE(pipe->readers);
> > ---
> >
> > smp_rmb() on x86 is a nop and even without the barrier we were not able to
> > reproduce the hang even after 10000 iterations.
> >
> > If you think this is a genuine bug fix, I will send a patch for this.
> >
>
> --
> Mateusz Guzik <mjguzik gmail.com>

--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 14:51 ` Mateusz Guzik @ 2025-03-03 15:31 ` K Prateek Nayak 2025-03-03 17:54 ` Mateusz Guzik 0 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-03 15:31 UTC (permalink / raw) To: Mateusz Guzik, Sapkal, Swapnil Cc: Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan Hello Mateusz, On 3/3/2025 8:21 PM, Mateusz Guzik wrote: > On Mon, Mar 3, 2025 at 3:37 PM Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> On Mon, Mar 3, 2025 at 10:46 AM Sapkal, Swapnil <swapnil.sapkal@amd.com> wrote: >>> But in the meanwhile me and Prateek tried some of the experiments in the weekend. >>> We were able to reproduce this issue on a third generation EPYC system as well as >>> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+). >>> >>> We tried heavy hammered tracing approach over the weekend on top of your debug patch. >>> I have attached the debug patch below. With tracing we found the following case for >>> pipe_writable(): >>> >>> hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1 >>> >>> Here, >>> >>> head = 37 >>> tail = 38 >>> max_usage = 16 >>> pipe_full() returns 1. >>> >> >> AFAICT the benchmark has one reader per fd, but multiple writers. >> >> Maybe I'm misunderstanding something, but for such a case I think this >> is expected as a possible transient condition and while not ideal, it >> should not lead to the bug at hand. >> >> Suppose there is only one reader and one writer and a wakeup-worthy >> condition showed up. Then both sides perform wakeups *after* dropping >> the pipe mutex, meaning their state is published before whoever they >> intend to wake up gets on CPU. At the same time any new arrivals which >> did not sleep start with taking the mutex. 
>>
>> Suppose there are two or more writers (one of which is blocked) and
>> still one reader and the pipe transitions to no longer full. Before
>> the woken up writer reaches pipe_writable() the pipe could have
>> transitioned to any state an arbitrary number of times, but someone
>> had to observe the correct state. In particular it is legitimate for a
>> non-sleeping writer to sneak in and fill in the pipe, and for the reader
>> to have time to get back and empty it again, etc.
>>
>> Or to put it differently, if the patch does correct the bug, it needs
>> to explain how everyone ends up blocked. Per the above, there always
>> should be at least one writer and one reader who make progress -- this
>> somehow breaks (hence the bug), but I don't believe the memory
>> ordering explains it.
>>
>> Consequently I think the patch happens to hide the bug, just like the
>> now removed spurious wakeup used to do.
>>
>
> Now that I wrote the above, I had an epiphany and indeed there may be
> something to it. :)
>
> Suppose the pipe is full, the reader drains one buffer and issues a
> wakeup on its way out. There are still several bytes stored to read.
>
> Suppose the woken up writer is still trying to get on CPU.
>
> On subsequent calls the reader keeps messing with the tail, *exposing*
> the possibility of the pipe_writable check being racy even if there is
> only one reader and one writer.

Yup. One possibility, looking at the larger trace data: we may have a
situation as follows. Say:

    pipe->head = 16
    pipe->tail = 15

Two writers were waiting on a reader since the pipe was full, and action ...

    Reader                      Writer1                     Writer2
    ======                      =======                     =======
    pipe->tail = tail + 1 (16)
    (Wakes up writer 1)
                                (!pipe_full() yay!)
    done
                                pipe->head = head + 1 (17)
                                (pipe is not full;
                                 wake writer2)
                                                            (yawn! I'm up)
                                done
                                                            head = READ_ONCE(pipe->head) (17)
                                                            ... (interrupted)
    (Guess who's back)
    pipe->tail = tail + 1 (17)
                                (back again)
                                ...
                                pipe->head = head + 1 (18)
    (reader's back)
                                (I'm done mate!)
                                ...
    (back)
    pipe->tail = tail + 1 (18)
                                                            tail = READ_ONCE(pipe->tail) (18)
                                                            ...
                                                            (u32)(17 - 18) >= 16?
                                                            (Yup! Pipe is full)
    (Sleeps until pipe                                      (Sleeps until pipe
     has data)                                               has room)

---

Now the above might be totally wrong and I might have missed a few
intricacies of the wakeups in pipes, but below is the trace snippet that
led us to try rearranging the reads and test again:

    hackbench-118768  1029.549127: pipe_write: 000000005eea28ff: 0: 32 16 16: 1
    (118768 - sleeps)
    ...
    hackbench-118766  1029.550592: pipe_write: 000000005eea28ff: h: 37 -> 38
    (118766 - write succeeds)
    ...
    hackbench-118740  1029.550599: pipe_read: 000000005eea28ff: t: 37 -> 38
    (118740 - read succeeds)
    ...
    hackbench-118740  1029.550599: pipe_read: 000000005eea28ff: 0: 38 38: 1
    (118740 - sleeps)
    hackbench-118768  1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1
    (118768 - wakes up; finds tail (38) > head (37); sleeps)

The trace goes on, but if at this point 118766 were to drop out, 118740
and 118768 would both indefinitely wait on each other.

This is uncharted territory for Swapnil and me, so we are trying a bunch
of stuff based on patterns we see - any and all advice is greatly
appreciated.

--
Thanks and Regards,
Prateek

>
> I'm gonna grab lunch and chew on it, but I think you guys are on the
> right track. But some more fences may be needed.
>
>>> Between reading of head and later the tail, the tail seems to have
>>> moved ahead of the head, leading to wraparound.
Applying the following changes I have not yet run into a >>> hang on the original machine where I first saw it: >>> >>> diff --git a/fs/pipe.c b/fs/pipe.c >>> index ce1af7592780..a1931c817822 100644 >>> --- a/fs/pipe.c >>> +++ b/fs/pipe.c >>> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file) >>> /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ >>> static inline bool pipe_writable(const struct pipe_inode_info *pipe) >>> { >>> - unsigned int head = READ_ONCE(pipe->head); >>> - unsigned int tail = READ_ONCE(pipe->tail); >>> unsigned int max_usage = READ_ONCE(pipe->max_usage); >>> + unsigned int head, tail; >>> + >>> + tail = READ_ONCE(pipe->tail); >>> + /* >>> + * Since the unsigned arithmetic in this lockless preemptible context >>> + * relies on the fact that the tail can never be ahead of head, read >>> + * the head after the tail to ensure we've not missed any updates to >>> + * the head. Reordering the reads can cause wraparounds and give the >>> + * illusion that the pipe is full. >>> + */ >>> + smp_rmb(); >>> + head = READ_ONCE(pipe->head); >>> >>> return !pipe_full(head, tail, max_usage) || >>> !READ_ONCE(pipe->readers); >>> --- >>> >>> smp_rmb() on x86 is a nop and even without the barrier we were not able to >>> reproduce the hang even after 10000 iterations. >>> >>> If you think this is a genuine bug fix, I will send a patch for this. >>> >> >> >> -- >> Mateusz Guzik <mjguzik gmail.com> > > > ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 15:31 ` K Prateek Nayak @ 2025-03-03 17:54 ` Mateusz Guzik 2025-03-03 18:11 ` Linus Torvalds 2025-03-03 18:32 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak 0 siblings, 2 replies; 109+ messages in thread From: Mateusz Guzik @ 2025-03-03 17:54 UTC (permalink / raw) To: K Prateek Nayak Cc: Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan Can you guys try out the patch below? It changes things up so that there is no need to read 2 different vars. It is not the final version and I don't claim to be able to fully justify the thing at the moment either, but I would like to know if it fixes the problem. If you don't have time that's fine, this is a quick jab. While I can't reproduce the bug myself even after inserting a delay by hand with msleep between the loads, I verified it does not outright break either. 
:P

diff --git a/fs/pipe.c b/fs/pipe.c
index 19a7948ab234..e61ad589fc2c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -210,11 +210,21 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_readable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
-	unsigned int writers = READ_ONCE(pipe->writers);
+	return !READ_ONCE(pipe->isempty) || !READ_ONCE(pipe->writers);
+}
+
+static inline void pipe_recalc_state(struct pipe_inode_info *pipe)
+{
+	pipe->isempty = pipe_empty(pipe->head, pipe->tail);
+	pipe->isfull = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
+	VFS_BUG_ON(pipe->isempty && pipe->isfull);
+}
 
-	return !pipe_empty(head, tail) || !writers;
+static inline void pipe_update_head(struct pipe_inode_info *pipe,
+				    unsigned int head)
+{
+	pipe->head = ++head;
+	pipe_recalc_state(pipe);
 }
 
 static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
@@ -244,6 +254,7 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
 	 * without the spinlock - the mutex is enough.
 	 */
 	pipe->tail = ++tail;
+	pipe_recalc_state(pipe);
 	return tail;
 }
 
@@ -403,12 +414,7 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
-	unsigned int max_usage = READ_ONCE(pipe->max_usage);
-
-	return !pipe_full(head, tail, max_usage) ||
-		!READ_ONCE(pipe->readers);
+	return !READ_ONCE(pipe->isfull) || !READ_ONCE(pipe->readers);
 }
 
 static ssize_t
@@ -512,7 +518,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			break;
 		}
 
-		pipe->head = head + 1;
+		pipe_update_head(pipe, head);
 		pipe->tmp_page = NULL;
 		/* Insert it into the buffer array */
 		buf = &pipe->bufs[head & mask];
@@ -529,10 +535,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 
 			if (!iov_iter_count(from))
 				break;
-		}
 
-		if (!pipe_full(head, pipe->tail, pipe->max_usage))
 			continue;
+		}
 
 		/* Wait for buffer space to become available. */
 		if ((filp->f_flags & O_NONBLOCK) ||
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..d4b7539399b5 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -69,6 +69,8 @@ struct pipe_inode_info {
 	unsigned int r_counter;
 	unsigned int w_counter;
 	bool poll_usage;
+	bool isempty;
+	bool isfull;
 #ifdef CONFIG_WATCH_QUEUE
 	bool note_loss;
 #endif

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 17:54 ` Mateusz Guzik @ 2025-03-03 18:11 ` Linus Torvalds 2025-03-03 18:33 ` Mateusz Guzik 2025-03-03 18:32 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak 1 sibling, 1 reply; 109+ messages in thread From: Linus Torvalds @ 2025-03-03 18:11 UTC (permalink / raw) To: Mateusz Guzik Cc: K Prateek Nayak, Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, 3 Mar 2025 at 07:55, Mateusz Guzik <mjguzik@gmail.com> wrote: > > Can you guys try out the patch below? > > It changes things up so that there is no need to read 2 different vars. No, please don't do it this way. I think the memory ordering is interesting, and we ignored it - incorrectly - because all the "normal" cases are done either under the pipe lock (safe), or are done with "wait_event()" that will retry on wakeups. And then the subtle "was woken, but raced with order of operations" case got missed. This has probably been around forever (possibly since we got rid of the BKL). But I don't like the "add separate full/empty fields that duplicate things", just to have those written always under the lock, and then loaded as one op. I think there are better models. So I think I'd prefer the "add the barrier" model. We could also possibly just make head/tail be 16-bit fields, and then read things atomically by reading them as a single 32-bit word. That would expose the (existing) alpha issues more, since alpha doesn't have atomic 16-bit writes, but I can't find it in myself to care. I guess we could make it be two aligned 32-bit fields on alpha, and just use 64-bit reads. We already treat those fields specially with the whole READ_ONCE() dance, so treating them even more specially would not be a noticeably different situation. Hmm? 
I just generally dislike redundant information in data structures. Then you get into nasty cases where some path forgets to update the redundant fields correctly. So I'd really just prefer the existing model, just with being careful about this case. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 18:11 ` Linus Torvalds @ 2025-03-03 18:33 ` Mateusz Guzik 2025-03-03 18:55 ` Linus Torvalds 0 siblings, 1 reply; 109+ messages in thread From: Mateusz Guzik @ 2025-03-03 18:33 UTC (permalink / raw) To: Linus Torvalds Cc: K Prateek Nayak, Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, Mar 3, 2025 at 7:11 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > But I don't like the "add separate full/empty fields that duplicate > things", just to have those written always under the lock, and then > loaded as one op. > > I think there are better models. > > So I think I'd prefer the "add the barrier" model. > I was trying to avoid having to reason about the fences, I would argue not having to worry about this makes future changes easier to make. > We could also possibly just make head/tail be 16-bit fields, and then > read things atomically by reading them as a single 32-bit word. That > would expose the (existing) alpha issues more, since alpha doesn't > have atomic 16-bit writes, but I can't find it in myself to care. I > guess we could make it be two aligned 32-bit fields on alpha, and just > use 64-bit reads. I admit I did not think of this, pretty obvious now that you mention it. Perhaps either Prateek or Swapnil would be interested in coding this up? Ultimately the crux of the issue is their finding. > > I just generally dislike redundant information in data structures. > Then you get into nasty cases where some path forgets to update the > redundant fields correctly. So I'd really just prefer the existing > model, just with being careful about this case. > The stock code already has a dedicated routine to advance the tail, adding one for head (instead of an ad-hoc increment) is borderline just clean up. 
Then having both recalc the state imo does not add any bug-pronness [I did ignore the CONFIG_WATCH_QUEUE case though] afaics. Even so, per the above, that's a side note. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 18:33 ` Mateusz Guzik @ 2025-03-03 18:55 ` Linus Torvalds 2025-03-03 19:06 ` Mateusz Guzik 2025-03-03 20:27 ` Oleg Nesterov 0 siblings, 2 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-03 18:55 UTC (permalink / raw) To: Mateusz Guzik Cc: K Prateek Nayak, Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, 3 Mar 2025 at 08:33, Mateusz Guzik <mjguzik@gmail.com> wrote: > > The stock code already has a dedicated routine to advance the tail, > adding one for head (instead of an ad-hoc increment) is borderline > just clean up. There's currently a fair number of open-coded assignments: git grep -E 'pipe->((tail)|(head)).*=' fs/ and some of those are under specific locking rules together with other updates (ie the watch-queue 'note_loss' thing. But hey, if some explicit empty/full flag is simpler, then it certainly does fit with our current model too, since we already do have those other flags (exactly like 'note_loss') I do particularly hate seeing 'bool' in structures like this. On alpha it is either fundamentally racy, or it's 32-bit. On other architectures, it's typically 8 bits for a 1-bit value. But we do have holes in that structure where it slots. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 18:55 ` Linus Torvalds @ 2025-03-03 19:06 ` Mateusz Guzik 2025-03-03 20:27 ` Oleg Nesterov 1 sibling, 0 replies; 109+ messages in thread From: Mateusz Guzik @ 2025-03-03 19:06 UTC (permalink / raw) To: Linus Torvalds Cc: K Prateek Nayak, Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, Mar 3, 2025 at 7:56 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 3 Mar 2025 at 08:33, Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > The stock code already has a dedicated routine to advance the tail, > > adding one for head (instead of an ad-hoc increment) is borderline > > just clean up. > > There's currently a fair number of open-coded assignments: > > git grep -E 'pipe->((tail)|(head)).*=' fs/ > > and some of those are under specific locking rules together with other > updates (ie the watch-queue 'note_loss' thing. > > But hey, if some explicit empty/full flag is simpler, then it > certainly does fit with our current model too, since we already do > have those other flags (exactly like 'note_loss') > > I do particularly hate seeing 'bool' in structures like this. On alpha > it is either fundamentally racy, or it's 32-bit. On other > architectures, it's typically 8 bits for a 1-bit value. > > But we do have holes in that structure where it slots. > I was thinking about just fs/pipe.c. These helpers being used elsewhere is not something I was aware of (or thought would be a thing). The relevant git grep makes me self-nak that patch. :-> But that's some side stuff. Not only is it your call how to proceed, but per the previous e-mail I agree 16-bit head/tail and a 32-bit read would be best. Hopefully the AMD guys will want to take a stab. Otherwise I'll hack it up. 
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 18:55 ` Linus Torvalds 2025-03-03 19:06 ` Mateusz Guzik @ 2025-03-03 20:27 ` Oleg Nesterov 2025-03-03 20:46 ` Linus Torvalds 1 sibling, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-03 20:27 UTC (permalink / raw) To: Linus Torvalds Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On 03/03, Linus Torvalds wrote: > > There's currently a fair number of open-coded assignments: > > git grep -E 'pipe->((tail)|(head)).*=' fs/ > > and some of those are under specific locking rules together with other > updates (ie the watch-queue 'note_loss' thing. Stupid question... but do we really need to change the code which update tail/head if we pack them into a single word? I mean,

	- unsigned int head;
	- unsigned int tail;
	+ union {
	+	struct {
	+		u16 head, tail;
	+	};
	+
	+	__u32 head_tail;
	+ };

Now pipe_writable() can do READ_ONCE(pipe->head_tail) "atomically" without preemption and this is all we need, no? Yes, pipe_writable() should take endianness into account, but this is simple... Oleg ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 20:27 ` Oleg Nesterov @ 2025-03-03 20:46 ` Linus Torvalds 2025-03-04 5:31 ` K Prateek Nayak ` (4 more replies) 0 siblings, 5 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-03 20:46 UTC (permalink / raw) To: Oleg Nesterov Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan [-- Attachment #1: Type: text/plain, Size: 475 bytes --] On Mon, 3 Mar 2025 at 10:28, Oleg Nesterov <oleg@redhat.com> wrote: > > Stupid question... but do we really need to change the code which update > tail/head if we pack them into a single word? No. It's only the READ_ONCE() parts that need changing. See this suggested patch, which does something very similar to what you were thinking of. ENTIRELY UNTESTED, but it seems to generate ok code. It might even generate better code than what we have now. 
Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3963 bytes --]

 fs/pipe.c                 | 19 ++++++++-----------
 include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..97cc70572606 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_readable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
+	union pipe_index idx = { READ_ONCE(pipe->head_tail) };
 	unsigned int writers = READ_ONCE(pipe->writers);
 
-	return !pipe_empty(head, tail) || !writers;
+	return !pipe_empty(idx.head, idx.tail) || !writers;
 }
 
 static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
@@ -417,11 +416,10 @@ static inline int is_packetized(struct file *file)
 /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
 static inline bool pipe_writable(const struct pipe_inode_info *pipe)
 {
-	unsigned int head = READ_ONCE(pipe->head);
-	unsigned int tail = READ_ONCE(pipe->tail);
+	union pipe_index idx = { READ_ONCE(pipe->head_tail) };
 	unsigned int max_usage = READ_ONCE(pipe->max_usage);
 
-	return !pipe_full(head, tail, max_usage) ||
+	return !pipe_full(idx.head, idx.tail, max_usage) ||
 		!READ_ONCE(pipe->readers);
 }
 
@@ -659,7 +657,7 @@ pipe_poll(struct file *filp, poll_table *wait)
 {
 	__poll_t mask;
 	struct pipe_inode_info *pipe = filp->private_data;
-	unsigned int head, tail;
+	union pipe_index idx;
 
 	/* Epoll has some historical nasty semantics, this enables them */
 	WRITE_ONCE(pipe->poll_usage, true);
@@ -680,19 +678,18 @@ pipe_poll(struct file *filp, poll_table *wait)
 	 * if something changes and you got it wrong, the poll
 	 * table entry will wake you up and fix it.
 	 */
-	head = READ_ONCE(pipe->head);
-	tail = READ_ONCE(pipe->tail);
+	idx.head_tail = READ_ONCE(pipe->head_tail);
 
 	mask = 0;
 	if (filp->f_mode & FMODE_READ) {
-		if (!pipe_empty(head, tail))
+		if (!pipe_empty(idx.head, idx.tail))
 			mask |= EPOLLIN | EPOLLRDNORM;
 		if (!pipe->writers && filp->f_pipe != pipe->w_counter)
 			mask |= EPOLLHUP;
 	}
 
 	if (filp->f_mode & FMODE_WRITE) {
-		if (!pipe_full(head, tail, pipe->max_usage))
+		if (!pipe_full(idx.head, idx.tail, pipe->max_usage))
 			mask |= EPOLLOUT | EPOLLWRNORM;
 		/*
 		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 8ff23bf5a819..b1a3b99f9ff8 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -31,6 +31,33 @@ struct pipe_buffer {
 	unsigned long private;
 };
 
+/*
+ * Really only alpha needs 32-bit fields, but
+ * might as well do it for 64-bit architectures
+ * since that's what we've historically done,
+ * and it makes 'head_tail' always be a simple
+ * 'unsigned long'.
+ */
+#ifdef CONFIG_64BIT
+ typedef unsigned int pipe_index_t;
+#else
+ typedef unsigned short pipe_index_t;
+#endif
+
+/*
+ * We have to declare this outside 'struct pipe_inode_info',
+ * but then we can't use 'union pipe_index' for an anonymous
+ * union, so we end up having to duplicate this declaration
+ * below. Annoying.
+ */
+union pipe_index {
+	unsigned long head_tail;
+	struct {
+		pipe_index_t head;
+		pipe_index_t tail;
+	};
+};
+
 /**
  * struct pipe_inode_info - a linux kernel pipe
  * @mutex: mutex protecting the whole thing
@@ -58,8 +85,16 @@ struct pipe_buffer {
 struct pipe_inode_info {
 	struct mutex mutex;
 	wait_queue_head_t rd_wait, wr_wait;
-	unsigned int head;
-	unsigned int tail;
+
+	/* This has to match the 'union pipe_index' above */
+	union {
+		unsigned long head_tail;
+		struct {
+			pipe_index_t head;
+			pipe_index_t tail;
+		};
+	};
+
 	unsigned int max_usage;
 	unsigned int ring_size;
 	unsigned int nr_accounted;

^ permalink raw reply related	[flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 20:46 ` Linus Torvalds @ 2025-03-04 5:31 ` K Prateek Nayak 2025-03-04 6:32 ` Linus Torvalds 2025-03-04 12:54 ` Oleg Nesterov ` (3 subsequent siblings) 4 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-04 5:31 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov Cc: Mateusz Guzik, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan Hello Linus, On 3/4/2025 2:16 AM, Linus Torvalds wrote: > On Mon, 3 Mar 2025 at 10:28, Oleg Nesterov <oleg@redhat.com> wrote: >> >> Stupid question... but do we really need to change the code which update >> tail/head if we pack them into a single word? > > No. It's only the READ_ONCE() parts that need changing. > > See this suggested patch, which does something very similar to what > you were thinking of. > > ENTIRELY UNTESTED, but it seems to generate ok code. It might even > generate better code than what we have now. With the patch on top of commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full"), we've not seen any hangs yet with a few thousand iterations of short loops, and a few hundred iterations of larger loop sizes with hackbench. If you can provide your S-o-b, we can send out an official patch with a commit log. We'll wait for Oleg's response in case he has any concerns. > > Linus -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 5:31 ` K Prateek Nayak @ 2025-03-04 6:32 ` Linus Torvalds 0 siblings, 0 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-04 6:32 UTC (permalink / raw) To: K Prateek Nayak Cc: Oleg Nesterov, Mateusz Guzik, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, 3 Mar 2025 at 19:31, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > > ENTIRELY UNTESTED, but it seems to generate ok code. It might even > > generate better code than what we have now. > > With the patch on top of commit aaec5a95d596 ("pipe_read: don't wake up > the writer if the pipe is still full"), we've not seen any hangs yet > with a few thousand iterations of short loops, and a few hundred > iterations of larger loop sizes with hackbench. > > If you can provide your S-o-b, we can send out an official patch with a > commit log. We'll wait for Oleg's response in case he has any concerns. Ack. With that testing background, please write a message and add my Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> and we'll get this all fixed up. I assume this all goes back to commit 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") back in 2019. Or possibly 85190d15f4ea ("pipe: don't use 'pipe_wait() for basic pipe IO")? But it was all hidden by the fact that we used to just wake things up very aggressively and you'd never notice the race as a result, so then it got exposed by the more minimal wakeup changes. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 20:46 ` Linus Torvalds 2025-03-04 5:31 ` K Prateek Nayak @ 2025-03-04 12:54 ` Oleg Nesterov 2025-03-04 13:25 ` Oleg Nesterov 2025-03-04 18:28 ` Linus Torvalds 2025-03-04 13:51 ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak ` (2 subsequent siblings) 4 siblings, 2 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-04 12:54 UTC (permalink / raw) To: Linus Torvalds Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On 03/03, Linus Torvalds wrote: > > ENTIRELY UNTESTED, but it seems to generate ok code. It might even > generate better code than what we have now. Reviewed-by: Oleg Nesterov <oleg@redhat.com> but I have another question... > static inline bool pipe_readable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > + union pipe_index idx = { READ_ONCE(pipe->head_tail) }; I thought this is wrong, but then I noticed that in your version ->head_tail is the 1st member in this union. Still perhaps union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; will look more clear? > +/* > + * Really only alpha needs 32-bit fields, but > + * might as well do it for 64-bit architectures > + * since that's what we've historically done, > + * and it makes 'head_tail' always be a simple > + * 'unsigned long'. > + */ > +#ifdef CONFIG_64BIT > + typedef unsigned int pipe_index_t; > +#else > + typedef unsigned short pipe_index_t; > +#endif I am just curious, why we can't use "unsigned short" unconditionally and avoid #ifdef ? Is "unsigned int" more efficient on 64-bit? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 12:54 ` Oleg Nesterov @ 2025-03-04 13:25 ` Oleg Nesterov 2025-03-04 18:28 ` Linus Torvalds 1 sibling, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-04 13:25 UTC (permalink / raw) To: Linus Torvalds Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On 03/04, Oleg Nesterov wrote: > > On 03/03, Linus Torvalds wrote: > > > > +/* > > + * Really only alpha needs 32-bit fields, but > > + * might as well do it for 64-bit architectures > > + * since that's what we've historically done, > > + * and it makes 'head_tail' always be a simple > > + * 'unsigned long'. > > + */ > > +#ifdef CONFIG_64BIT > > + typedef unsigned int pipe_index_t; > > +#else > > + typedef unsigned short pipe_index_t; > > +#endif > > I am just curious, why we can't use "unsigned short" unconditionally > and avoid #ifdef ? > > Is "unsigned int" more efficient on 64-bit? Ah, I guess I misread the comment... So, the problem is that 64-bit alpha can't write u16 "atomically" ? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 12:54 ` Oleg Nesterov 2025-03-04 13:25 ` Oleg Nesterov @ 2025-03-04 18:28 ` Linus Torvalds 2025-03-04 22:11 ` Oleg Nesterov 2025-03-05 4:40 ` K Prateek Nayak 1 sibling, 2 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-04 18:28 UTC (permalink / raw) To: Oleg Nesterov Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Tue, 4 Mar 2025 at 02:55, Oleg Nesterov <oleg@redhat.com> wrote: > > I thought this is wrong, but then I noticed that in your version > ->head_tail is the 1st member in this union. Yes. That was intentional, to make the code have much less extraneous noise. The only reason to ever use that "union pipe_index" is for this whole "one word for everything", so I feel that making it compact is actually more legible than adding extra markers. > > + * Really only alpha needs 32-bit fields, but > > + * might as well do it for 64-bit architectures > > + * since that's what we've historically done, > > + * and it makes 'head_tail' always be a simple > > + * 'unsigned long'. > > + */ > > +#ifdef CONFIG_64BIT > > + typedef unsigned int pipe_index_t; > > +#else > > + typedef unsigned short pipe_index_t; > > +#endif > > I am just curious, why we can't use "unsigned short" unconditionally > and avoid #ifdef ? > > Is "unsigned int" more efficient on 64-bit? The main reason is that a "unsigned short" write on alpha isn't atomic - it's a read-modify-write operation, and so it isn't safe to mix spin_lock_irq(&pipe->rd_wait.lock); ... pipe->tail = ++tail; ... spin_unlock_irq(&pipe->rd_wait.lock); with mutex_lock(&pipe->mutex); ... pipe->head = head + 1; ... 
mutex_unlock(&pipe->mutex); because while they are two different fields using two different locks, on alpha the above only works if they are in separate words (because updating one will do a read-and-write-back of the other). This is a fundamental alpha architecture bug. I was actually quite ready to just kill off alpha support entirely, because it's a dead architecture that is unfixably broken. But there's some crazy patches to make gcc generate horrific atomic code to make this all work on alpha by Maciej Rozycki, so one day we'll be in the situation that alpha can be considered "fixed", but we're not there yet. Do we really care? Maybe not. The race is probably very hard to hit, so with the two remaining alpha users, we could just say "let's just make it use 16-bit ops". But even on x86, 32-bit ops potentially generate just slightly better code due to lack of some prefix bytes. And those fields *used* to be 32-bit, so my patch basically kept the status quo on 64-bit machines (and just turned it into 16-bit fields on 32-bit architectures). Anyway, I wouldn't object to just unconditionally making it "two 16-bit indexes make a 32-bit head_tail" if it actually makes the structure smaller. It might not even matter on 64-bit because of alignment of fields around it - I didn't check. As mentioned, it was more of a combination of "alpha" plus "no change to relevant other architectures". Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 18:28 ` Linus Torvalds @ 2025-03-04 22:11 ` Oleg Nesterov 2025-03-05 4:40 ` K Prateek Nayak 1 sibling, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-04 22:11 UTC (permalink / raw) To: Linus Torvalds Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On 03/04, Linus Torvalds wrote: > > On Tue, 4 Mar 2025 at 02:55, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > + * Really only alpha needs 32-bit fields, but > > > + * might as well do it for 64-bit architectures > > > + * since that's what we've historically done, > > > + * and it makes 'head_tail' always be a simple > > > + * 'unsigned long'. > > > + */ > > > +#ifdef CONFIG_64BIT > > > + typedef unsigned int pipe_index_t; > > > +#else > > > + typedef unsigned short pipe_index_t; > > > +#endif > > > > I am just curious, why we can't use "unsigned short" unconditionally > > and avoid #ifdef ? > > > > Is "unsigned int" more efficient on 64-bit? > > The main reason is that a "unsigned short" write on alpha isn't atomic Yes, I have already realized this when I tried to actually read the comment, see my next email in reply to myself. But thanks for your detailed explanation! Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 18:28 ` Linus Torvalds 2025-03-04 22:11 ` Oleg Nesterov @ 2025-03-05 4:40 ` K Prateek Nayak 2025-03-05 4:52 ` Linus Torvalds 1 sibling, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-05 4:40 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov Cc: Mateusz Guzik, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan Hello Linus, On 3/4/2025 11:58 PM, Linus Torvalds wrote: > On Tue, 4 Mar 2025 at 02:55, Oleg Nesterov <oleg@redhat.com> wrote: >> >> I thought this is wrong, but then I noticed that in your version >> ->head_tail is the 1st member in this union. > > Yes. That was intentional, to make the code have much less extraneous noise. > > The only reason to ever use that "union pipe_index" is for this whole > "one word for everything", so I feel that making it compact is > actually more legible than adding extra markers. > >>> + * Really only alpha needs 32-bit fields, but >>> + * might as well do it for 64-bit architectures >>> + * since that's what we've historically done, >>> + * and it makes 'head_tail' always be a simple >>> + * 'unsigned long'. >>> + */ >>> +#ifdef CONFIG_64BIT >>> + typedef unsigned int pipe_index_t; >>> +#else >>> + typedef unsigned short pipe_index_t; >>> +#endif >> >> I am just curious, why we can't use "unsigned short" unconditionally >> and avoid #ifdef ? >> >> Is "unsigned int" more efficient on 64-bit? > > The main reason is that a "unsigned short" write on alpha isn't atomic > - it's a read-modify-write operation, and so it isn't safe to mix > > spin_lock_irq(&pipe->rd_wait.lock); > ... > pipe->tail = ++tail; > ... > spin_unlock_irq(&pipe->rd_wait.lock); From my understanding, this is still done with "pipe->mutex" held. 
Both anon_pipe_read() and pipe_resize_ring() will lock "pipe->mutex" first and then take the "pipe->rd_wait.lock" when updating "pipe->tail". "pipe->head" is always updated with "pipe->mutex" held. Could that be enough to guarantee that RMW on 16-bit data on Alpha is safe, since the updates to the two 16-bit fields are protected by the "pipe->mutex", or am I missing something? > > with > > mutex_lock(&pipe->mutex); > ... > pipe->head = head + 1; > ... > mutex_unlock(&pipe->mutex); > > because while they are two different fields using two different > locks, on alpha the above only works if they are in separate words > (because updating one will do a read-and-write-back of the other). > > This is a fundamental alpha architecture bug. I was actually quite > ready to just kill off alpha support entirely, because it's a dead > architecture that is unfixably broken. But there's some crazy patches > to make gcc generate horrific atomic code to make this all work on > alpha by Maciej Rozycki, so one day we'll be in the situation that > alpha can be considered "fixed", but we're not there yet. > > Do we really care? Maybe not. The race is probably very hard to hit, > so with the two remaining alpha users, we could just say "let's just > make it use 16-bit ops". > > But even on x86, 32-bit ops potentially generate just slightly better > code due to lack of some prefix bytes. > > And those fields *used* to be 32-bit, so my patch basically kept the > status quo on 64-bit machines (and just turned it into 16-bit fields > on 32-bit architectures). > > Anyway, I wouldn't object to just unconditionally making it "two > 16-bit indexes make a 32-bit head_tail" if it actually makes the > structure smaller. It might not even matter on 64-bit because of > alignment of fields around it - I didn't check. As mentioned, it was > more of a combination of "alpha" plus "no change to relevant other > architectures".
> > Linus -- Thanks and Regards, Prateek
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 4:40 ` K Prateek Nayak @ 2025-03-05 4:52 ` Linus Torvalds 0 siblings, 0 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-05 4:52 UTC (permalink / raw) To: K Prateek Nayak Cc: Oleg Nesterov, Mateusz Guzik, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Tue, 4 Mar 2025 at 18:41, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > > spin_lock_irq(&pipe->rd_wait.lock); > > ... > > pipe->tail = ++tail; > > ... > > spin_unlock_irq(&pipe->rd_wait.lock); > > From my understanding, this is still done with "pipe->mutex" held. Both > anon_pipe_read() and pipe_resize_ring() will lock "pipe->mutex" first > and then take the "pipe->rd_wait.lock" when updating "pipe->tail". > "pipe->head" is always updated with "pipe->mutex" held. No, see the actual watch_queue code: post_one_notification() in fs/watch_queue.c. It isn't the exact sequence I posted, it looks like smp_store_release(&pipe->head, head + 1); /* vs pipe_read() */ instead, and it's pipe->head there vs pipe->tail in pipe_read(). And I do think we end up having exclusion thanks to pipe_update_tail() taking that spinlock if the pipe is actually a watchqueue thing, so it might all be ok on alpha too. So *maybe* we can just make it all be two 16-bit words in a 32-bit thing, but somebody needs to walk through it all to make sure. Linus
* [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex 2025-03-03 20:46 ` Linus Torvalds 2025-03-04 5:31 ` K Prateek Nayak 2025-03-04 12:54 ` Oleg Nesterov @ 2025-03-04 13:51 ` K Prateek Nayak 2025-03-04 18:36 ` Alexey Gladkov 2025-03-04 19:03 ` Linus Torvalds 2025-03-05 15:31 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes 2025-03-05 16:40 ` Linus Torvalds 4 siblings, 2 replies; 109+ messages in thread From: K Prateek Nayak @ 2025-03-04 13:51 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Linus Torvalds, Oleg Nesterov, Swapnil Sapkal, Alexey Gladkov, linux-fsdevel, linux-kernel Cc: Jan Kara, Mateusz Guzik, Manfred Spraul, David Howells, WangYuli, Hillf Danton, Gautham R. Shenoy, Neeraj.Upadhyay, Ananth.narayan, K Prateek Nayak From: Linus Torvalds <torvalds@linux-foundation.org> pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" and "pipe->tail" outside of "pipe->mutex" critical section. When the head and the tail are read individually in that order, there is a window for interruption between the two reads in which both the head and the tail can be updated by concurrent readers and writers. One of the problematic scenarios observed with hackbench running multiple groups on a large server on a particular pipe inode is as follows: pipe->head = 36 pipe->tail = 36 hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full* hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36] hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740* hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768* hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up* hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37] ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces) hackbench-118740 [057] ..... 
1029.550558: pipe_read: *reader wakes up: pipe is not empty* hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37] hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768* hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps* hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in* hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37] hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766* hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty* hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38] hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty* hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768* ... CPU 206 switches back to writer hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38] hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep* [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] The unsigned arithmetic in pipe_occupancy() wraps around when "pipe->tail > pipe->head" leading to pipe_full() returning true despite the pipe being empty. The case of genuine wraparound of "pipe->head" is handled since pipe buffer has data allowing readers to make progress until the pipe->tail wraps too after which the reader will wakeup a sleeping writer, however, mistaking the pipe to be full when it is in fact empty can lead to readers and writers waiting on each other indefinitely. 
This issue became more problematic and surfaced as a hang in hackbench after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full") significantly reduced the number of spurious wakeups of writers that had previously helped mask the issue. To avoid missing any updates between the reads of "pipe->head" and "pipe->tail", unionize the two with a single unsigned long "pipe->head_tail" member that can be loaded atomically. Using "pipe->head_tail" to read the head and the tail ensures the lockless checks do not miss any updates to the head or the tail and since those two are only updated under "pipe->mutex", it ensures that the head is always ahead of, or equal to the tail resulting in correct calculations. [ prateek: commit log, testing on x86 platforms. ] Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/ Reported-by: Alexey Gladkov <legion@kernel.org> Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/ Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> --- Changes are based on: git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe at commit ee5eda8ea595 ("pipe: change pipe_write() to never add a zero-sized buffer") but also applies cleanly on top of v6.14-rc5. The diff from Linus is kept as is except for removing the whitespaces in front of the typedef that checkpatch complained about (the warning on usage of typedef itself has been ignored since I could not think of a better alternative other than #ifdef hackery in pipe_inode_info and the newly introduced pipe_index union.)
and the suggestion from Oleg to explicitly initialize the "head_tail" with: union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) } I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") for the "Fixes:" tag since pipe_poll() added: unsigned int head = READ_ONCE(pipe->head); unsigned int tail = READ_ONCE(pipe->tail); poll_wait(filp, &pipe->wait, wait); BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size); and the race described can trigger that BUG_ON() but as Linus pointed out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for basic pipe IO") is probably the one that can cause the writers to sleep on empty pipe since the pipe_wait() used prior did not drop the pipe lock until it called schedule() and prepare_to_wait() was called before pipe_unlock() ensuring no races. [1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/ Please let me know if the patch requires any modifications and I'll jump right on it. The changes have been tested on both a 5th Generation AMD EPYC system and on a dual socket Intel Emerald Rapids system with multiple thousand iterations of hackbench for small and large loop counts. Thanks a ton to Swapnil for all the help. 
--- fs/pipe.c | 19 ++++++++----------- include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++-- 2 files changed, 45 insertions(+), 13 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index b0641f75b1ba..780990f307ab 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ static inline bool pipe_readable(const struct pipe_inode_info *pipe) { - unsigned int head = READ_ONCE(pipe->head); - unsigned int tail = READ_ONCE(pipe->tail); + union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; unsigned int writers = READ_ONCE(pipe->writers); - return !pipe_empty(head, tail) || !writers; + return !pipe_empty(idx.head, idx.tail) || !writers; } static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe, @@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file) /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ static inline bool pipe_writable(const struct pipe_inode_info *pipe) { - unsigned int head = READ_ONCE(pipe->head); - unsigned int tail = READ_ONCE(pipe->tail); + union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; unsigned int max_usage = READ_ONCE(pipe->max_usage); - return !pipe_full(head, tail, max_usage) || + return !pipe_full(idx.head, idx.tail, max_usage) || !READ_ONCE(pipe->readers); } @@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait) { __poll_t mask; struct pipe_inode_info *pipe = filp->private_data; - unsigned int head, tail; + union pipe_index idx; /* Epoll has some historical nasty semantics, this enables them */ WRITE_ONCE(pipe->poll_usage, true); @@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait) * if something changes and you got it wrong, the poll * table entry will wake you up and fix it. 
*/ - head = READ_ONCE(pipe->head); - tail = READ_ONCE(pipe->tail); + idx.head_tail = READ_ONCE(pipe->head_tail); mask = 0; if (filp->f_mode & FMODE_READ) { - if (!pipe_empty(head, tail)) + if (!pipe_empty(idx.head, idx.tail)) mask |= EPOLLIN | EPOLLRDNORM; if (!pipe->writers && filp->f_pipe != pipe->w_counter) mask |= EPOLLHUP; } if (filp->f_mode & FMODE_WRITE) { - if (!pipe_full(head, tail, pipe->max_usage)) + if (!pipe_full(idx.head, idx.tail, pipe->max_usage)) mask |= EPOLLOUT | EPOLLWRNORM; /* * Most Unices do not set EPOLLERR for FIFOs but on Linux they diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 8ff23bf5a819..3cc4f8eab853 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -31,6 +31,33 @@ struct pipe_buffer { unsigned long private; }; +/* + * Really only alpha needs 32-bit fields, but + * might as well do it for 64-bit architectures + * since that's what we've historically done, + * and it makes 'head_tail' always be a simple + * 'unsigned long'. + */ +#ifdef CONFIG_64BIT +typedef unsigned int pipe_index_t; +#else +typedef unsigned short pipe_index_t; +#endif + +/* + * We have to declare this outside 'struct pipe_inode_info', + * but then we can't use 'union pipe_index' for an anonymous + * union, so we end up having to duplicate this declaration + * below. Annoying. 
+ */ +union pipe_index { + unsigned long head_tail; + struct { + pipe_index_t head; + pipe_index_t tail; + }; +}; + /** * struct pipe_inode_info - a linux kernel pipe * @mutex: mutex protecting the whole thing @@ -58,8 +85,16 @@ struct pipe_buffer { struct pipe_inode_info { struct mutex mutex; wait_queue_head_t rd_wait, wr_wait; - unsigned int head; - unsigned int tail; + + /* This has to match the 'union pipe_index' above */ + union { + unsigned long head_tail; + struct { + pipe_index_t head; + pipe_index_t tail; + }; + }; + unsigned int max_usage; unsigned int ring_size; unsigned int nr_accounted; base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd -- 2.34.1 ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex 2025-03-04 13:51 ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak @ 2025-03-04 18:36 ` Alexey Gladkov 2025-03-04 19:03 ` Linus Torvalds 1 sibling, 0 replies; 109+ messages in thread From: Alexey Gladkov @ 2025-03-04 18:36 UTC (permalink / raw) To: K Prateek Nayak Cc: Alexander Viro, Christian Brauner, Linus Torvalds, Oleg Nesterov, Swapnil Sapkal, linux-fsdevel, linux-kernel, Jan Kara, Mateusz Guzik, Manfred Spraul, David Howells, WangYuli, Hillf Danton, Gautham R. Shenoy, Neeraj.Upadhyay, Ananth.narayan On Tue, Mar 04, 2025 at 01:51:38PM +0000, K Prateek Nayak wrote: > From: Linus Torvalds <torvalds@linux-foundation.org> > > pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" > and "pipe->tail" outside of "pipe->mutex" critical section. When the > head and the tail are read individually in that order, there is a window > for interruption between the two reads in which both the head and the > tail can be updated by concurrent readers and writers. > > One of the problematic scenarios observed with hackbench running > multiple groups on a large server on a particular pipe inode is as > follows: > > pipe->head = 36 > pipe->tail = 36 > > hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full* > hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36] > hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740* > hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768* > > hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up* > hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37] > ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces) > > hackbench-118740 [057] ..... 
1029.550558: pipe_read: *reader wakes up: pipe is not empty* > hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37] > hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768* > hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps* > > hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in* > hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37] > hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766* > > hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty* > hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38] > hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty* > hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768* > > ... CPU 206 switches back to writer > hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38] > hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes > hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep* > > [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] > > The unsigned arithmetic in pipe_occupancy() wraps around when > "pipe->tail > pipe->head" leading to pipe_full() returning true despite > the pipe being empty. > > The case of genuine wraparound of "pipe->head" is handled since pipe > buffer has data allowing readers to make progress until the pipe->tail > wraps too after which the reader will wakeup a sleeping writer, however, > mistaking the pipe to be full when it is in fact empty can lead to > readers and writers waiting on each other indefinitely. 
> > This issue became more problematic and surfaced as a hang in hackbench > after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up > the writer if the pipe is still full") significantly reduced the number > of spurious wakeups of writers that had previously helped mask the > issue. > > To avoid missing any updates between the reads of "pipe->head" and > "pipe->write", unionize the two with a single unsigned long > "pipe->head_tail" member that can be loaded atomically. > > Using "pipe->head_tail" to read the head and the tail ensures the > lockless checks do not miss any updates to the head or the tail and > since those two are only updated under "pipe->mutex", it ensures that > the head is always ahead of, or equal to the tail resulting in correct > calculations. > > [ prateek: commit log, testing on x86 platforms. ] > > Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com> > Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/ > Reported-by: Alexey Gladkov <legion@kernel.org> > Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/ > Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length") > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> > Reviewed-by: Oleg Nesterov <oleg@redhat.com> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> With this patch, I'm also not reproducing the problem. Thanks! > --- > Changes are based on: > > git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe > > at commit commit ee5eda8ea595 ("pipe: change pipe_write() to never add a > zero-sized buffer") but also applies cleanly on top of v6.14-rc5. 
> > The diff from Linus is kept as is except for removing the whitespaces in > front of the typedef that checkpatch complained about (the warning on > usage of typedef itself has been ignored since I could not think of a > better alternative other than #ifdef hackery in pipe_inode_info and the > newly introduced pipe_index union.) and the suggestion from Oleg to > explicitly initialize the "head_tail" with: > > union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) } > > I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for > the ring, not cursor and length") for the "Fixes:" tag since pipe_poll() > added: > > unsigned int head = READ_ONCE(pipe->head); > unsigned int tail = READ_ONCE(pipe->tail); > > poll_wait(filp, &pipe->wait, wait); > > BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size); > > and the race described can trigger that BUG_ON() but as Linus pointed > out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for > basic pipe IO") is probably the one that can cause the writers to > sleep on empty pipe since the pipe_wait() used prior did not drop the > pipe lock until it called schedule() and prepare_to_wait() was called > before pipe_unlock() ensuring no races. > > [1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/ > > Please let me know if the patch requires any modifications and I'll jump > right on it. The changes have been tested on both a 5th Generation AMD > EPYC system and on a dual socket Intel Emerald Rapids system with > multiple thousand iterations of hackbench for small and large loop > counts. Thanks a ton to Swapnil for all the help. 
> --- > fs/pipe.c | 19 ++++++++----------- > include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++-- > 2 files changed, 45 insertions(+), 13 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index b0641f75b1ba..780990f307ab 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { > /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ > static inline bool pipe_readable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > + union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; > unsigned int writers = READ_ONCE(pipe->writers); > > - return !pipe_empty(head, tail) || !writers; > + return !pipe_empty(idx.head, idx.tail) || !writers; > } > > static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe, > @@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file) > /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ > static inline bool pipe_writable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > + union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; > unsigned int max_usage = READ_ONCE(pipe->max_usage); > > - return !pipe_full(head, tail, max_usage) || > + return !pipe_full(idx.head, idx.tail, max_usage) || > !READ_ONCE(pipe->readers); > } > > @@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait) > { > __poll_t mask; > struct pipe_inode_info *pipe = filp->private_data; > - unsigned int head, tail; > + union pipe_index idx; > > /* Epoll has some historical nasty semantics, this enables them */ > WRITE_ONCE(pipe->poll_usage, true); > @@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait) > * if something changes and you got it wrong, the poll > * table entry will wake you up and fix 
it. > */ > - head = READ_ONCE(pipe->head); > - tail = READ_ONCE(pipe->tail); > + idx.head_tail = READ_ONCE(pipe->head_tail); > > mask = 0; > if (filp->f_mode & FMODE_READ) { > - if (!pipe_empty(head, tail)) > + if (!pipe_empty(idx.head, idx.tail)) > mask |= EPOLLIN | EPOLLRDNORM; > if (!pipe->writers && filp->f_pipe != pipe->w_counter) > mask |= EPOLLHUP; > } > > if (filp->f_mode & FMODE_WRITE) { > - if (!pipe_full(head, tail, pipe->max_usage)) > + if (!pipe_full(idx.head, idx.tail, pipe->max_usage)) > mask |= EPOLLOUT | EPOLLWRNORM; > /* > * Most Unices do not set EPOLLERR for FIFOs but on Linux they > diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h > index 8ff23bf5a819..3cc4f8eab853 100644 > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -31,6 +31,33 @@ struct pipe_buffer { > unsigned long private; > }; > > +/* > + * Really only alpha needs 32-bit fields, but > + * might as well do it for 64-bit architectures > + * since that's what we've historically done, > + * and it makes 'head_tail' always be a simple > + * 'unsigned long'. > + */ > +#ifdef CONFIG_64BIT > +typedef unsigned int pipe_index_t; > +#else > +typedef unsigned short pipe_index_t; > +#endif > + > +/* > + * We have to declare this outside 'struct pipe_inode_info', > + * but then we can't use 'union pipe_index' for an anonymous > + * union, so we end up having to duplicate this declaration > + * below. Annoying. 
> + */ > +union pipe_index { > + unsigned long head_tail; > + struct { > + pipe_index_t head; > + pipe_index_t tail; > + }; > +}; > + > /** > * struct pipe_inode_info - a linux kernel pipe > * @mutex: mutex protecting the whole thing > @@ -58,8 +85,16 @@ struct pipe_buffer { > struct pipe_inode_info { > struct mutex mutex; > wait_queue_head_t rd_wait, wr_wait; > - unsigned int head; > - unsigned int tail; > + > + /* This has to match the 'union pipe_index' above */ > + union { > + unsigned long head_tail; > + struct { > + pipe_index_t head; > + pipe_index_t tail; > + }; > + }; > + > unsigned int max_usage; > unsigned int ring_size; > unsigned int nr_accounted; > > base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd > -- > 2.34.1 > -- Rgrds, legion ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex 2025-03-04 13:51 ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak 2025-03-04 18:36 ` Alexey Gladkov @ 2025-03-04 19:03 ` Linus Torvalds 1 sibling, 0 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-04 19:03 UTC (permalink / raw) To: K Prateek Nayak Cc: Alexander Viro, Christian Brauner, Oleg Nesterov, Swapnil Sapkal, Alexey Gladkov, linux-fsdevel, linux-kernel, Jan Kara, Mateusz Guzik, Manfred Spraul, David Howells, WangYuli, Hillf Danton, Gautham R. Shenoy, Neeraj.Upadhyay, Ananth.narayan On Tue, 4 Mar 2025 at 03:52, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head" > and "pipe->tail" outside of "pipe->mutex" critical section. [...] Thanks, applied. Linus
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 20:46 ` Linus Torvalds ` (2 preceding siblings ...) 2025-03-04 13:51 ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak @ 2025-03-05 15:31 ` Rasmus Villemoes 2025-03-05 16:50 ` Linus Torvalds 2025-03-05 16:40 ` Linus Torvalds 4 siblings, 1 reply; 109+ messages in thread From: Rasmus Villemoes @ 2025-03-05 15:31 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan, Matthew Wilcox On Mon, Mar 03 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 3 Mar 2025 at 10:28, Oleg Nesterov <oleg@redhat.com> wrote: >> >> Stupid question... but do we really need to change the code which update >> tail/head if we pack them into a single word? > > No. It's only the READ_ONCE() parts that need changing. > > See this suggested patch, which does something very similar to what > you were thinking of. > > +/* > + * We have to declare this outside 'struct pipe_inode_info', > + * but then we can't use 'union pipe_index' for an anonymous > + * union, so we end up having to duplicate this declaration > + * below. Annoying. > + */ > +union pipe_index { > + unsigned long head_tail; > + struct { > + pipe_index_t head; > + pipe_index_t tail; > + }; > +}; > + -fms-extensions ? Willy wanted to add that for use in mm/ some years ago [*], and it has come up a few other times as well. [*] https://lore.kernel.org/lkml/20180419152817.GD25406@bombadil.infradead.org/ Rasmus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 15:31 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes @ 2025-03-05 16:50 ` Linus Torvalds 2025-03-06 9:48 ` Rasmus Villemoes 0 siblings, 1 reply; 109+ messages in thread From: Linus Torvalds @ 2025-03-05 16:50 UTC (permalink / raw) To: Rasmus Villemoes Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan, Matthew Wilcox On Wed, 5 Mar 2025 at 05:31, Rasmus Villemoes <ravi@prevas.dk> wrote: > > On Mon, Mar 03 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > +/* > > + * We have to declare this outside 'struct pipe_inode_info', > > + * but then we can't use 'union pipe_index' for an anonymous > > + * union, so we end up having to duplicate this declaration > > + * below. Annoying. > > + */ > > +union pipe_index { > > + unsigned long head_tail; > > + struct { > > + pipe_index_t head; > > + pipe_index_t tail; > > + }; > > +}; > > + > > -fms-extensions ? Willy wanted to add that for use in mm/ some years ago > [*], and it has come up a few other times as well. > > [*] https://lore.kernel.org/lkml/20180419152817.GD25406@bombadil.infradead.org/ Oh, I was unaware of that extension, and yes, it would have been lovely here, avoiding that duplicate union declaration. But it does require clang support - I see that clang has a '-fms-extensions' as well, so it's presumably there. I don't know if it's worth it for the (small handful) of cases we'd have in the kernel, but it does seem useful. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 16:50 ` Linus Torvalds @ 2025-03-06 9:48 ` Rasmus Villemoes 2025-03-06 14:42 ` Rasmus Villemoes 0 siblings, 1 reply; 109+ messages in thread From: Rasmus Villemoes @ 2025-03-06 9:48 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan, Matthew Wilcox, Nick Desaulniers On Wed, Mar 05 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 5 Mar 2025 at 05:31, Rasmus Villemoes <ravi@prevas.dk> wrote: >> >> On Mon, Mar 03 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> > +/* >> > + * We have to declare this outside 'struct pipe_inode_info', >> > + * but then we can't use 'union pipe_index' for an anonymous >> > + * union, so we end up having to duplicate this declaration >> > + * below. Annoying. >> > + */ >> > +union pipe_index { >> > + unsigned long head_tail; >> > + struct { >> > + pipe_index_t head; >> > + pipe_index_t tail; >> > + }; >> > +}; >> > + >> >> -fms-extensions ? Willy wanted to add that for use in mm/ some years ago >> [*], and it has come up a few other times as well. >> >> [*] https://lore.kernel.org/lkml/20180419152817.GD25406@bombadil.infradead.org/ > > Oh, I was unaware of that extension, and yes, it would have been > lovely here, avoiding that duplicate union declaration. > > But it does require clang support - I see that clang has a > '-fms-extensions' as well, so it's presumably there. 
Yes, it seems they do have it, but for mysterious reasons saying -fms-extensions is not quite enough to convince clang that one does intend to use that MS extension, one also has to say -Wno-microsoft-anon-tag, or it complains warning: anonymous unions are a Microsoft extension [-Wmicrosoft-anon-tag] Also, the warning text is somewhat misleading; anon unions themselves have certainly been a gcc extension since forever, and nowadays a C11 thing, and clang has a separate -Wpedantic warning for that when using -std=c99: warning: anonymous unions are a C11 extension [-Wc11-extensions] The -W flag name actually suggests an improvement to the warning "_tagged_ anonymous unions are a Microsoft extension", but I really wonder why -fms-extensions isn't sufficient to silence that in the first place. Also, the warning seems to be on by default; it's not some -Wextra or -Wpedantic thing. cc += Nick Rasmus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-06 9:48 ` Rasmus Villemoes @ 2025-03-06 14:42 ` Rasmus Villemoes 0 siblings, 0 replies; 109+ messages in thread From: Rasmus Villemoes @ 2025-03-06 14:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-fsdevel, linux-kernel, Nick Desaulniers On Thu, Mar 06 2025, Rasmus Villemoes <ravi@prevas.dk> wrote: > On Wed, Mar 05 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Wed, 5 Mar 2025 at 05:31, Rasmus Villemoes <ravi@prevas.dk> wrote: >>> >>> On Mon, Mar 03 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> >>> > +/* >>> > + * We have to declare this outside 'struct pipe_inode_info', >>> > + * but then we can't use 'union pipe_index' for an anonymous >>> > + * union, so we end up having to duplicate this declaration >>> > + * below. Annoying. >>> > + */ >>> > +union pipe_index { >>> > + unsigned long head_tail; >>> > + struct { >>> > + pipe_index_t head; >>> > + pipe_index_t tail; >>> > + }; >>> > +}; >>> > + >>> >>> -fms-extensions ? Willy wanted to add that for use in mm/ some years ago >>> [*], and it has come up a few other times as well. >>> >>> [*] https://lore.kernel.org/lkml/20180419152817.GD25406@bombadil.infradead.org/ >> >> Oh, I was unaware of that extension, and yes, it would have been >> lovely here, avoiding that duplicate union declaration. >> >> But it does require clang support - I see that clang has a >> '-fms-extensions' as well, so it's presumably there. 
> > Yes, it seems they do have it, but for mysterious reasons saying > -fms-extensions is not quite enough to convince clang that one does > intend to use that MS extension, one also has to say > -Wno-microsoft-anon-tag, or it complains > > warning: anonymous unions are a Microsoft extension [-Wmicrosoft-anon-tag] > > Also, the warning text is somewhat misleading; anon unions itself have > certainly been a gcc extension since forever, and nowadays a C11 thing, > and clang has a separate -Wpedantic warning for that when using > -std=c99: > > warning: anonymous unions are a C11 extension [-Wc11-extensions] > > The -W flag name actually suggests an improvement to the warning > "_tagged_ anonymous unions are a Microsoft extension", but I really > wonder why -fms-extensions isn't sufficient to silence that in the first > place. Also, the warning seems to be on by default; it's not some > -Wextra or -Wpedantic thing. > > cc += Nick Gah, sorry, I wasn't aware that address didn't work anymore. So Cc -= everything but the lists and Cc += Nick for real this time, hopefully. Rasmus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 20:46 ` Linus Torvalds ` (3 preceding siblings ...) 2025-03-05 15:31 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes @ 2025-03-05 16:40 ` Linus Torvalds 2025-03-06 8:35 ` Rasmus Villemoes ` (2 more replies) 4 siblings, 3 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-05 16:40 UTC (permalink / raw) To: Oleg Nesterov Cc: Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Mon, 3 Mar 2025 at 10:46, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > ENTIRELY UNTESTED, but it seems to generate ok code. It might even > generate better code than what we have now. Bah. This patch - which is now committed - was actually completely broken. And the reason that complete breakage didn't show up in testing is that I suspect nobody really tested or thought about the 32-bit case. That whole "use 16-bit indexes on 32-bit" is all fine and well, but I woke up in the middle of the night and realized that it doesn't actually work. Because now "pipe_occupancy()" is getting *entirely* the wrong answers. It just does return head - tail; but that only worked when the arithmetic was done modulo the size of the indexes. And now it isn't. So I still haven't *tested* this, but at an absolute minimum, we need something like this: --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -192,7 +192,7 @@ */ static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail) { - return head - tail; + return (pipe_index_t)(head - tail); } /** and there might be other cases where the pipe_index_t size might matter. For example, we should add a check to pipe_resize_ring() that the new size is smaller than the index size. 
Yes, in practice 'pipe_max_size' already ends up being that limit (the value is 256 pages), even for 16-bit indices, but we should do this properly. And then, *while* looking at this, I also noticed that we had a very much related bug in this area that was pre-existing and not related to the 16-bit change: pipe_discard_from() is doing the wrong thing for overflows even in the old 'unsigned int' type, and the whole while (pipe->head > old_head) is bogus, because 'pipe->head' may have wrapped around, so the whole "is it bigger" test doesn't work like that at all. Of course, in practice it never hits (and would only hit more easily with the new 16-bit thing), but it's very very wrong and can result in a memory leak. Are there other cases like this? I don't know. I've been looking around a bit, but those were the only ones I found immediately when I started thinking about the whole wrap-around issue. I'd love it if other people tried to think about this too (and maybe even test the 32-bit case - gasp!) Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 16:40 ` Linus Torvalds @ 2025-03-06 8:35 ` Rasmus Villemoes 2025-03-06 17:59 ` Linus Torvalds 2025-03-06 9:28 ` Rasmus Villemoes 2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak 2 siblings, 1 reply; 109+ messages in thread From: Rasmus Villemoes @ 2025-03-06 8:35 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Wed, Mar 05 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 3 Mar 2025 at 10:46, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> ENTIRELY UNTESTED, but it seems to generate ok code. It might even >> generate better code than what we have now. > > Bah. This patch - which is now committed - was actually completely broken. > > And the reason that complete breakage didn't show up in testing is > that I suspect nobody really tested or thought about the 32-bit case. > > That whole "use 16-bit indexes on 32-bit" is all fine and well, but I > woke up in the middle of the night and realized that it doesn't > actually work. > > Because now "pipe_occupancy()" is getting *entirely* the wrong > answers. It just does > > return head - tail; > > but that only worked when the arithmetic was done modulo the size of > the indexes. And now it isn't. > > So I still haven't *tested* this, but at an absolute minimum, we need > something like this: > > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -192,7 +192,7 @@ > */ > static inline unsigned int pipe_occupancy(unsigned int head, > unsigned int tail) > { > - return head - tail; > + return (pipe_index_t)(head - tail); > } > > /** > > and there might be other cases where the pipe_index_t size might matter. 
Yeah, for example unsigned int count, head, tail, mask; case FIONREAD: mutex_lock(&pipe->mutex); count = 0; head = pipe->head; tail = pipe->tail; mask = pipe->ring_size - 1; while (tail != head) { count += pipe->bufs[tail & mask].len; tail++; } mutex_unlock(&pipe->mutex); If head has already wrapped around, say it's 0 or 1, and tail is close to 65535, that loop is gonna take forever and of course produce the wrong result. So yes, there are probably a lot more of these lurking. There are probably not many tests that stuff 2^28 bytes through a pipe to try to trigger such corner cases. Perhaps we can help whatever automated tests are being done by initializing head and tail to something like (pipe_index_t)-2 when the pipe is created? Rasmus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-06 8:35 ` Rasmus Villemoes @ 2025-03-06 17:59 ` Linus Torvalds 0 siblings, 0 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-06 17:59 UTC (permalink / raw) To: Rasmus Villemoes Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Wed, 5 Mar 2025 at 22:35, Rasmus Villemoes <ravi@prevas.dk> wrote: > > On Wed, Mar 05 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > and there might be other cases where the pipe_index_t size might matter. > > Yeah, for example > > unsigned int count, head, tail, mask; > > case FIONREAD: Thanks. I've hopefully fixed this (and the FUSE issue you also reported), and those should work correctly now on 32-bit. Knock wood. Mind taking a look and double-checking the fixes? Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 16:40 ` Linus Torvalds 2025-03-06 8:35 ` Rasmus Villemoes @ 2025-03-06 9:28 ` Rasmus Villemoes 2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak 2 siblings, 0 replies; 109+ messages in thread From: Rasmus Villemoes @ 2025-03-06 9:28 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Mateusz Guzik, K Prateek Nayak, Sapkal, Swapnil, Manfred Spraul, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan On Wed, Mar 05 2025, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Are there other cases like this? I don't know. I've been looking > around a bit, but those were the only ones I found immediately when I > started thinking about the whole wrap-around issue. Without too much brainpower spent analyzing each case (i.e., some of these might actually be ok), I found these: fs/fuse/dev.c fuse_dev_splice_write() unsigned int head, tail, mask, count; pipe_lock(pipe); head = pipe->head; tail = pipe->tail; mask = pipe->ring_size - 1; count = head - tail; Open-coded pipe_occupancy(), so would be fixed by using that with your fixup. A bit later in same function there's the same FIONREAD pattern: for (idx = tail; idx != head && rem < len; idx++) rem += pipe->bufs[idx & mask].len; fs/pipe.c We have pipe_update_tail() getting and returning an "unsigned int", and letting the compiler truncate the result written to pipe->tail: pipe->tail = ++tail; return tail; pipe_update_tail() only has one caller, but a rather important one, pipe_read(), which uses the return value from pipe_update_tail as-is tail = pipe_update_tail(pipe, buf, tail); } total_len -= chars; if (!total_len) break; /* common path: read succeeded */ if (!pipe_empty(head, tail)) /* More to do? 
*/ continue; and pipe_empty() takes two "unsigned ints" and is just head==tail -- so if tail was incremented to 65536 while head is 0 that would break. Probably pipe_empty() should either take pipe_index_t arguments or cast to that internally, just as pipe_occupancy. Or, as pipe_full(), being spelled in terms of pipe_occupancy()==0. With that fixed, maybe one could spell the FIONREAD-like patterns using pipe_empty(), i.e. using pipe_empty() to ask "have this tail index now caught up to this head index". So "idx != head" above would become "!pipe_empty(idx, head)". Rasmus ^ permalink raw reply [flat|nested] 109+ messages in thread
* [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short 2025-03-05 16:40 ` Linus Torvalds 2025-03-06 8:35 ` Rasmus Villemoes 2025-03-06 9:28 ` Rasmus Villemoes @ 2025-03-06 11:39 ` K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak ` (2 more replies) 2 siblings, 3 replies; 109+ messages in thread From: K Prateek Nayak @ 2025-03-06 11:39 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm Cc: Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal, K Prateek Nayak Here is an attempt at converting pipe->{head,tail} to unsigned short members. All local variables storing the head and the tail have been modified to unsigned short too (to the best of my knowledge). pipe_resize_ring() now has a check to make sure nr_slots can be contained within the limits of pipe->{head,tail}. Building on that, pipe->{max_usage,ring_size} were also converted to unsigned short to catch any cases of incorrect unsigned arithmetic. This has been tested for a few hours with anon pipes on a 5th Generation AMD EPYC System and on a dual socket Intel Granite Rapids system without experiencing any obvious issues. pipe_write() was tagged with a debug trace_printk() on one of the test machines to make sure the head has indeed wrapped around behind the tail to ensure the wraparound scenarios are indeed happening. A few pipe_occupancy() and pipe->max_usage based checks have been converted to use unsigned short based arithmetic in fs/fuse/dev.c, fs/splice.c, mm/filemap.c, and mm/shmem.c. A few of the observations from Rasmus on a parallel thread [1] have been folded into Patch 3 (thanks a ton for chasing them). More eyes and testing are greatly appreciated. If my tests run into any issues, I'll report back on this thread. 
Series was tested with: hackbench -g 16 -f 20 --threads --pipe -l 10000000 -s 100 # Warp around stress-ng --oom-pipe 128 --oom-pipe-ops 100000 -t 600s # pipe resize stress-ng --splice 128 --splice-ops 100000000 -t 600s # splice stress-ng --vm-splice 128 --vm-splice-ops 100000000 -t 600s # splice stress-ng --tee 128 --tee-ops 100000000 -t 600s stress-ng --zlib 128 --zlib-ops 1000000 -t 600s stress-ng --sigpipe 128 -t 60s stress-ng did not report any failure in my testing. [1] https://lore.kernel.org/all/87cyeu5zgk.fsf@prevas.dk/ -- K Prateek Nayak (3): fs/pipe: Limit the slots in pipe_resize_ring() fs/splice: Atomically read pipe->{head,tail} in opipe_prep() treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short fs/fuse/dev.c | 4 +++- fs/pipe.c | 33 +++++++++++++++----------- fs/splice.c | 50 ++++++++++++++++++++------------------- include/linux/pipe_fs_i.h | 39 ++++++++++-------------------- kernel/watch_queue.c | 3 ++- mm/filemap.c | 5 ++-- mm/shmem.c | 5 ++-- 7 files changed, 69 insertions(+), 70 deletions(-) base-commit: 848e076317446f9c663771ddec142d7c2eb4cb43 -- 2.43.0 ^ permalink raw reply [flat|nested] 109+ messages in thread
* [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() 2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak @ 2025-03-06 11:39 ` K Prateek Nayak 2025-03-06 12:28 ` Oleg Nesterov 2025-03-06 11:39 ` [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak 2 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-06 11:39 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm Cc: Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal, K Prateek Nayak Limit the number of slots in pipe_resize_ring() to the maximum value representable by pipe->{head,tail}. Values beyond the max limit can lead to incorrect pipe_occupancy() calculations where the pipe will never appear full. Since nr_slots is always a power of 2 and the maximum size of pipe_index_t is 32 bits, BIT() is sufficient to represent the maximum value possible for nr_slots. 
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> --- fs/pipe.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/fs/pipe.c b/fs/pipe.c index e8e6698f3698..3ca3103e1de7 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1272,6 +1272,10 @@ int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) struct pipe_buffer *bufs; unsigned int head, tail, mask, n; + /* nr_slots larger than limits of pipe->{head,tail} */ + if (unlikely(nr_slots > BIT(BITS_PER_TYPE(pipe_index_t) - 1))) + return -EINVAL; + bufs = kcalloc(nr_slots, sizeof(*bufs), GFP_KERNEL_ACCOUNT | __GFP_NOWARN); if (unlikely(!bufs)) base-commit: 848e076317446f9c663771ddec142d7c2eb4cb43 -- 2.43.0 ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() 2025-03-06 11:39 ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak @ 2025-03-06 12:28 ` Oleg Nesterov 2025-03-06 15:26 ` K Prateek Nayak 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-06 12:28 UTC (permalink / raw) To: K Prateek Nayak Cc: Linus Torvalds, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm, Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal On 03/06, K Prateek Nayak wrote: > > @@ -1272,6 +1272,10 @@ int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) > struct pipe_buffer *bufs; > unsigned int head, tail, mask, n; > > + /* nr_slots larger than limits of pipe->{head,tail} */ > + if (unlikely(nr_slots > BIT(BITS_PER_TYPE(pipe_index_t) - 1))) Hmm, perhaps if (nr_slots > (pipe_index_t)-1u) is more clear? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() 2025-03-06 12:28 ` Oleg Nesterov @ 2025-03-06 15:26 ` K Prateek Nayak 0 siblings, 0 replies; 109+ messages in thread From: K Prateek Nayak @ 2025-03-06 15:26 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm, Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal Hello Oleg, On 3/6/2025 5:58 PM, Oleg Nesterov wrote: > On 03/06, K Prateek Nayak wrote: >> >> @@ -1272,6 +1272,10 @@ int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) >> struct pipe_buffer *bufs; >> unsigned int head, tail, mask, n; >> >> + /* nr_slots larger than limits of pipe->{head,tail} */ >> + if (unlikely(nr_slots > BIT(BITS_PER_TYPE(pipe_index_t) - 1))) > > Hmm, perhaps > > if (nr_slots > (pipe_index_t)-1u) > > is more clear? Indeed it is. I didn't even know we could do that! Thank you for pointing it out. > > Oleg. > -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() 2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak @ 2025-03-06 11:39 ` K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak 2 siblings, 0 replies; 109+ messages in thread From: K Prateek Nayak @ 2025-03-06 11:39 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm Cc: Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal, K Prateek Nayak opipe_prep() checks pipe_full() before taking the "pipe->mutex". Use the newly introduced "pipe->head_tail" member to read the head and the tail atomically and not miss any updates between the reads. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> --- fs/splice.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/splice.c b/fs/splice.c index 28cfa63aa236..e51f33aca032 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1682,13 +1682,14 @@ static int ipipe_prep(struct pipe_inode_info *pipe, unsigned int flags) */ static int opipe_prep(struct pipe_inode_info *pipe, unsigned int flags) { + union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; int ret; /* * Check pipe occupancy without the inode lock first. This function * is speculative anyways, so missing one is ok. */ - if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) + if (!pipe_full(idx.head, idx.tail, READ_ONCE(pipe->max_usage))) return 0; ret = 0; -- 2.43.0 ^ permalink raw reply related [flat|nested] 109+ messages in thread
* [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short 2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak 2025-03-06 11:39 ` [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() K Prateek Nayak @ 2025-03-06 11:39 ` K Prateek Nayak 2025-03-06 12:32 ` Oleg Nesterov 2 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-06 11:39 UTC (permalink / raw) To: Linus Torvalds, Oleg Nesterov, Miklos Szeredi, Alexander Viro, Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm Cc: Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy, Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal, K Prateek Nayak Use 16-bit head and tail to track the pipe buffer production and consumption. Since "pipe->max_usage" and "pipe->ring_size" must fall between the head and the tail limits, convert them to unsigned short as well to catch any cases of unsigned arithmetic going wrong. Parts of fs/fuse/dev.c, fs/splice.c, mm/filemap.c, and mm/shmem.c were touched to accommodate the "unsigned short" based calculations of pipe_occupancy(). pipe->tail is always incremented with both "pipe->mutex" and "pipe->rd_wait.lock" held for pipes with a watch queue. pipe_write() exits early if the pipe has a watch queue but otherwise takes the "pipe->mutex" before updating pipe->head. post_one_notification() holds the "pipe->rd_wait.lock" when updating pipe->head. Updates to "pipe->head" and "pipe->tail" are always mutually exclusive, either guarded by "pipe->mutex" or by "pipe->rd_wait.lock". Even RMW updates to the 16-bit fields should be safe because of those synchronization primitives on architectures that cannot do an atomic 16-bit store. 
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> --- fs/fuse/dev.c | 7 +++--- fs/pipe.c | 31 +++++++++++++------------- fs/splice.c | 47 ++++++++++++++++++++------------------- include/linux/pipe_fs_i.h | 39 +++++++++++--------------------- kernel/watch_queue.c | 3 ++- mm/filemap.c | 5 +++-- mm/shmem.c | 5 +++-- 7 files changed, 65 insertions(+), 72 deletions(-) diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c index 2b2d1b755544..993e6dc24de1 100644 --- a/fs/fuse/dev.c +++ b/fs/fuse/dev.c @@ -1440,6 +1440,7 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos, int page_nr = 0; struct pipe_buffer *bufs; struct fuse_copy_state cs; + unsigned short free_slots; struct fuse_dev *fud = fuse_get_dev(in); if (!fud) @@ -1457,7 +1458,8 @@ static ssize_t fuse_dev_splice_read(struct file *in, loff_t *ppos, if (ret < 0) goto out; - if (pipe_occupancy(pipe->head, pipe->tail) + cs.nr_segs > pipe->max_usage) { + free_slots = pipe->max_usage - pipe_occupancy(pipe->head, pipe->tail); + if (free_slots < cs.nr_segs) { ret = -EIO; goto out; } @@ -2107,9 +2109,8 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe, struct file *out, loff_t *ppos, size_t len, unsigned int flags) { - unsigned int head, tail, mask, count; + unsigned short head, tail, mask, count, idx; unsigned nbuf; - unsigned idx; struct pipe_buffer *bufs; struct fuse_copy_state cs; struct fuse_dev *fud; diff --git a/fs/pipe.c b/fs/pipe.c index 3ca3103e1de7..b8d87eabff79 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -216,9 +216,9 @@ static inline bool pipe_readable(const struct pipe_inode_info *pipe) return !pipe_empty(idx.head, idx.tail) || !writers; } -static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe, - struct pipe_buffer *buf, - unsigned int tail) +static inline unsigned short pipe_update_tail(struct pipe_inode_info *pipe, + struct pipe_buffer *buf, + unsigned short tail) { pipe_buf_release(pipe, buf); @@ -272,9 +272,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter 
*to) */ for (;;) { /* Read ->head with a barrier vs post_one_notification() */ - unsigned int head = smp_load_acquire(&pipe->head); - unsigned int tail = pipe->tail; - unsigned int mask = pipe->ring_size - 1; + unsigned short head = smp_load_acquire(&pipe->head); + unsigned short tail = pipe->tail; + unsigned short mask = pipe->ring_size - 1; #ifdef CONFIG_WATCH_QUEUE if (pipe->note_loss) { @@ -417,7 +417,7 @@ static inline int is_packetized(struct file *file) static inline bool pipe_writable(const struct pipe_inode_info *pipe) { union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }; - unsigned int max_usage = READ_ONCE(pipe->max_usage); + unsigned short max_usage = READ_ONCE(pipe->max_usage); return !pipe_full(idx.head, idx.tail, max_usage) || !READ_ONCE(pipe->readers); @@ -428,7 +428,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) { struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - unsigned int head; + unsigned short head; ssize_t ret = 0; size_t total_len = iov_iter_count(from); ssize_t chars; @@ -471,7 +471,7 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) was_empty = pipe_empty(head, pipe->tail); chars = total_len & (PAGE_SIZE-1); if (chars && !was_empty) { - unsigned int mask = pipe->ring_size - 1; + unsigned short mask = pipe->ring_size - 1; struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask]; int offset = buf->offset + buf->len; @@ -614,7 +614,8 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from) static long pipe_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { struct pipe_inode_info *pipe = filp->private_data; - unsigned int count, head, tail, mask; + unsigned short head, tail, mask; + unsigned int count; switch (cmd) { case FIONREAD: @@ -1270,10 +1271,10 @@ unsigned int round_pipe_size(unsigned int size) int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) { struct pipe_buffer *bufs; - unsigned int head, tail, mask, n; + unsigned short 
head, tail, mask, n; /* nr_slots larger than limits of pipe->{head,tail} */ - if (unlikely(nr_slots > BIT(BITS_PER_TYPE(pipe_index_t) - 1))) + if (unlikely(nr_slots > USHRT_MAX)) return -EINVAL; bufs = kcalloc(nr_slots, sizeof(*bufs), @@ -1298,13 +1299,13 @@ int pipe_resize_ring(struct pipe_inode_info *pipe, unsigned int nr_slots) * and adjust the indices. */ if (n > 0) { - unsigned int h = head & mask; - unsigned int t = tail & mask; + unsigned short h = head & mask; + unsigned short t = tail & mask; if (h > t) { memcpy(bufs, pipe->bufs + t, n * sizeof(struct pipe_buffer)); } else { - unsigned int tsize = pipe->ring_size - t; + unsigned short tsize = pipe->ring_size - t; if (h > 0) memcpy(bufs + tsize, pipe->bufs, h * sizeof(struct pipe_buffer)); diff --git a/fs/splice.c b/fs/splice.c index e51f33aca032..891a7cf9fb55 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -198,9 +198,9 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe, struct splice_pipe_desc *spd) { unsigned int spd_pages = spd->nr_pages; - unsigned int tail = pipe->tail; - unsigned int head = pipe->head; - unsigned int mask = pipe->ring_size - 1; + unsigned short tail = pipe->tail; + unsigned short head = pipe->head; + unsigned short mask = pipe->ring_size - 1; ssize_t ret = 0; int page_nr = 0; @@ -245,9 +245,9 @@ EXPORT_SYMBOL_GPL(splice_to_pipe); ssize_t add_to_pipe(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - unsigned int head = pipe->head; - unsigned int tail = pipe->tail; - unsigned int mask = pipe->ring_size - 1; + unsigned short head = pipe->head; + unsigned short tail = pipe->tail; + unsigned short mask = pipe->ring_size - 1; int ret; if (unlikely(!pipe->readers)) { @@ -271,7 +271,7 @@ EXPORT_SYMBOL(add_to_pipe); */ int splice_grow_spd(const struct pipe_inode_info *pipe, struct splice_pipe_desc *spd) { - unsigned int max_usage = READ_ONCE(pipe->max_usage); + unsigned short max_usage = READ_ONCE(pipe->max_usage); spd->nr_pages_max = max_usage; if (max_usage <= PIPE_DEF_BUFFERS) 
@@ -327,12 +327,13 @@ ssize_t copy_splice_read(struct file *in, loff_t *ppos, struct kiocb kiocb; struct page **pages; ssize_t ret; - size_t used, npages, chunk, remain, keep = 0; + size_t npages, chunk, remain, keep = 0; + unsigned short used; int i; /* Work out how much data we can actually add into the pipe */ used = pipe_occupancy(pipe->head, pipe->tail); - npages = max_t(ssize_t, pipe->max_usage - used, 0); + npages = max_t(unsigned short, pipe->max_usage - used, 0); len = min_t(size_t, len, npages * PAGE_SIZE); npages = DIV_ROUND_UP(len, PAGE_SIZE); @@ -445,9 +446,9 @@ static void wakeup_pipe_writers(struct pipe_inode_info *pipe) static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_desc *sd, splice_actor *actor) { - unsigned int head = pipe->head; - unsigned int tail = pipe->tail; - unsigned int mask = pipe->ring_size - 1; + unsigned short head = pipe->head; + unsigned short tail = pipe->tail; + unsigned short mask = pipe->ring_size - 1; int ret; while (!pipe_empty(head, tail)) { @@ -494,8 +495,8 @@ static int splice_from_pipe_feed(struct pipe_inode_info *pipe, struct splice_des /* We know we have a pipe buffer, but maybe it's empty? 
*/ static inline bool eat_empty_buffer(struct pipe_inode_info *pipe) { - unsigned int tail = pipe->tail; - unsigned int mask = pipe->ring_size - 1; + unsigned short tail = pipe->tail; + unsigned short mask = pipe->ring_size - 1; struct pipe_buffer *buf = &pipe->bufs[tail & mask]; if (unlikely(!buf->len)) { @@ -690,7 +691,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out, while (sd.total_len) { struct kiocb kiocb; struct iov_iter from; - unsigned int head, tail, mask; + unsigned short head, tail, mask; size_t left; int n; @@ -809,7 +810,7 @@ ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out, pipe_lock(pipe); while (len > 0) { - unsigned int head, tail, mask, bc = 0; + unsigned short head, tail, mask, bc = 0; size_t remain = len; /* @@ -960,7 +961,7 @@ static ssize_t do_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) { - unsigned int p_space; + unsigned short p_space; if (unlikely(!(in->f_mode & FMODE_READ))) return -EBADF; @@ -1724,9 +1725,9 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe, size_t len, unsigned int flags) { struct pipe_buffer *ibuf, *obuf; - unsigned int i_head, o_head; - unsigned int i_tail, o_tail; - unsigned int i_mask, o_mask; + unsigned short i_head, o_head; + unsigned short i_tail, o_tail; + unsigned short i_mask, o_mask; int ret = 0; bool input_wakeup = false; @@ -1861,9 +1862,9 @@ static ssize_t link_pipe(struct pipe_inode_info *ipipe, size_t len, unsigned int flags) { struct pipe_buffer *ibuf, *obuf; - unsigned int i_head, o_head; - unsigned int i_tail, o_tail; - unsigned int i_mask, o_mask; + unsigned short i_head, o_head; + unsigned short i_tail, o_tail; + unsigned short i_mask, o_mask; ssize_t ret = 0; /* diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index e572e6fc4f81..0997c028548c 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -31,19 +31,6 @@ struct pipe_buffer { unsigned 
long private; }; -/* - * Really only alpha needs 32-bit fields, but - * might as well do it for 64-bit architectures - * since that's what we've historically done, - * and it makes 'head_tail' always be a simple - * 'unsigned long'. - */ -#ifdef CONFIG_64BIT -typedef unsigned int pipe_index_t; -#else -typedef unsigned short pipe_index_t; -#endif - /* * We have to declare this outside 'struct pipe_inode_info', * but then we can't use 'union pipe_index' for an anonymous @@ -51,10 +38,10 @@ typedef unsigned short pipe_index_t; * below. Annoying. */ union pipe_index { - unsigned long head_tail; + unsigned int head_tail; struct { - pipe_index_t head; - pipe_index_t tail; + unsigned short head; + unsigned short tail; }; }; @@ -89,15 +76,15 @@ struct pipe_inode_info { /* This has to match the 'union pipe_index' above */ union { - unsigned long head_tail; + unsigned int head_tail; struct { - pipe_index_t head; - pipe_index_t tail; + unsigned short head; + unsigned short tail; }; }; - unsigned int max_usage; - unsigned int ring_size; + unsigned short max_usage; + unsigned short ring_size; unsigned int nr_accounted; unsigned int readers; unsigned int writers; @@ -181,7 +168,7 @@ static inline bool pipe_has_watch_queue(const struct pipe_inode_info *pipe) * @head: The pipe ring head pointer * @tail: The pipe ring tail pointer */ -static inline bool pipe_empty(unsigned int head, unsigned int tail) +static inline bool pipe_empty(unsigned short head, unsigned short tail) { return head == tail; } @@ -191,9 +178,9 @@ static inline bool pipe_empty(unsigned int head, unsigned int tail) * @head: The pipe ring head pointer * @tail: The pipe ring tail pointer */ -static inline unsigned int pipe_occupancy(unsigned int head, unsigned int tail) +static inline unsigned short pipe_occupancy(unsigned short head, unsigned short tail) { - return (pipe_index_t)(head - tail); + return head - tail; } /** @@ -202,8 +189,8 @@ static inline unsigned int pipe_occupancy(unsigned int head, unsigned int 
tail) * @tail: The pipe ring tail pointer * @limit: The maximum amount of slots available. */ -static inline bool pipe_full(unsigned int head, unsigned int tail, - unsigned int limit) +static inline bool pipe_full(unsigned short head, unsigned short tail, + unsigned short limit) { return pipe_occupancy(head, tail) >= limit; } diff --git a/kernel/watch_queue.c b/kernel/watch_queue.c index 5267adeaa403..c76cfebf46c8 100644 --- a/kernel/watch_queue.c +++ b/kernel/watch_queue.c @@ -101,7 +101,8 @@ static bool post_one_notification(struct watch_queue *wqueue, struct pipe_inode_info *pipe = wqueue->pipe; struct pipe_buffer *buf; struct page *page; - unsigned int head, tail, mask, note, offset, len; + unsigned short head, tail, mask; + unsigned int note, offset, len; bool done = false; spin_lock_irq(&pipe->rd_wait.lock); diff --git a/mm/filemap.c b/mm/filemap.c index d4564a79eb35..6007b2403471 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2943,9 +2943,10 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos, { struct folio_batch fbatch; struct kiocb iocb; - size_t total_spliced = 0, used, npages; + size_t total_spliced = 0, npages; loff_t isize, end_offset; bool writably_mapped; + unsigned short used; int i, error = 0; if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes)) @@ -2956,7 +2957,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos, /* Work out how much data we can actually add into the pipe */ used = pipe_occupancy(pipe->head, pipe->tail); - npages = max_t(ssize_t, pipe->max_usage - used, 0); + npages = max_t(unsigned short, pipe->max_usage - used, 0); len = min_t(size_t, len, npages * PAGE_SIZE); folio_batch_init(&fbatch); diff --git a/mm/shmem.c b/mm/shmem.c index 4ea6109a8043..339084e5a8a1 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3509,13 +3509,14 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos, struct inode *inode = file_inode(in); struct address_space *mapping = inode->i_mapping; struct folio *folio = 
NULL; - size_t total_spliced = 0, used, npages, n, part; + size_t total_spliced = 0, npages, n, part; + unsigned short used; loff_t isize; int error = 0; /* Work out how much data we can actually add into the pipe */ used = pipe_occupancy(pipe->head, pipe->tail); - npages = max_t(ssize_t, pipe->max_usage - used, 0); + npages = max_t(unsigned short, pipe->max_usage - used, 0); len = min_t(size_t, len, npages * PAGE_SIZE); do { -- 2.43.0 ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short
2025-03-06 11:39 ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak
@ 2025-03-06 12:32 ` Oleg Nesterov
2025-03-06 12:41 ` Oleg Nesterov
2025-03-06 14:27 ` Rasmus Villemoes
0 siblings, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-03-06 12:32 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Linus Torvalds, Miklos Szeredi, Alexander Viro, Christian Brauner,
    Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm,
    Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy,
    Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal

On 03/06, K Prateek Nayak wrote:
>
> @@ -272,9 +272,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>  	 */
>  	for (;;) {
>  		/* Read ->head with a barrier vs post_one_notification() */
> -		unsigned int head = smp_load_acquire(&pipe->head);
> -		unsigned int tail = pipe->tail;
> -		unsigned int mask = pipe->ring_size - 1;
> +		unsigned short head = smp_load_acquire(&pipe->head);
> +		unsigned short tail = pipe->tail;
> +		unsigned short mask = pipe->ring_size - 1;

I dunno... but if we do this, perhaps we should
s/unsigned int/pipe_index_t instead?

At least this would be more grep friendly.

Oleg.
* Re: [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short
2025-03-06 12:32 ` Oleg Nesterov
@ 2025-03-06 12:41 ` Oleg Nesterov
2025-03-06 15:33 ` K Prateek Nayak
1 sibling, 1 reply; 109+ messages in thread
From: Oleg Nesterov @ 2025-03-06 12:41 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Linus Torvalds, Miklos Szeredi, Alexander Viro, Christian Brauner,
    Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm,
    Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy,
    Rasmus Villemoes, Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal

On 03/06, Oleg Nesterov wrote:
>
> On 03/06, K Prateek Nayak wrote:
> >
> > @@ -272,9 +272,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
> >  	 */
> >  	for (;;) {
> >  		/* Read ->head with a barrier vs post_one_notification() */
> > -		unsigned int head = smp_load_acquire(&pipe->head);
> > -		unsigned int tail = pipe->tail;
> > -		unsigned int mask = pipe->ring_size - 1;
> > +		unsigned short head = smp_load_acquire(&pipe->head);
> > +		unsigned short tail = pipe->tail;
> > +		unsigned short mask = pipe->ring_size - 1;
>
> I dunno... but if we do this, perhaps we should
> s/unsigned int/pipe_index_t instead?
>
> At least this would be more grep friendly.

In any case, I think another cleanup before this change makes sense:
pipe->ring_size is overused. pipe_read(), pipe_write() and many more
users do not need "unsigned int mask"; they can use pipe_buf(buf, slot)
instead.

Oleg.
* Re: [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short
2025-03-06 12:41 ` Oleg Nesterov
@ 2025-03-06 15:33 ` K Prateek Nayak
2025-03-06 18:04 ` Linus Torvalds
0 siblings, 1 reply; 109+ messages in thread
From: K Prateek Nayak @ 2025-03-06 15:33 UTC (permalink / raw)
To: Oleg Nesterov, Rasmus Villemoes
Cc: Linus Torvalds, Miklos Szeredi, Alexander Viro, Christian Brauner,
    Andrew Morton, Hugh Dickins, linux-fsdevel, linux-kernel, linux-mm,
    Jan Kara, Matthew Wilcox (Oracle), Mateusz Guzik, Gautham R. Shenoy,
    Neeraj.Upadhyay, Ananth.narayan, Swapnil Sapkal

Hello Oleg, Rasmus,

On 3/6/2025 6:11 PM, Oleg Nesterov wrote:
> On 03/06, Oleg Nesterov wrote:
>>
>> On 03/06, K Prateek Nayak wrote:
>>>
>>> @@ -272,9 +272,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>>>  	 */
>>>  	for (;;) {
>>>  		/* Read ->head with a barrier vs post_one_notification() */
>>> -		unsigned int head = smp_load_acquire(&pipe->head);
>>> -		unsigned int tail = pipe->tail;
>>> -		unsigned int mask = pipe->ring_size - 1;
>>> +		unsigned short head = smp_load_acquire(&pipe->head);
>>> +		unsigned short tail = pipe->tail;
>>> +		unsigned short mask = pipe->ring_size - 1;
>>
>> I dunno... but if we do this, perhaps we should
>> s/unsigned int/pipe_index_t instead?
>>
>> At least this would be more grep friendly.

Ack. I'll leave the typedef untouched and convert these to use
pipe_index_t. This was an experiment to see if anything breaks with u16
conversion, just to get more testing on that scenario. As Rasmus
mentioned, leaving the head and tail as u32 on 64bit will lead to
better code generation.

>
> In any case, I think another cleanup before this change makes sense:
> pipe->ring_size is overused. pipe_read(), pipe_write() and many more
> users do not need "unsigned int mask"; they can use pipe_buf(buf, slot)
> instead.

Ack. I'll add a cleanup patch ahead of this conversion. Thank you both
for taking a look.

>
> Oleg.
>

--
Thanks and Regards,
Prateek
* Re: [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short
2025-03-06 15:33 ` K Prateek Nayak
@ 2025-03-06 18:04 ` Linus Torvalds
0 siblings, 0 replies; 109+ messages in thread
From: Linus Torvalds @ 2025-03-06 18:04 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Oleg Nesterov, Rasmus Villemoes, Miklos Szeredi, Alexander Viro,
    Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel,
    linux-kernel, linux-mm, Jan Kara, Matthew Wilcox (Oracle),
    Mateusz Guzik, Gautham R. Shenoy, Neeraj.Upadhyay, Ananth.narayan,
    Swapnil Sapkal

On Thu, 6 Mar 2025 at 05:33, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> >>
> >> I dunno... but if we do this, perhaps we should
> >> s/unsigned int/pipe_index_t instead?
> >>
> >> At least this would be more grep friendly.
>
> Ack. I'll leave the typedef untouched and convert these to use
> pipe_index_t. This was an experiment to see if anything breaks with u16
> conversion, just to get more testing on that scenario. As Rasmus
> mentioned, leaving the head and tail as u32 on 64bit will lead to
> better code generation.

Yes, I was going to say the same - please don't change to 'unsigned
short'.

Judicious use of 'pipe_index_t' may be a good idea, but as I fixed some
issues Rasmus found, I was also looking at the generated code, and on at
least x86 where 16-bit generates extra instructions and prefixes, it
seems marginally better to treat the values as 32-bit, and then only do
the compares in 16 bits.

That only causes a few "movzwl" instructions (at load time), and then
the occasional "cmpw" (empty check) and "movw" (store) etc. But I only
did a very quick "let's look at a few cases of x86-64 also using a
16-bit pipe_index_t".

So for testing purposes your patch looks fine, but not as something to
apply.

If anything, I think we should actively try to remove as many direct
accesses to these pipe fields as humanly possible. As Oleg said, a lot
of them should just be cleaned up to use the helpers we already have.

Rasmus found a few cases of that already, like that FIONREAD case where
it was just doing a lot of open-coding of things that shouldn't be
open-coded. I've fixed the two cases he pointed at as obvious bugs, but
it would be good to see where else issues like this might lurk.

Linus
* Re: [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short
2025-03-06 12:32 ` Oleg Nesterov
2025-03-06 12:41 ` Oleg Nesterov
@ 2025-03-06 14:27 ` Rasmus Villemoes
1 sibling, 0 replies; 109+ messages in thread
From: Rasmus Villemoes @ 2025-03-06 14:27 UTC (permalink / raw)
To: Oleg Nesterov
Cc: K Prateek Nayak, Linus Torvalds, Miklos Szeredi, Alexander Viro,
    Christian Brauner, Andrew Morton, Hugh Dickins, linux-fsdevel,
    linux-kernel, linux-mm, Jan Kara, Matthew Wilcox (Oracle),
    Mateusz Guzik, Gautham R. Shenoy, Neeraj.Upadhyay, Ananth.narayan,
    Swapnil Sapkal

On Thu, Mar 06 2025, Oleg Nesterov <oleg@redhat.com> wrote:

> On 03/06, K Prateek Nayak wrote:
>>
>> @@ -272,9 +272,9 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>>  	 */
>>  	for (;;) {
>>  		/* Read ->head with a barrier vs post_one_notification() */
>> -		unsigned int head = smp_load_acquire(&pipe->head);
>> -		unsigned int tail = pipe->tail;
>> -		unsigned int mask = pipe->ring_size - 1;
>> +		unsigned short head = smp_load_acquire(&pipe->head);
>> +		unsigned short tail = pipe->tail;
>> +		unsigned short mask = pipe->ring_size - 1;
>
> I dunno... but if we do this, perhaps we should
> s/unsigned int/pipe_index_t instead?
>
> At least this would be more grep friendly.

Agreed.

Also, while using u16 on all arches may be good for now to make sure
everything is updated, it may also be that it ends up causing suboptimal
code gen for 64 bit architectures, so even if we do change pipe_index_t
now, perhaps we'd want to change it back to "half a ulong" at some point
in the future.

Rasmus
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 17:54 ` Mateusz Guzik 2025-03-03 18:11 ` Linus Torvalds @ 2025-03-03 18:32 ` K Prateek Nayak 2025-03-04 5:22 ` K Prateek Nayak 1 sibling, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-03 18:32 UTC (permalink / raw) To: Mateusz Guzik Cc: Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Linus Torvalds, Christian Brauner, David Howells, WangYuli, linux-fsdevel, linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay, Ananth.narayan Hello Mateusz, On 3/3/2025 11:24 PM, Mateusz Guzik wrote: > Can you guys try out the patch below? > > It changes things up so that there is no need to read 2 different vars. > > It is not the final version and I don't claim to be able to fully > justify the thing at the moment either, but I would like to know if it > fixes the problem. Happy to help! We've queued the below patch for an overnight run, will report back once it is done. Full disclaimer: We're testing on top of commit aaec5a95d596 ("pipe_read: don't wake up the writer if the pipe is still full") where the issue is more reproducible. I've replaced the VFS_BUG_ON() with a plain BUG_ON() based on [1] since v6.14-rc1 did not include the CONFIG_DEBUG_VFS bits. Hope that is alright. [1] https://lore.kernel.org/lkml/20250209185523.745956-2-mjguzik@gmail.com/ /off to get some shut eyes/ -- Thanks and Regards, Prateek > > If you don't have time that's fine, this is a quick jab. While I can't > reproduce the bug myself even after inserting a delay by hand with > msleep between the loads, I verified it does not outright break either. 
> :P > > diff --git a/fs/pipe.c b/fs/pipe.c > index 19a7948ab234..e61ad589fc2c 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -210,11 +210,21 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = { > /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ > static inline bool pipe_readable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > - unsigned int writers = READ_ONCE(pipe->writers); > + return !READ_ONCE(pipe->isempty) || !READ_ONCE(pipe->writers); > +} > + > +static inline void pipe_recalc_state(struct pipe_inode_info *pipe) > +{ > + pipe->isempty = pipe_empty(pipe->head, pipe->tail); > + pipe->isfull = pipe_full(pipe->head, pipe->tail, pipe->max_usage); > + VFS_BUG_ON(pipe->isempty && pipe->isfull); > +} > > - return !pipe_empty(head, tail) || !writers; > +static inline void pipe_update_head(struct pipe_inode_info *pipe, > + unsigned int head) > +{ > + pipe->head = ++head; > + pipe_recalc_state(pipe); > } > > static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe, > @@ -244,6 +254,7 @@ static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe, > * without the spinlock - the mutex is enough. 
> */ > pipe->tail = ++tail; > + pipe_recalc_state(pipe); > return tail; > } > > @@ -403,12 +414,7 @@ static inline int is_packetized(struct file *file) > /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ > static inline bool pipe_writable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > - unsigned int max_usage = READ_ONCE(pipe->max_usage); > - > - return !pipe_full(head, tail, max_usage) || > - !READ_ONCE(pipe->readers); > + return !READ_ONCE(pipe->isfull) || !READ_ONCE(pipe->readers); > } > > static ssize_t > @@ -512,7 +518,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > break; > } > > - pipe->head = head + 1; > + pipe_update_head(pipe, head); > pipe->tmp_page = NULL; > /* Insert it into the buffer array */ > buf = &pipe->bufs[head & mask]; > @@ -529,10 +535,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from) > > if (!iov_iter_count(from)) > break; > - } > > - if (!pipe_full(head, pipe->tail, pipe->max_usage)) > continue; > + } > > /* Wait for buffer space to become available. */ > if ((filp->f_flags & O_NONBLOCK) || > diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h > index 8ff23bf5a819..d4b7539399b5 100644 > --- a/include/linux/pipe_fs_i.h > +++ b/include/linux/pipe_fs_i.h > @@ -69,6 +69,8 @@ struct pipe_inode_info { > unsigned int r_counter; > unsigned int w_counter; > bool poll_usage; > + bool isempty; > + bool isfull; > #ifdef CONFIG_WATCH_QUEUE > bool note_loss; > #endif ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-03-03 18:32 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak
@ 2025-03-04 5:22 ` K Prateek Nayak
0 siblings, 0 replies; 109+ messages in thread
From: K Prateek Nayak @ 2025-03-04 5:22 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Sapkal, Swapnil, Oleg Nesterov, Manfred Spraul, Linus Torvalds,
    Christian Brauner, David Howells, WangYuli, linux-fsdevel,
    linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay,
    Ananth.narayan

Hello Mateusz,

On 3/4/2025 12:02 AM, K Prateek Nayak wrote:
> Hello Mateusz,
>
> On 3/3/2025 11:24 PM, Mateusz Guzik wrote:
>> Can you guys try out the patch below?
>>
>> It changes things up so that there is no need to read 2 different vars.
>>
>> It is not the final version and I don't claim to be able to fully
>> justify the thing at the moment either, but I would like to know if it
>> fixes the problem.
>
> Happy to help! We've queued the below patch for an overnight run, will
> report back once it is done.

Hackbench has been running for a few thousand iterations now without
experiencing any hangs yet with your changes.

>
> Full disclaimer: We're testing on top of commit aaec5a95d596
> ("pipe_read: don't wake up the writer if the pipe is still full") where the
> issue is more reproducible. I've replaced the VFS_BUG_ON() with a plain
> BUG_ON() based on [1] since v6.14-rc1 did not include the CONFIG_DEBUG_VFS
> bits. Hope that is alright.
>
> [1] https://lore.kernel.org/lkml/20250209185523.745956-2-mjguzik@gmail.com/
>
> /off to get some shut eyes/

--
Thanks and Regards,
Prateek
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
2025-03-03 9:46 ` Sapkal, Swapnil
2025-03-03 14:37 ` Mateusz Guzik
@ 2025-03-03 16:49 ` Oleg Nesterov
2025-03-04 5:06 ` Hillf Danton
2 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-03-03 16:49 UTC (permalink / raw)
To: Sapkal, Swapnil
Cc: K Prateek Nayak, Mateusz Guzik, Manfred Spraul, Linus Torvalds,
    Christian Brauner, David Howells, WangYuli, linux-fsdevel,
    linux-kernel, Shenoy, Gautham Ranjal, Neeraj.Upadhyay,
    Ananth.narayan, Alexey Gladkov

Hi!

On 03/03, Sapkal, Swapnil wrote:
>
> >but if you have time, could you check if this patch (with or without the
> >previous debugging patch) makes any difference? Just to be sure.
>
> Sure, I will give this a try.

Forget ;)

[...snip...]

> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
> +	unsigned int head, tail;
> +
> +	tail = READ_ONCE(pipe->tail);
> +	/*
> +	 * Since the unsigned arithmetic in this lockless preemptible context
> +	 * relies on the fact that the tail can never be ahead of head, read
> +	 * the head after the tail to ensure we've not missed any updates to
> +	 * the head. Reordering the reads can cause wraparounds and give the
> +	 * illusion that the pipe is full.
> +	 */
> +	smp_rmb();
> +	head = READ_ONCE(pipe->head);
>
>  	return !pipe_full(head, tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);

Ooh, thanks!!!

And sorry, can't work today. To be honest, I have some concerns, but
probably I am wrong... I'll return tomorrow.

In any case, finally we have a hint. Thank you both!

(btw, please look at pipe_poll).

Oleg.
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-03 9:46 ` Sapkal, Swapnil 2025-03-03 14:37 ` Mateusz Guzik 2025-03-03 16:49 ` Oleg Nesterov @ 2025-03-04 5:06 ` Hillf Danton 2025-03-04 5:35 ` K Prateek Nayak 2 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-04 5:06 UTC (permalink / raw) To: Sapkal, Swapnil Cc: Oleg Nesterov, K Prateek Nayak, Mateusz Guzik, Linus Torvalds, linux-fsdevel, linux-kernel On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@amd.com> > On 2/28/2025 10:03 PM, Oleg Nesterov wrote: > > And... I know, I know you already hate me ;) > > > > Not at all :) > > > but if you have time, could you check if this patch (with or without the > > previous debugging patch) makes any difference? Just to be sure. > > > > Sure, I will give this a try. > > But in the meanwhile me and Prateek tried some of the experiments in the weekend. > We were able to reproduce this issue on a third generation EPYC system as well as > on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+). > > We tried heavy hammered tracing approach over the weekend on top of your debug patch. > I have attached the debug patch below. With tracing we found the following case for > pipe_writable(): > > hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1 > > Here, > > head = 37 > tail = 38 > max_usage = 16 > pipe_full() returns 1. > > Between reading of head and later the tail, the tail seems to have moved ahead of the > head leading to wraparound. 
Applying the following changes I have not yet run into a > hang on the original machine where I first saw it: > > diff --git a/fs/pipe.c b/fs/pipe.c > index ce1af7592780..a1931c817822 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file) > /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ > static inline bool pipe_writable(const struct pipe_inode_info *pipe) > { > - unsigned int head = READ_ONCE(pipe->head); > - unsigned int tail = READ_ONCE(pipe->tail); > unsigned int max_usage = READ_ONCE(pipe->max_usage); > + unsigned int head, tail; > + > + tail = READ_ONCE(pipe->tail); > + /* > + * Since the unsigned arithmetic in this lockless preemptible context > + * relies on the fact that the tail can never be ahead of head, read > + * the head after the tail to ensure we've not missed any updates to > + * the head. Reordering the reads can cause wraparounds and give the > + * illusion that the pipe is full. > + */ > + smp_rmb(); > + head = READ_ONCE(pipe->head); > > return !pipe_full(head, tail, max_usage) || > !READ_ONCE(pipe->readers); > --- > > smp_rmb() on x86 is a nop and even without the barrier we were not able to > reproduce the hang even after 10000 iterations. > My $.02 that changes the wait condition. Not sure it makes sense for you. --- x/fs/pipe.c +++ y/fs/pipe.c @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io { struct file *filp = iocb->ki_filp; struct pipe_inode_info *pipe = filp->private_data; - unsigned int head; + unsigned int head, tail; ssize_t ret = 0; size_t total_len = iov_iter_count(from); ssize_t chars; @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io * after waiting we need to re-check whether the pipe * become empty while we dropped the lock. 
*/ + tail = pipe->tail; mutex_unlock(&pipe->mutex); if (was_empty) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); - wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); + wait_event_interruptible_exclusive(pipe->wr_wait, + !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail)); mutex_lock(&pipe->mutex); was_empty = pipe_empty(pipe->head, pipe->tail); wake_next_writer = true; -- ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 5:06 ` Hillf Danton @ 2025-03-04 5:35 ` K Prateek Nayak 2025-03-04 10:29 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-04 5:35 UTC (permalink / raw) To: Hillf Danton, Sapkal, Swapnil Cc: Oleg Nesterov, Mateusz Guzik, Linus Torvalds, linux-fsdevel, linux-kernel Hello Hillf, On 3/4/2025 10:36 AM, Hillf Danton wrote: > On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@amd.com> >> On 2/28/2025 10:03 PM, Oleg Nesterov wrote: >>> And... I know, I know you already hate me ;) >>> >> >> Not at all :) >> >>> but if you have time, could you check if this patch (with or without the >>> previous debugging patch) makes any difference? Just to be sure. >>> >> >> Sure, I will give this a try. >> >> But in the meanwhile me and Prateek tried some of the experiments in the weekend. >> We were able to reproduce this issue on a third generation EPYC system as well as >> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+). >> >> We tried heavy hammered tracing approach over the weekend on top of your debug patch. >> I have attached the debug patch below. With tracing we found the following case for >> pipe_writable(): >> >> hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1 >> >> Here, >> >> head = 37 >> tail = 38 >> max_usage = 16 >> pipe_full() returns 1. >> >> Between reading of head and later the tail, the tail seems to have moved ahead of the >> head leading to wraparound. 
Applying the following changes I have not yet run into a >> hang on the original machine where I first saw it: >> >> diff --git a/fs/pipe.c b/fs/pipe.c >> index ce1af7592780..a1931c817822 100644 >> --- a/fs/pipe.c >> +++ b/fs/pipe.c >> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file) >> /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ >> static inline bool pipe_writable(const struct pipe_inode_info *pipe) >> { >> - unsigned int head = READ_ONCE(pipe->head); >> - unsigned int tail = READ_ONCE(pipe->tail); >> unsigned int max_usage = READ_ONCE(pipe->max_usage); >> + unsigned int head, tail; >> + >> + tail = READ_ONCE(pipe->tail); >> + /* >> + * Since the unsigned arithmetic in this lockless preemptible context >> + * relies on the fact that the tail can never be ahead of head, read >> + * the head after the tail to ensure we've not missed any updates to >> + * the head. Reordering the reads can cause wraparounds and give the >> + * illusion that the pipe is full. >> + */ >> + smp_rmb(); >> + head = READ_ONCE(pipe->head); >> >> return !pipe_full(head, tail, max_usage) || >> !READ_ONCE(pipe->readers); >> --- >> >> smp_rmb() on x86 is a nop and even without the barrier we were not able to >> reproduce the hang even after 10000 iterations. >> > My $.02 that changes the wait condition. > Not sure it makes sense for you. > > --- x/fs/pipe.c > +++ y/fs/pipe.c > @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io > { > struct file *filp = iocb->ki_filp; > struct pipe_inode_info *pipe = filp->private_data; > - unsigned int head; > + unsigned int head, tail; > ssize_t ret = 0; > size_t total_len = iov_iter_count(from); > ssize_t chars; > @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io > * after waiting we need to re-check whether the pipe > * become empty while we dropped the lock. 
> */ > + tail = pipe->tail; > mutex_unlock(&pipe->mutex); > if (was_empty) > wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > - wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); > + wait_event_interruptible_exclusive(pipe->wr_wait, > + !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail)); That could work too for the case highlighted but in case the head too has moved by the time the writer wakes up, it'll lead to an extra wakeup. Linus' diff seems cleaner and seems to cover all racy scenarios. > mutex_lock(&pipe->mutex); > was_empty = pipe_empty(pipe->head, pipe->tail); > wake_next_writer = true; > -- -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 5:35 ` K Prateek Nayak @ 2025-03-04 10:29 ` Hillf Danton 2025-03-04 12:34 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-04 10:29 UTC (permalink / raw) To: K Prateek Nayak Cc: Oleg Nesterov, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Tue, 4 Mar 2025 11:05:57 +0530 K Prateek Nayak <kprateek.nayak@amd.com> >On 3/4/2025 10:36 AM, Hillf Danton wrote: >> On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@amd.com> >>> On 2/28/2025 10:03 PM, Oleg Nesterov wrote: >>>> And... I know, I know you already hate me ;) >>>> >>> >>> Not at all :) >>> >>>> but if you have time, could you check if this patch (with or without the >>>> previous debugging patch) makes any difference? Just to be sure. >>>> >>> >>> Sure, I will give this a try. >>> >>> But in the meanwhile me and Prateek tried some of the experiments in the weekend. >>> We were able to reproduce this issue on a third generation EPYC system as well as >>> on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+). >>> >>> We tried heavy hammered tracing approach over the weekend on top of your debug patch. >>> I have attached the debug patch below. With tracing we found the following case for >>> pipe_writable(): >>> >>> hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1 >>> >>> Here, >>> >>> head = 37 >>> tail = 38 >>> max_usage = 16 >>> pipe_full() returns 1. >>> >>> Between reading of head and later the tail, the tail seems to have moved ahead of the >>> head leading to wraparound. 
Applying the following changes I have not yet run into a >>> hang on the original machine where I first saw it: >>> >>> diff --git a/fs/pipe.c b/fs/pipe.c >>> index ce1af7592780..a1931c817822 100644 >>> --- a/fs/pipe.c >>> +++ b/fs/pipe.c >>> @@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file) >>> /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */ >>> static inline bool pipe_writable(const struct pipe_inode_info *pipe) >>> { >>> - unsigned int head = READ_ONCE(pipe->head); >>> - unsigned int tail = READ_ONCE(pipe->tail); >>> unsigned int max_usage = READ_ONCE(pipe->max_usage); >>> + unsigned int head, tail; >>> + >>> + tail = READ_ONCE(pipe->tail); >>> + /* >>> + * Since the unsigned arithmetic in this lockless preemptible context >>> + * relies on the fact that the tail can never be ahead of head, read >>> + * the head after the tail to ensure we've not missed any updates to >>> + * the head. Reordering the reads can cause wraparounds and give the >>> + * illusion that the pipe is full. >>> + */ >>> + smp_rmb(); >>> + head = READ_ONCE(pipe->head); >>> >>> return !pipe_full(head, tail, max_usage) || >>> !READ_ONCE(pipe->readers); >>> --- >>> >>> smp_rmb() on x86 is a nop and even without the barrier we were not able to >>> reproduce the hang even after 10000 iterations. >>> >> My $.02 that changes the wait condition. >> Not sure it makes sense for you. >> >> --- x/fs/pipe.c >> +++ y/fs/pipe.c >> @@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io >> { >> struct file *filp = iocb->ki_filp; >> struct pipe_inode_info *pipe = filp->private_data; >> - unsigned int head; >> + unsigned int head, tail; >> ssize_t ret = 0; >> size_t total_len = iov_iter_count(from); >> ssize_t chars; >> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io >> * after waiting we need to re-check whether the pipe >> * become empty while we dropped the lock. 
>> */ >> + tail = pipe->tail; >> mutex_unlock(&pipe->mutex); >> if (was_empty) >> wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); >> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); >> - wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); >> + wait_event_interruptible_exclusive(pipe->wr_wait, >> + !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail)); > >That could work too for the case highlighted but in case the head too >has moved by the time the writer wakes up, it'll lead to an extra >wakeup. > Note wakeup can occur even if pipe is full, and more important, taking the pipe lock after wakeup is the price paid for curing the hang in question. * So we still need to wake up any pending writers in the * _very_ unlikely case that the pipe was full, but we got * no data. */ >Linus' diff seems cleaner and seems to cover all racy scenarios. > >> mutex_lock(&pipe->mutex); >> was_empty = pipe_empty(pipe->head, pipe->tail); >> wake_next_writer = true; >> -- > >-- >Thanks and Regards, >Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 10:29 ` Hillf Danton @ 2025-03-04 12:34 ` Oleg Nesterov 2025-03-04 23:35 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-04 12:34 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/04, Hillf Danton wrote: > > On Tue, 4 Mar 2025 11:05:57 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > >> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io > >> * after waiting we need to re-check whether the pipe > >> * become empty while we dropped the lock. > >> */ > >> + tail = pipe->tail; > >> mutex_unlock(&pipe->mutex); > >> if (was_empty) > >> wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > >> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > >> - wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); > >> + wait_event_interruptible_exclusive(pipe->wr_wait, > >> + !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail)); > > > >That could work too for the case highlighted but in case the head too > >has moved by the time the writer wakes up, it'll lead to an extra > >wakeup. > > > Note wakeup can occur even if pipe is full, Perhaps I misunderstood you, but I don't think pipe_read() can ever do wake_up(pipe->wr_wait) if pipe is full... > * So we still need to wake up any pending writers in the > * _very_ unlikely case that the pipe was full, but we got > * no data. > */ Only if wake_writer is true, if (unlikely(wake_writer)) wake_up_interruptible_sync_poll(...); and in this case the pipe is no longer full. A zero-sized buffer was removed. Of course this pipe can be full again when the woken writer checks the condition, but this is another story. And in this case, with your proposed change, the woken writer will take pipe->mutex for no reason. 
And in this case, with your proposed change, the woken writer will take pipe->mutex for no reason. Note also that the comment and code above were already removed by https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 12:34 ` Oleg Nesterov @ 2025-03-04 23:35 ` Hillf Danton 2025-03-04 23:49 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-04 23:35 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Tue, 4 Mar 2025 13:34:57 +0100 Oleg Nesterov <oleg@redhat.com> > On 03/04, Hillf Danton wrote: > > On Tue, 4 Mar 2025 11:05:57 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > > >> @@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io > > >> * after waiting we need to re-check whether the pipe > > >> * become empty while we dropped the lock. > > >> */ > > >> + tail = pipe->tail; > > >> mutex_unlock(&pipe->mutex); > > >> if (was_empty) > > >> wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > > >> kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); > > >> - wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); > > >> + wait_event_interruptible_exclusive(pipe->wr_wait, > > >> + !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail)); > > > > > >That could work too for the case highlighted but in case the head too > > >has moved by the time the writer wakes up, it'll lead to an extra > > >wakeup. > > > > > Note wakeup can occur even if pipe is full, > > Perhaps I misunderstood you, but I don't think pipe_read() can ever do > wake_up(pipe->wr_wait) if pipe is full... > > > * So we still need to wake up any pending writers in the > > * _very_ unlikely case that the pipe was full, but we got > > * no data. > > */ > > Only if wake_writer is true, > > if (unlikely(wake_writer)) > wake_up_interruptible_sync_poll(...); > > and in this case the pipe is no longer full. A zero-sized buffer was > removed. > > Of course this pipe can be full again when the woken writer checks the > condition, but this is another story. 
And in this case, with your
> proposed change, the woken writer will take pipe->mutex for no reason.
>
See the following sequence,

1) waker makes full false
2) waker makes full true
3) waiter checks full
4) waker makes full false

waiter has no real idea of full without lock held, perhaps regardless of
the code cut below.

> Note also that the comment and code above was already removed by
> https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/
>
> Oleg.

^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 23:35 ` Hillf Danton @ 2025-03-04 23:49 ` Oleg Nesterov 2025-03-05 4:56 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-04 23:49 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/05, Hillf Danton wrote: > > On Tue, 4 Mar 2025 13:34:57 +0100 Oleg Nesterov <oleg@redhat.com> > > > > > > > Note wakeup can occur even if pipe is full, > > > > Perhaps I misunderstood you, but I don't think pipe_read() can ever do > > wake_up(pipe->wr_wait) if pipe is full... > > > > > * So we still need to wake up any pending writers in the > > > * _very_ unlikely case that the pipe was full, but we got > > > * no data. > > > */ > > > > Only if wake_writer is true, > > > > if (unlikely(wake_writer)) > > wake_up_interruptible_sync_poll(...); > > > > and in this case the pipe is no longer full. A zero-sized buffer was > > removed. > > > > Of course this pipe can be full again when the woken writer checks the > > condition, but this is another story. And in this case, with your > > proposed change, the woken writer will take pipe->mutex for no reason. > > > See the following sequence, > > 1) waker makes full false > 2) waker makes full true > 3) waiter checks full > 4) waker makes full false I don't really understand this sequence, but > waiter has no real idea of full without lock held, perhaps regardless > the code cut below. Of course! Again, whatever the woken writer checks in pipe_writable() lockless, another writer can make pipe_full() true again. But why do we care? Why do you think that the change you propose makes more sense than the fix from Prateek or the (already merged) Linus's fix? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-04 23:49 ` Oleg Nesterov @ 2025-03-05 4:56 ` Hillf Danton 2025-03-05 11:44 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-05 4:56 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Wed, 5 Mar 2025 00:49:09 +0100 Oleg Nesterov <oleg@redhat.com> > > Of course! Again, whatever the woken writer checks in pipe_writable() > lockless, another writer can make pipe_full() true again. > > But why do we care? Why do you think that the change you propose makes Because of the hang reported. > more sense than the fix from Prateek or the (already merged) Linus's fix? > See the loop in ___wait_event(), for (;;) { prepare_to_wait_event(); // flip if (condition) break; schedule(); } After wakeup, waiter will sleep again if condition flips false on the waker side before waiter checks condition, even if condition is atomic, no? ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 4:56 ` Hillf Danton @ 2025-03-05 11:44 ` Oleg Nesterov 2025-03-05 22:46 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-05 11:44 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel Hi Hillf, again, I am not sure we understand each other, at least me... On 03/05, Hillf Danton wrote: > > On Wed, 5 Mar 2025 00:49:09 +0100 Oleg Nesterov <oleg@redhat.com> > > > > Of course! Again, whatever the woken writer checks in pipe_writable() > > lockless, another writer can make pipe_full() true again. > > > > But why do we care? Why do you think that the change you propose makes > > Because of the hang reported. The hang happened because pipe_writable() could wrongly return false when the buffer is not full. Afaics, the Prateek's or Linus's fix solve this problem, and this is all we need. > > more sense than the fix from Prateek or the (already merged) Linus's fix? > > > See the loop in ___wait_event(), > > for (;;) { > prepare_to_wait_event(); > > // flip > if (condition) > break; > > schedule(); > } I will assume that this "// flip" means the case I described above: before this writer checks the condition, another writer comes and increments pipe->head. > After wakeup, waiter will sleep again if condition flips false on the waker > side before waiter checks condition, even if condition is atomic, no? Yes, but in this case pipe_full() == true is correct, this writer can safely sleep. Even if flips again and becomes false right after the writer called pipe_writable(). Note that it checks the condition after set_current_state(INTERRUPTIBLE) and it is still on the pipe->wr_wait->head list. The 2nd "flip" is only possible if some reader flushes another buffer and updates pipe->tail, and in this case it will wake this writer again. 
So I still can't understand your concerns, sorry. Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
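[Editorial note: Oleg's argument rests on the ordering inside wait_event_interruptible_exclusive(). A simplified, non-runnable expansion (modeled loosely on include/linux/wait.h; error handling and the exclusive-queue bookkeeping elided) makes the window he describes explicit:]

```c
/* Sketch of wait_event_interruptible_exclusive(pipe->wr_wait,
 * pipe_writable(pipe)); not runnable as-is. */
for (;;) {
	/* Put the writer on pipe->wr_wait and set TASK_INTERRUPTIBLE
	 * *before* the condition is tested ... */
	prepare_to_wait_event(&pipe->wr_wait, &wq_entry, TASK_INTERRUPTIBLE);

	/* ... so if pipe_full() becomes false after this check, the
	 * reader's wake_up(&pipe->wr_wait) is guaranteed to see this
	 * entry and wake the writer. */
	if (pipe_writable(pipe))
		break;

	schedule();	/* sleep until the next wake_up(&pipe->wr_wait) */
}
finish_wait(&pipe->wr_wait, &wq_entry);
```

The second "flip" (full becoming false again) can only come from a reader advancing pipe->tail, and that path wakes wr_wait, so a writer that observed pipe_full() == true here cannot sleep past the wakeup.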
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 11:44 ` Oleg Nesterov @ 2025-03-05 22:46 ` Hillf Danton 2025-03-06 9:30 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-05 22:46 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Wed, 5 Mar 2025 12:44:34 +0100 Oleg Nesterov <oleg@redhat.com> > On 03/05, Hillf Danton wrote: > > See the loop in ___wait_event(), > > > > for (;;) { > > prepare_to_wait_event(); > > > > // flip > > if (condition) > > break; > > > > schedule(); > > } > > > > After wakeup, waiter will sleep again if condition flips false on the waker > > side before waiter checks condition, even if condition is atomic, no? > > Yes, but in this case pipe_full() == true is correct, this writer can > safely sleep. > No, because no reader is woken up before sleep to make pipe not full. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-05 22:46 ` Hillf Danton @ 2025-03-06 9:30 ` Oleg Nesterov 2025-03-07 6:08 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-06 9:30 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/06, Hillf Danton wrote: > > On Wed, 5 Mar 2025 12:44:34 +0100 Oleg Nesterov <oleg@redhat.com> > > On 03/05, Hillf Danton wrote: > > > See the loop in ___wait_event(), > > > > > > for (;;) { > > > prepare_to_wait_event(); > > > > > > // flip > > > if (condition) > > > break; > > > > > > schedule(); > > > } > > > > > > After wakeup, waiter will sleep again if condition flips false on the waker > > > side before waiter checks condition, even if condition is atomic, no? > > > > Yes, but in this case pipe_full() == true is correct, this writer can > > safely sleep. > > > No, because no reader is woken up before sleep to make pipe not full. Why the reader should be woken before this writer sleeps? Why the reader should be woken at all in this case (when pipe is full again) ? We certainly can't understand each other. Could your picture the exact scenario/sequence which can hang? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-06 9:30 ` Oleg Nesterov @ 2025-03-07 6:08 ` Hillf Danton 2025-03-07 6:24 ` K Prateek Nayak 2025-03-07 11:26 ` Oleg Nesterov 0 siblings, 2 replies; 109+ messages in thread From: Hillf Danton @ 2025-03-07 6:08 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Thu, 6 Mar 2025 10:30:21 +0100 Oleg Nesterov <oleg@redhat.com> > On 03/06, Hillf Danton wrote: > > On Wed, 5 Mar 2025 12:44:34 +0100 Oleg Nesterov <oleg@redhat.com> > > > On 03/05, Hillf Danton wrote: > > > > See the loop in ___wait_event(), > > > > > > > > for (;;) { > > > > prepare_to_wait_event(); > > > > > > > > // flip > > > > if (condition) > > > > break; > > > > > > > > schedule(); > > > > } > > > > > > > > After wakeup, waiter will sleep again if condition flips false on the waker > > > > side before waiter checks condition, even if condition is atomic, no? > > > > > > Yes, but in this case pipe_full() == true is correct, this writer can > > > safely sleep. > > > > > No, because no reader is woken up before sleep to make pipe not full. > > Why the reader should be woken before this writer sleeps? Why the reader > should be woken at all in this case (when pipe is full again) ? > "to make pipe not full" failed to prevent you asking questions like this one. > We certainly can't understand each other. > > Could your picture the exact scenario/sequence which can hang? > If you think the scenario in commit 3d252160b818 [1] is correct, check the following one. 
step-00
pipe->head = 36
pipe->tail = 36
after 3d252160b818

step-01
task-118762 writer
pipe->head++;
wakes up task-118740 and task-118768

step-02
task-118768 writer
makes pipe full;
sleeps without waking up any reader as
pipe was not empty after step-01

step-03
task-118766 new reader
makes pipe empty
sleeps

step-04
task-118740 reader
sleeps as pipe is empty

[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]

[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/pipe.c?id=3d252160b818045f3a152b13756f6f37ca34639d

^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 6:08 ` Hillf Danton @ 2025-03-07 6:24 ` K Prateek Nayak 2025-03-07 10:46 ` Hillf Danton 2025-03-07 11:26 ` Oleg Nesterov 1 sibling, 1 reply; 109+ messages in thread From: K Prateek Nayak @ 2025-03-07 6:24 UTC (permalink / raw) To: Hillf Danton, Oleg Nesterov Cc: Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel Hello Hiilf, On 3/7/2025 11:38 AM, Hillf Danton wrote: > On Thu, 6 Mar 2025 10:30:21 +0100 Oleg Nesterov <oleg@redhat.com> >> On 03/06, Hillf Danton wrote: >>> On Wed, 5 Mar 2025 12:44:34 +0100 Oleg Nesterov <oleg@redhat.com> >>>> On 03/05, Hillf Danton wrote: >>>>> See the loop in ___wait_event(), >>>>> >>>>> for (;;) { >>>>> prepare_to_wait_event(); >>>>> >>>>> // flip >>>>> if (condition) >>>>> break; >>>>> >>>>> schedule(); >>>>> } >>>>> >>>>> After wakeup, waiter will sleep again if condition flips false on the waker >>>>> side before waiter checks condition, even if condition is atomic, no? >>>> >>>> Yes, but in this case pipe_full() == true is correct, this writer can >>>> safely sleep. >>>> >>> No, because no reader is woken up before sleep to make pipe not full. >> >> Why the reader should be woken before this writer sleeps? Why the reader >> should be woken at all in this case (when pipe is full again) ? >> > "to make pipe not full" failed to prevent you asking questions like this one. > >> We certainly can't understand each other. >> >> Could your picture the exact scenario/sequence which can hang? >> > If you think the scenario in commit 3d252160b818 [1] is correct, check > the following one. 
> > step-00 > pipe->head = 36 > pipe->tail = 36 > after 3d252160b818 > > step-01 > task-118762 writer > pipe->head++; > wakes up task-118740 and task-118768 > > step-02 > task-118768 writer > makes pipe full; > sleeps without waking up any reader as > pipe was not empty after step-01 > > step-03 > task-118766 new reader > makes pipe empty Reader seeing a pipe full should wake up a writer allowing 118768 to wakeup again and fill the pipe. Am I missing something? > sleeps > > step-04 > task-118740 reader > sleeps as pipe is empty > > [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] > > > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/pipe.c?id=3d252160b818045f3a152b13756f6f37ca34639d -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 6:24 ` K Prateek Nayak @ 2025-03-07 10:46 ` Hillf Danton 2025-03-07 11:29 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-07 10:46 UTC (permalink / raw) To: K Prateek Nayak Cc: Hillf Danton, Oleg Nesterov, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> On 3/7/2025 11:38 AM, Hillf Danton wrote: >> On Thu, 6 Mar 2025 10:30:21 +0100 Oleg Nesterov <oleg@redhat.com> >>> On 03/06, Hillf Danton wrote: >>>> On Wed, 5 Mar 2025 12:44:34 +0100 Oleg Nesterov <oleg@redhat.com> >>>>> On 03/05, Hillf Danton wrote: >>>>>> See the loop in ___wait_event(), >>>>>> >>>>>> for (;;) { >>>>>> prepare_to_wait_event(); >>>>>> >>>>>> // flip >>>>>> if (condition) >>>>>> break; >>>>>> >>>>>> schedule(); >>>>>> } >>>>>> >>>>>> After wakeup, waiter will sleep again if condition flips false on the waker >>>>>> side before waiter checks condition, even if condition is atomic, no? >>>>> >>>>> Yes, but in this case pipe_full() == true is correct, this writer can >>>>> safely sleep. >>>>> >>>> No, because no reader is woken up before sleep to make pipe not full. >>> >>> Why the reader should be woken before this writer sleeps? Why the reader >>> should be woken at all in this case (when pipe is full again) ? >>> >> "to make pipe not full" failed to prevent you asking questions like this one. >> >>> We certainly can't understand each other. >>> >>> Could your picture the exact scenario/sequence which can hang? >>> >> If you think the scenario in commit 3d252160b818 [1] is correct, check >> the following one. 
>> >> step-00 >> pipe->head = 36 >> pipe->tail = 36 >> after 3d252160b818 >> >> step-01 >> task-118762 writer >> pipe->head++; >> wakes up task-118740 and task-118768 >> >> step-02 >> task-118768 writer >> makes pipe full; >> sleeps without waking up any reader as >> pipe was not empty after step-01 >> >> step-03 >> task-118766 new reader >> makes pipe empty > >Reader seeing a pipe full should wake up a writer allowing 118768 to >wakeup again and fill the pipe. Am I missing something? > Good catch, but that wakeup was cut off [2,3] [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/ [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ >> sleeps >> >> step-04 >> task-118740 reader >> sleeps as pipe is empty >> >> [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] >> >> >> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/pipe.c?id=3d252160b818045f3a152b13756f6f37ca34639d > >-- >Thanks and Regards, >Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 10:46 ` Hillf Danton @ 2025-03-07 11:29 ` Oleg Nesterov 2025-03-07 12:34 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-07 11:29 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/07, Hillf Danton wrote: > > On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > >> step-03 > >> task-118766 new reader > >> makes pipe empty > > > >Reader seeing a pipe full should wake up a writer allowing 118768 to > >wakeup again and fill the pipe. Am I missing something? > > > Good catch, but that wakeup was cut off [2,3] > > [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/ > [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ Why do you think [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ can make any difference ??? Where do you think a zero-sized buffer with ->len == 0 can come from? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 11:29 ` Oleg Nesterov @ 2025-03-07 12:34 ` Oleg Nesterov 2025-03-07 23:56 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-07 12:34 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel In case I wasn't clear... On 03/07, Oleg Nesterov wrote: > > On 03/07, Hillf Danton wrote: > > > > On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > > >> step-03 > > >> task-118766 new reader > > >> makes pipe empty > > > > > >Reader seeing a pipe full should wake up a writer allowing 118768 to > > >wakeup again and fill the pipe. Am I missing something? > > > > > Good catch, but that wakeup was cut off [2,3] Please note that "that wakeup" was _not_ removed by the patch below. "That wakeup" is another wakeup pipe_read() does before return: if (wake_writer) wake_up_interruptible_sync_poll(&pipe->wr_wait, ...); And wake_writer must be true if this reader changed the pipe_full() condition from T to F. Note also that pipe_read() won't sleep if it has read even one byte. > > [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/ > > [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ > > Why do you think > > [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer > https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ > > can make any difference ??? > > Where do you think a zero-sized buffer with ->len == 0 can come from? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 12:34 ` Oleg Nesterov @ 2025-03-07 23:56 ` Hillf Danton 2025-03-09 14:01 ` K Prateek Nayak 2025-03-09 17:02 ` Oleg Nesterov 0 siblings, 2 replies; 109+ messages in thread From: Hillf Danton @ 2025-03-07 23:56 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Fri, 7 Mar 2025 13:34:43 +0100 Oleg Nesterov <oleg@redhat.com> > On 03/07, Oleg Nesterov wrote: > > On 03/07, Hillf Danton wrote: > > > On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > > > >> step-03 > > > >> task-118766 new reader > > > >> makes pipe empty > > > > > > > >Reader seeing a pipe full should wake up a writer allowing 118768 to > > > >wakeup again and fill the pipe. Am I missing something? > > > > > > > Good catch, but that wakeup was cut off [2,3] > > Please note that "that wakeup" was _not_ removed by the patch below. > After another look, you did cut it. Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> --- fs/pipe.c | 45 +++++++++------------------------------------ 1 file changed, 9 insertions(+), 36 deletions(-) diff --git a/fs/pipe.c b/fs/pipe.c index 2ae75adfba64..b0641f75b1ba 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) break; } mutex_unlock(&pipe->mutex); - /* * We only get here if we didn't actually read anything. * - * However, we could have seen (and removed) a zero-sized - * pipe buffer, and might have made space in the buffers - * that way. - * - * You can't make zero-sized pipe buffers by doing an empty - * write (not even in packet mode), but they can happen if - * the writer gets an EFAULT when trying to fill a buffer - * that already got allocated and inserted in the buffer - * array. 
- * - * So we still need to wake up any pending writers in the - * _very_ unlikely case that the pipe was full, but we got - * no data. - */ - if (unlikely(wake_writer)) - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - - /* * But because we didn't read anything, at this point we can * just return directly with -ERESTARTSYS if we're interrupted, * since we've done any required wakeups and there's no need @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) return -ERESTARTSYS; - wake_writer = false; wake_next_reader = true; mutex_lock(&pipe->mutex); } > "That wakeup" is another wakeup pipe_read() does before return: > > if (wake_writer) > wake_up_interruptible_sync_poll(&pipe->wr_wait, ...); > > And wake_writer must be true if this reader changed the pipe_full() > condition from T to F. > Could you read Prateek's comment again, then try to work out why he did so? > Note also that pipe_read() won't sleep if it has read even one byte. > > > > [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/ > > > [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ > > > > Why do you think > > > > [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer > > https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ > > > > can make any difference ??? > > > > Where do you think a zero-sized buffer with ->len == 0 can come from? > > Oleg. ^ permalink raw reply related [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 23:56 ` Hillf Danton @ 2025-03-09 14:01 ` K Prateek Nayak 2025-03-09 17:02 ` Oleg Nesterov 1 sibling, 0 replies; 109+ messages in thread From: K Prateek Nayak @ 2025-03-09 14:01 UTC (permalink / raw) To: Hillf Danton, Oleg Nesterov Cc: Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel Hello Hillf, On 3/8/2025 5:26 AM, Hillf Danton wrote: > On Fri, 7 Mar 2025 13:34:43 +0100 Oleg Nesterov <oleg@redhat.com> >> On 03/07, Oleg Nesterov wrote: >>> On 03/07, Hillf Danton wrote: >>>> On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> >>>>>> step-03 >>>>>> task-118766 new reader >>>>>> makes pipe empty >>>>> >>>>> Reader seeing a pipe full should wake up a writer allowing 118768 to >>>>> wakeup again and fill the pipe. Am I missing something? >>>>> >>>> Good catch, but that wakeup was cut off [2,3] >> >> Please note that "that wakeup" was _not_ removed by the patch below. >> > After another look, you did cut it. > > Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/ > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> So that is not problematic because pipe_write() no longer increments the head before a successful write. What also changed in that patch above is the order in which we do copy_page_from_iter() and head increment - now, the head is incremented only if copy_page_from_iter() actually manages to write data into the buffer which eliminates the need to do a wakeup if nothing was found in the buffer ... 
> --- > fs/pipe.c | 45 +++++++++------------------------------------ > 1 file changed, 9 insertions(+), 36 deletions(-) > > diff --git a/fs/pipe.c b/fs/pipe.c > index 2ae75adfba64..b0641f75b1ba 100644 > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) Above this, we have: if (!total_len) break; /* common path: read succeeded */ if (!pipe_empty(head, tail)) /* More to do? */ continue; And if a read is successful, one of them must be hit and with that reordering in pipe_write(), and the readers must be able to wake a writer if !pipe_empty() ... > break; > } > mutex_unlock(&pipe->mutex); > - > /* > * We only get here if we didn't actually read anything. > * > - * However, we could have seen (and removed) a zero-sized > - * pipe buffer, and might have made space in the buffers > - * that way. > - * > - * You can't make zero-sized pipe buffers by doing an empty > - * write (not even in packet mode), but they can happen if > - * the writer gets an EFAULT when trying to fill a buffer > - * that already got allocated and inserted in the buffer > - * array. > - * > - * So we still need to wake up any pending writers in the > - * _very_ unlikely case that the pipe was full, but we got > - * no data. 
> - */ > - if (unlikely(wake_writer)) > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > - > - /* > * But because we didn't read anything, at this point we can > * just return directly with -ERESTARTSYS if we're interrupted, > * since we've done any required wakeups and there's no need > @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) > if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > return -ERESTARTSYS; > > - wake_writer = false; > wake_next_reader = true; > mutex_lock(&pipe->mutex); > } > >> "That wakeup" is another wakeup pipe_read() does before return: >> >> if (wake_writer) >> wake_up_interruptible_sync_poll(&pipe->wr_wait, ...); >> >> And wake_writer must be true if this reader changed the pipe_full() >> condition from T to F. >> > Could you read Prateek's comment again, then try to work out why he > did so? > >> Note also that pipe_read() won't sleep if it has read even one byte. and that is the key to why Oleg's optimization highlighted above cannot cause a hang (as far as my imagination goes). Now, let us take a closer look at the whole sleep and wakeup mechanism. I'll expand the one for pipe_write() since that was the problematic bit we analyzed, but you can do the same for the pipe_read() side too.
The wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); in pipe_write() will boil down to these bits (I'll only highlight the important ones): ret = 0; might_sleep() if (!pipe_writable()) { /* First line of defense */ init_wait_entry(&__wq_entry, WQ_FLAG_EXCLUSIVE); for (;;) { /* Below is expansion of prepare_to_wait_event() success case */ spin_lock_irqsave(&wq_head->lock, flags); __add_wait_queue_entry_tail(wq_head, __wq_entry); set_current_state(TASK_INTERRUPTIBLE); spin_unlock_irqrestore(&wq_head->lock, flags); /* Second line of defense */ if (pipe_writable()) break; schedule(); } finish_wait(&wq_head, &__wq_entry); } return 0; The sequence for the writer wait is: o Add itself to the wait queue. o Set task state to TASK_INTERRUPTIBLE. o Check pipe_writable() one last time. o Call schedule() Now why can't we miss a wakeup? The reader increments the tail, and then does a wakeup of waiting writers in all cases where it finds pipe_full(). The wakeup accesses the wait queue, which does: spin_lock_irqsave(&wq_head->lock, flags); __wake_up_common(); spin_unlock_irqrestore(&wq_head->lock, flags); so adding to the wait queue and waking up from the wait queue are always done under the wait queue lock. Now there are two non-trivial cases: o This is the simple case: writer reader ====== ====== if (!pipe_writable()) /* True */ { for (;;) { ... wake_writers = pipe_full(); /* True */ tail = tail + 1 wake_up_interruptible_sync_poll(&pipe->wr_wait) { /* wr_wait is empty */ } ... /* Adds itself on the wait queue */ writer->__state = TASK_INTERRUPTIBLE; if (pipe_writable()) /* True */ { break; /* * Goes and does finish_wait() */ } } finish_wait() { p->__state = TASK_RUNNING; } /* * Goes and does a check under * pipe->mutex to be sure.
*/ } o This is the slightly more complicated case: writer reader ====== ====== if (!pipe_writable()) /* True */ { for (;;) { /* Adds itself on the wait queue */ writer->__state = TASK_INTERRUPTIBLE; if (pipe_writable()) /* False */ { /* The break is not executed */ } ... wake_writers = pipe_full(); /* True */ tail = tail + 1 wake_up_interruptible_sync_poll(&pipe->wr_wait) { default_wake_function() { ttwu_runnable() { /* Calls ttwu_do_wakeup() which does */ p->__state = TASK_RUNNING } } } ... schedule() { if (p->__state) /* False */ { /* * Never called since p->__state * is RUNNING */ try_to_block_task(); } } ... repeat the loop if (pipe_writable()) /* True */ { break; /* * Goes and does finish_wait() */ } } finish_wait() { p->__state = TASK_RUNNING; } /* * Goes and does a check under * pipe->mutex to be sure. */ } All in all, the writer will either see pipe_writable() and never fully block, or it will add itself to the wait queue and be woken by a reader after moving pipe->tail if pipe_full() returned true before. If I've missed something, please let me know and I'll try to go back and convince myself of how that situation can happen and get back to you, but so far I cannot think of a situation where the pipe can hang after the recent set of fixes and optimizations. >> >>>> [2] https://lore.kernel.org/lkml/20250304123457.GA25281@redhat.com/ >>>> [3] https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ >>> >>> Why do you think >>> >>> [PATCH v2 1/1] pipe: change pipe_write() to never add a zero-sized buffer >>> https://lore.kernel.org/all/20250210114039.GA3588@redhat.com/ >>> >>> can make any difference ??? >>> >>> Where do you think a zero-sized buffer with ->len == 0 can come from?
I audited the post_one_notification() code and the paths that lead up to it: o remove_watch_from_object() will have data of size watch_sizeof(n), so that buffer size is definitely !0 (unless (watch_sizeof(n) & 0x7f) is zero, but what are the odds of that?) o The other is __post_watch_notification(), which has an early return for ((n->info & WATCH_INFO_LENGTH) >> WATCH_INFO_LENGTH__SHIFT) == 0 and also a WARN_ON(), so it'll not add an empty buffer and will instead scream at the users if someone tries to do it. The splice() cases cannot do this either, based on my understanding, so I don't think we can run into the issue, but I'm limited by my own imagination :) >> >> Oleg. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 23:56 ` Hillf Danton 2025-03-09 14:01 ` K Prateek Nayak @ 2025-03-09 17:02 ` Oleg Nesterov 2025-03-10 10:49 ` Hillf Danton 1 sibling, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-09 17:02 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel Well. Prateek has already provided the lengthy/thorough explanation, but let me add anyway... On 03/08, Hillf Danton wrote: > > On Fri, 7 Mar 2025 13:34:43 +0100 Oleg Nesterov <oleg@redhat.com> > > On 03/07, Oleg Nesterov wrote: > > > On 03/07, Hillf Danton wrote: > > > > On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > > > > >> step-03 > > > > >> task-118766 new reader > > > > >> makes pipe empty > > > > > > > > > >Reader seeing a pipe full should wake up a writer allowing 118768 to > > > > >wakeup again and fill the pipe. Am I missing something? > > > > > > > > > Good catch, but that wakeup was cut off [2,3] > > > > Please note that "that wakeup" was _not_ removed by the patch below. > > > After another look, you did cut it. I still don't think so. > Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/ ... > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) > break; > } > mutex_unlock(&pipe->mutex); > - > /* > * We only get here if we didn't actually read anything. > * > - * However, we could have seen (and removed) a zero-sized > - * pipe buffer, and might have made space in the buffers > - * that way.
> - * > - * So we still need to wake up any pending writers in the > - * _very_ unlikely case that the pipe was full, but we got > - * no data. > - */ > - if (unlikely(wake_writer)) > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > - > - /* > * But because we didn't read anything, at this point we can > * just return directly with -ERESTARTSYS if we're interrupted, > * since we've done any required wakeups and there's no need > @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) > if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > return -ERESTARTSYS; > > - wake_writer = false; > wake_next_reader = true; > mutex_lock(&pipe->mutex); > } Please note that in this particular case (hackbench testing) pipe_write() -> copy_page_from_iter() never fails. So wake_writer is never true before pipe_reader() calls wait_event(pipe->rd_wait). So (again, in this particular case) we could apply the patch below on top of Linus's tree. So, with or without these changes, the writer should be woken up at step-03 in your scenario. Oleg. --- --- a/fs/pipe.c +++ b/fs/pipe.c @@ -360,27 +360,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) } mutex_unlock(&pipe->mutex); - /* - * We only get here if we didn't actually read anything. - * - * However, we could have seen (and removed) a zero-sized - * pipe buffer, and might have made space in the buffers - * that way. - * - * You can't make zero-sized pipe buffers by doing an empty - * write (not even in packet mode), but they can happen if - * the writer gets an EFAULT when trying to fill a buffer - * that already got allocated and inserted in the buffer - * array. - * - * So we still need to wake up any pending writers in the - * _very_ unlikely case that the pipe was full, but we got - * no data. 
- */ - if (unlikely(wake_writer)) - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - + BUG_ON(wake_writer); /* * But because we didn't read anything, at this point we can * just return directly with -ERESTARTSYS if we're interrupted, ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-09 17:02 ` Oleg Nesterov @ 2025-03-10 10:49 ` Hillf Danton 2025-03-10 11:09 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-10 10:49 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Sun, 9 Mar 2025 18:02:55 +0100 Oleg Nesterov > > Well. Prateek has already provide the lengthy/thorough explanation, > but let me add anyway... > lengthy != correct > On 03/08, Hillf Danton wrote: > > On Fri, 7 Mar 2025 13:34:43 +0100 Oleg Nesterov <oleg@redhat.com> > > > On 03/07, Oleg Nesterov wrote: > > > > On 03/07, Hillf Danton wrote: > > > > > On Fri, 7 Mar 2025 11:54:56 +0530 K Prateek Nayak <kprateek.nayak@amd.com> > > > > > >> step-03 > > > > > >> task-118766 new reader > > > > > >> makes pipe empty > > > > > > > > > > > >Reader seeing a pipe full should wake up a writer allowing 118768 to > > > > > >wakeup again and fill the pipe. Am I missing something? > > > > > > > > > > > Good catch, but that wakeup was cut off [2,3] > > > > > > Please note that "that wakeup" was _not_ removed by the patch below. > > > > > After another look, you did cut it. > > I still don't think so. > > > Link: https://lore.kernel.org/all/20250209150718.GA17013@redhat.com/ > ... > > --- a/fs/pipe.c > > +++ b/fs/pipe.c > > @@ -360,29 +360,9 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) > > break; > > } > > mutex_unlock(&pipe->mutex); > > - > > /* > > * We only get here if we didn't actually read anything. > > * > > - * However, we could have seen (and removed) a zero-sized > > - * pipe buffer, and might have made space in the buffers > > - * that way. 
> > - * > > - * You can't make zero-sized pipe buffers by doing an empty > > - * write (not even in packet mode), but they can happen if > > - * the writer gets an EFAULT when trying to fill a buffer > > - * that already got allocated and inserted in the buffer > > - * array. > > - * > > - * So we still need to wake up any pending writers in the > > - * _very_ unlikely case that the pipe was full, but we got > > - * no data. > > - */ > > - if (unlikely(wake_writer)) > > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > > - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > > - > > - /* > > * But because we didn't read anything, at this point we can > > * just return directly with -ERESTARTSYS if we're interrupted, > > * since we've done any required wakeups and there's no need > > @@ -391,7 +371,6 @@ anon_pipe_read(struct kiocb *iocb, struct iov_iter *to) > > if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > > return -ERESTARTSYS; > > > > - wake_writer = false; > > wake_next_reader = true; > > mutex_lock(&pipe->mutex); > > } > > Please note that in this particular case (hackbench testing) > pipe_write() -> copy_page_from_iter() never fails. So wake_writer is > never true before pipe_reader() calls wait_event(pipe->rd_wait). > Given never and the BUG_ON below, you accidentally prove that Prateek's comment is false, no? > So (again, in this particular case) we could apply the patch below > on top of Linus's tree. > > So, with or without these changes, the writer should be woken up at > step-03 in your scenario. > Fine, before checking my scenario once more, feel free to pinpoint the line number where writer is woken up, with the change below applied. > Oleg. > --- > > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -360,27 +360,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > } > mutex_unlock(&pipe->mutex); > > - /* > - * We only get here if we didn't actually read anything. 
> - * > - * However, we could have seen (and removed) a zero-sized > - * pipe buffer, and might have made space in the buffers > - * that way. > - * > - * You can't make zero-sized pipe buffers by doing an empty > - * write (not even in packet mode), but they can happen if > - * the writer gets an EFAULT when trying to fill a buffer > - * that already got allocated and inserted in the buffer > - * array. > - * > - * So we still need to wake up any pending writers in the > - * _very_ unlikely case that the pipe was full, but we got > - * no data. > - */ > - if (unlikely(wake_writer)) > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > - > + BUG_ON(wake_writer); > /* > * But because we didn't read anything, at this point we can > * just return directly with -ERESTARTSYS if we're interrupted, > > ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 10:49 ` Hillf Danton @ 2025-03-10 11:09 ` Oleg Nesterov 2025-03-10 11:37 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-10 11:09 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/10, Hillf Danton wrote: > > On Sun, 9 Mar 2025 18:02:55 +0100 Oleg Nesterov > > > > So (again, in this particular case) we could apply the patch below > > on top of Linus's tree. > > > > So, with or without these changes, the writer should be woken up at > > step-03 in your scenario. > > > Fine, before checking my scenario once more, feel free to pinpoint the > line number where writer is woken up, with the change below applied. 381 if (wake_writer) ==> 382 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); 383 if (wake_next_reader) 384 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); 385 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); 386 if (ret > 0) 387 file_accessed(filp); 388 return ret; line 382, no? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 11:09 ` Oleg Nesterov @ 2025-03-10 11:37 ` Hillf Danton 2025-03-10 12:43 ` Oleg Nesterov 0 siblings, 1 reply; 109+ messages in thread From: Hillf Danton @ 2025-03-10 11:37 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Mon, 10 Mar 2025 12:09:15 +0100 Oleg Nesterov > On 03/10, Hillf Danton wrote: > > On Sun, 9 Mar 2025 18:02:55 +0100 Oleg Nesterov > > > > > > So (again, in this particular case) we could apply the patch below > > > on top of Linus's tree. > > > > > > So, with or without these changes, the writer should be woken up at > > > step-03 in your scenario. > > > > > Fine, before checking my scenario once more, feel free to pinpoint the > > line number where writer is woken up, with the change below applied. > > 381 if (wake_writer) > ==> 382 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > 383 if (wake_next_reader) > 384 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > 385 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > 386 if (ret > 0) > 387 file_accessed(filp); > 388 return ret; > > line 382, no? > Yes, but how is the wait loop at line-370 broken? 360 } 361 mutex_unlock(&pipe->mutex); 362 363 BUG_ON(wake_writer); 364 /* 365 * But because we didn't read anything, at this point we can 366 * just return directly with -ERESTARTSYS if we're interrupted, 367 * since we've done any required wakeups and there's no need 368 * to mark anything accessed. And we've dropped the lock. 369 */ 370 if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) 371 return -ERESTARTSYS; 372 373 wake_writer = false; 374 wake_next_reader = true; 375 mutex_lock(&pipe->mutex); ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 11:37 ` Hillf Danton @ 2025-03-10 12:43 ` Oleg Nesterov 2025-03-10 23:33 ` Hillf Danton 0 siblings, 1 reply; 109+ messages in thread From: Oleg Nesterov @ 2025-03-10 12:43 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/10, Hillf Danton wrote: > > On Mon, 10 Mar 2025 12:09:15 +0100 Oleg Nesterov > > On 03/10, Hillf Danton wrote: > > > On Sun, 9 Mar 2025 18:02:55 +0100 Oleg Nesterov > > > > > > > > So (again, in this particular case) we could apply the patch below > > > > on top of Linus's tree. > > > > > > > > So, with or without these changes, the writer should be woken up at > > > > step-03 in your scenario. > > > > > > > Fine, before checking my scenario once more, feel free to pinpoint the > > > line number where writer is woken up, with the change below applied. > > > > 381 if (wake_writer) > > ==> 382 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > > 383 if (wake_next_reader) > > 384 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > > 385 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > > 386 if (ret > 0) > > 387 file_accessed(filp); > > 388 return ret; > > > > line 382, no? > > > Yes, but how is the wait loop at line-370 broken? > > 360 } > 361 mutex_unlock(&pipe->mutex); > 362 > 363 BUG_ON(wake_writer); > 364 /* > 365 * But because we didn't read anything, at this point we can > 366 * just return directly with -ERESTARTSYS if we're interrupted, > 367 * since we've done any required wakeups and there's no need > 368 * to mark anything accessed. And we've dropped the lock. > 369 */ > 370 if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > 371 return -ERESTARTSYS; Hmm. I don't understand you, again. 
OK, once some writer writes at least one byte (this will make the pipe_empty() condition false), it wakes this reader up. If you meant something else, say, if you referred to your previous scenario, please clarify your question. Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 12:43 ` Oleg Nesterov @ 2025-03-10 23:33 ` Hillf Danton 2025-03-11 0:26 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 109+ messages in thread From: Hillf Danton @ 2025-03-10 23:33 UTC (permalink / raw) To: Oleg Nesterov Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On Mon, 10 Mar 2025 13:43:42 +0100 Oleg Nesterov > On 03/10, Hillf Danton wrote: > > On Mon, 10 Mar 2025 12:09:15 +0100 Oleg Nesterov > > > On 03/10, Hillf Danton wrote: > > > > On Sun, 9 Mar 2025 18:02:55 +0100 Oleg Nesterov > > > > > > > > > > So (again, in this particular case) we could apply the patch below > > > > > on top of Linus's tree. > > > > > > > > > > So, with or without these changes, the writer should be woken up at > > > > > step-03 in your scenario. > > > > > > > > > Fine, before checking my scenario once more, feel free to pinpoint the > > > > line number where writer is woken up, with the change below applied. > > > > > > 381 if (wake_writer) > > > ==> 382 wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > > > 383 if (wake_next_reader) > > > 384 wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); > > > 385 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > > > 386 if (ret > 0) > > > 387 file_accessed(filp); > > > 388 return ret; > > > > > > line 382, no? > > > > > Yes, but how is the wait loop at line-370 broken? > > > > 360 } > > 361 mutex_unlock(&pipe->mutex); > > 362 > > 363 BUG_ON(wake_writer); > > 364 /* > > 365 * But because we didn't read anything, at this point we can > > 366 * just return directly with -ERESTARTSYS if we're interrupted, > > 367 * since we've done any required wakeups and there's no need > > 368 * to mark anything accessed. And we've dropped the lock. 
> > 369 */ > > 370 if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0) > > 371 return -ERESTARTSYS; > > Hmm. I don't understand you, again. > > OK, once some writer writes at least one byte (this will make the > pipe_empty() condition false) and wakes this reader up. > > If you meant something else, say, if you referred to you previous > scenario, please clarify your question. > The step-03 in my scenario [1] shows a reader sleeps at line-370 after making the pipe empty, so after your change that cuts the chance for waking up writer, who will wake up the sleeping reader? Nobody. Feel free to check my scenario again. step-03 task-118766 new reader makes pipe empty sleeps [1] https://lore.kernel.org/lkml/20250307060827.3083-1-hdanton@sina.com/ ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 23:33 ` Hillf Danton @ 2025-03-11 0:26 ` Linus Torvalds 2025-03-11 6:54 ` Oleg Nesterov [not found] ` <20250311112922.3342-1-hdanton@sina.com> 2 siblings, 0 replies; 109+ messages in thread From: Linus Torvalds @ 2025-03-11 0:26 UTC (permalink / raw) To: Hillf Danton Cc: Oleg Nesterov, K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, linux-fsdevel, linux-kernel On Mon, 10 Mar 2025 at 13:34, Hillf Danton <hdanton@sina.com> wrote: > > The step-03 in my scenario [1] shows a reader sleeps at line-370 after > making the pipe empty, so after your change that cuts the chance for > waking up writer, who will wake up the sleeping reader? Nobody. But step-03 will wake the writer. And no, nobody will wake readers, because the pipe is empty. Only the next writer that adds data to the pipe should wake any readers. Note that the logic that sets "wake_writer" and "was_empty" is all protected by the pipe semaphore. So there are no races wrt figuring out "should we wake readers/writers". So I really think you need to very explicitly point to what you think the problem is. Not point to some other email. Write out all out in full and explain. Linus ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-10 23:33 ` Hillf Danton 2025-03-11 0:26 ` Linus Torvalds @ 2025-03-11 6:54 ` Oleg Nesterov [not found] ` <20250311112922.3342-1-hdanton@sina.com> 2 siblings, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-11 6:54 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/11, Hillf Danton wrote: > > On Mon, 10 Mar 2025 13:43:42 +0100 Oleg Nesterov > > > > Hmm. I don't understand you, again. > > > > OK, once some writer writes at least one byte (this will make the > > pipe_empty() condition false) and wakes this reader up. > > > > If you meant something else, say, if you referred to you previous > > scenario, please clarify your question. > > > The step-03 in my scenario [1] shows a reader sleeps at line-370 after > making the pipe empty, so after your change that cuts the chance for > waking up writer, We are circling. Once again, in this case "wake_writer" can't be true when the reader does wait_event(rd_wait), this code can be replaced with BUG_ON(wake_writer). So that change cuts nothing. It simply has no effect in this case. > who will wake up the sleeping reader? Nobody. > > Feel free to check my scenario again. > > step-03 > task-118766 new reader > makes pipe empty > sleeps > > [1] https://lore.kernel.org/lkml/20250307060827.3083-1-hdanton@sina.com/ First of all, task-118766 won't sleep unless it calls read() again. From https://lore.kernel.org/all/20250307123442.GD5963@redhat.com/ Note also that pipe_read() won't sleep if it has read even one byte. but this doesn't really matter. From https://lore.kernel.org/all/20250307112619.GA5963@redhat.com/ > step-03 > task-118766 new reader > makes pipe empty > sleeps but since the pipe was full, this reader should wake up the writer task-118768 once it updates the tail the 1st time during the read. 
> step-04 > task-118740 reader > sleeps as pipe is empty this is fine. > [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] 118768 should be woken at step 3 Now, when the writer task-118768 does write() it will wake the reader, task-118740. Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
[parent not found: <20250311112922.3342-1-hdanton@sina.com>]
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full [not found] ` <20250311112922.3342-1-hdanton@sina.com> @ 2025-03-11 11:53 ` Oleg Nesterov 0 siblings, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-11 11:53 UTC (permalink / raw) To: Hillf Danton Cc: Linus Torvalds, K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, linux-fsdevel, linux-kernel On 03/11, Hillf Danton wrote: > > On Mon, 10 Mar 2025 14:26:17 -1000 Linus Torvalds wrote: > > On Mon, 10 Mar 2025 at 13:34, Hillf Danton <hdanton@sina.com> wrote: > > > > > > The step-03 in my scenario [1] shows a reader sleeps at line-370 after > > > making the pipe empty, so after your change that cuts the chance for > > > waking up writer, who will wake up the sleeping reader? Nobody. > > > > But step-03 will wake the writer. > > > > And no, nobody will wake readers, because the pipe is empty. Only the > > next writer that adds data to the pipe should wake any readers. > > > > Note that the logic that sets "wake_writer" and "was_empty" is all > > protected by the pipe semaphore. So there are no races wrt figuring > > out "should we wake readers/writers". > > > > So I really think you need to very explicitly point to what you think > > the problem is. Not point to some other email. Write out all out in > > full and explain. > > > In the mainline tree, conditional wakeup [2] exists before a pipe writer > takes a nap, so scenario can be constructed based on the one in commit > 3d252160b818 to make pipe writer sleep with nobody woken up. > > step-00 > pipe->head = 36 > pipe->tail = 36 > > step-01 > task-118762 is a writer > pipe->head++; > wakes up task-118740 and task-118768 > > step-02 > task-118768 is a writer > makes pipe full; > sleeps without waking up any reader as > pipe was not empty after step-01 > > Conditional wakeup also exists on the reader side [3], but Oleg cut it off [4]. 
> > --- a/fs/pipe.c > +++ b/fs/pipe.c > @@ -360,27 +360,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to) > } > mutex_unlock(&pipe->mutex); > > - /* > - * We only get here if we didn't actually read anything. > - * > - * However, we could have seen (and removed) a zero-sized > - * pipe buffer, and might have made space in the buffers > - * that way. > - * > - * You can't make zero-sized pipe buffers by doing an empty > - * write (not even in packet mode), but they can happen if > - * the writer gets an EFAULT when trying to fill a buffer > - * that already got allocated and inserted in the buffer > - * array. > - * > - * So we still need to wake up any pending writers in the > - * _very_ unlikely case that the pipe was full, but we got > - * no data. > - */ > - if (unlikely(wake_writer)) > - wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); > - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); > - > + BUG_ON(wake_writer); > /* > * But because we didn't read anything, at this point we can > * just return directly with -ERESTARTSYS if we're interrupted, > > > step-03 > task-118740 is reader > makes pipe empty > sleeps with no writer woken up > > After step-03, both reader(task-118740) and writer (task-118768) sleep > waiting for each other, with Oleg's change. Well. I have already tried to explain this at least twice :/ Prateek too. After step-03 task-118740 won't sleep. pipe_read() won't sleep if it has read even one byte. Since the pipe was full and this reader makes it empty, "wake_writer" must be true after the main loop before return from pipe_read(). This means that the reader(task-118740) will wake the writer(task-118768) before it returns from pipe_read(). Oleg. > PS Oleg, given no seperate reply to you, check the above scenario instead please. 
> > [2] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pipe.c#n576 > [3] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pipe.c#n381 > [4] https://lore.kernel.org/lkml/20250309170254.GA15139@redhat.com/ > ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full 2025-03-07 6:08 ` Hillf Danton 2025-03-07 6:24 ` K Prateek Nayak @ 2025-03-07 11:26 ` Oleg Nesterov 1 sibling, 0 replies; 109+ messages in thread From: Oleg Nesterov @ 2025-03-07 11:26 UTC (permalink / raw) To: Hillf Danton Cc: K Prateek Nayak, Mateusz Guzik, Sapkal, Swapnil, Linus Torvalds, linux-fsdevel, linux-kernel On 03/07, Hillf Danton wrote: > > On Thu, 6 Mar 2025 10:30:21 +0100 Oleg Nesterov <oleg@redhat.com> > > > > > > > > Yes, but in this case pipe_full() == true is correct, this writer can > > > > safely sleep. > > > > > > > No, because no reader is woken up before sleep to make pipe not full. > > > > Why the reader should be woken before this writer sleeps? Why the reader > > should be woken at all in this case (when pipe is full again) ? > > > "to make pipe not full" failed to prevent you asking questions like this one. Hmm. I don't understand your "prevent you asking questions" reply. If the pipe was full we do not need to wake the reader(s), the reader can only sleep if pipe_empty() == true. > > We certainly can't understand each other. Yes. > step-00 > pipe->head = 36 > pipe->tail = 36 > after 3d252160b818 > > step-01 > task-118762 writer > pipe->head++; > wakes up task-118740 and task-118768 > > step-02 > task-118768 writer > makes pipe full; > sleeps without waking up any reader as > pipe was not empty after step-01 > > step-03 > task-118766 new reader > makes pipe empty > sleeps but since the pipe was full, this reader should wake up the writer task-118768 once it updates the tail the 1st time during the read. > step-04 > task-118740 reader > sleeps as pipe is empty this is fine. > [ Tasks 118740 and 118768 can then indefinitely wait on each other. ] 118768 should be woken at step 3 ? Oleg. ^ permalink raw reply [flat|nested] 109+ messages in thread
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-24  9:26     ` Sapkal, Swapnil
  2025-02-24 14:24       ` Oleg Nesterov
@ 2025-02-27 12:50       ` Oleg Nesterov
  2025-02-27 13:52         ` Oleg Nesterov
  2025-02-27 15:59         ` Mateusz Guzik
  1 sibling, 2 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-27 12:50 UTC (permalink / raw)
To: Sapkal, Swapnil, Mateusz Guzik, Linus Torvalds
Cc: Manfred Spraul, Christian Brauner, David Howells, WangYuli,
    linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal,
    Neeraj.Upadhyay

Hmm...

Suppose that pipe is full, a writer W tries to write a single byte
and sleeps on pipe->wr_wait.

A reader reads PAGE_SIZE bytes, updates pipe->tail, and wakes W up.

But, before the woken W takes pipe->mutex, another writer comes and
writes 1 byte. This updates ->head and makes pipe_full() true again.

Now, W could happily merge its "small" write into the last buffer,
but it will sleep again, despite the fact the last buffer has room
for 4095 bytes.

Sapkal, I don't think this can explain the hang, receiver()->read()
should wake this writer later anyway. But could you please retest
with the patch below?

Thanks,

Oleg.
---

diff --git a/fs/pipe.c b/fs/pipe.c
index b0641f75b1ba..222881559c30 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -455,6 +455,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * page-aligns the rest of the writes for large writes
 	 * spanning multiple pages.
 	 */
+again:
 	head = pipe->head;
 	was_empty = pipe_empty(head, pipe->tail);
 	chars = total_len & (PAGE_SIZE-1);
@@ -559,8 +560,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
 		mutex_lock(&pipe->mutex);
-		was_empty = pipe_empty(pipe->head, pipe->tail);
 		wake_next_writer = true;
+		goto again;
 	}
 out:
 	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-27 12:50       ` Oleg Nesterov
@ 2025-02-27 13:52         ` Oleg Nesterov
  1 sibling, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-27 13:52 UTC (permalink / raw)
To: Sapkal, Swapnil, Mateusz Guzik, Linus Torvalds
Cc: Manfred Spraul, Christian Brauner, David Howells, WangYuli,
    linux-fsdevel, linux-kernel, K Prateek Nayak, Shenoy, Gautham Ranjal,
    Neeraj.Upadhyay

Forgot to mention...

Even if this patch is likely "offtopic", it probably makes sense.
However, it is "incomplete" in that there are other scenarios.

Again, the pipe is full. A writer W1 tries to write 4096 bytes and
sleeps. A writer W2 tries to write 1 byte and sleeps too.

A reader reads 4096 bytes, updates pipe->tail and wakes W1.

Another writer comes, writes 1 byte and "steals" the buffer released
by the reader.

W1 sleeps again, this is correct. But nobody will wake W2 which could
succeed.

This (and the previous more simple scenario) means that

	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
		wake_next_writer = false;

before return in anon_pipe_write() is not really right in this sense.

Oleg.

On 02/27, Oleg Nesterov wrote:
>
> Hmm...
>
> Suppose that pipe is full, a writer W tries to write a single byte
> and sleeps on pipe->wr_wait.
>
> A reader reads PAGE_SIZE bytes, updates pipe->tail, and wakes W up.
>
> But, before the woken W takes pipe->mutex, another writer comes and
> writes 1 byte. This updates ->head and makes pipe_full() true again.
>
> Now, W could happily merge its "small" write into the last buffer,
> but it will sleep again, despite the fact the last buffer has room
> for 4095 bytes.
>
> Sapkal, I don't think this can explain the hang, receiver()->read()
> should wake this writer later anyway. But could you please retest
> with the patch below?
>
> Thanks,
>
> Oleg.
> ---
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b0641f75b1ba..222881559c30 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -455,6 +455,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * page-aligns the rest of the writes for large writes
>  	 * spanning multiple pages.
>  	 */
> +again:
>  	head = pipe->head;
>  	was_empty = pipe_empty(head, pipe->tail);
>  	chars = total_len & (PAGE_SIZE-1);
> @@ -559,8 +560,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
>  		mutex_lock(&pipe->mutex);
> -		was_empty = pipe_empty(pipe->head, pipe->tail);
>  		wake_next_writer = true;
> +		goto again;
>  	}
>  out:
>  	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-27 12:50       ` Oleg Nesterov
  2025-02-27 13:52         ` Oleg Nesterov
@ 2025-02-27 15:59         ` Mateusz Guzik
  2025-02-27 16:28           ` Oleg Nesterov
  1 sibling, 1 reply; 109+ messages in thread
From: Mateusz Guzik @ 2025-02-27 15:59 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Sapkal, Swapnil, Linus Torvalds, Manfred Spraul, Christian Brauner,
    David Howells, WangYuli, linux-fsdevel, linux-kernel,
    K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

On Thu, Feb 27, 2025 at 1:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
>
> Hmm...
>
> Suppose that pipe is full, a writer W tries to write a single byte
> and sleeps on pipe->wr_wait.
>
> A reader reads PAGE_SIZE bytes, updates pipe->tail, and wakes W up.
>
> But, before the woken W takes pipe->mutex, another writer comes and
> writes 1 byte. This updates ->head and makes pipe_full() true again.
>
> Now, W could happily merge its "small" write into the last buffer,
> but it will sleep again, despite the fact the last buffer has room
> for 4095 bytes.
>
> Sapkal, I don't think this can explain the hang, receiver()->read()
> should wake this writer later anyway. But could you please retest
> with the patch below?
>
> Thanks,
>
> Oleg.
> ---
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b0641f75b1ba..222881559c30 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -455,6 +455,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * page-aligns the rest of the writes for large writes
>  	 * spanning multiple pages.
>  	 */
> +again:
>  	head = pipe->head;
>  	was_empty = pipe_empty(head, pipe->tail);
>  	chars = total_len & (PAGE_SIZE-1);
> @@ -559,8 +560,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
>  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
>  		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
>  		mutex_lock(&pipe->mutex);
> -		was_empty = pipe_empty(pipe->head, pipe->tail);
>  		wake_next_writer = true;
> +		goto again;
>  	}
>  out:
>  	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
>

I think this is buggy.

You get wakeups also when the last reader goes away. The for loop you
are jumping out of makes sure to check for the condition, same for the
first mutex acquire. With this goto you can get a successful write
instead of getting SIGPIPE. iow this should goto few lines higher.

I am not sure about the return value. The for loop bumps ret with each
write, but the section you are jumping to overwrites it. So if the
thread wrote some data within the loop, went to sleep and woke up to a
state where it can do a write in the section you are jumping to, it is
going to return the wrong number of bytes. Unless I'm misreading
something.

However, I do think something may be going on with the "split" ops,
which is why I suggested going from 100 bytes where the bug was
encountered to 128 for testing purposes. If that cleared it, that
would be nice for sure. :>

-- 
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full
  2025-02-27 15:59         ` Mateusz Guzik
@ 2025-02-27 16:28           ` Oleg Nesterov
  0 siblings, 0 replies; 109+ messages in thread
From: Oleg Nesterov @ 2025-02-27 16:28 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Sapkal, Swapnil, Linus Torvalds, Manfred Spraul, Christian Brauner,
    David Howells, WangYuli, linux-fsdevel, linux-kernel,
    K Prateek Nayak, Shenoy, Gautham Ranjal, Neeraj.Upadhyay

On 02/27, Mateusz Guzik wrote:
>
> On Thu, Feb 27, 2025 at 1:51 PM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Sapkal, I don't think this can explain the hang, receiver()->read()
> > should wake this writer later anyway. But could you please retest
> > with the patch below?
> >
> > Thanks,
> >
> > Oleg.
> > ---
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index b0641f75b1ba..222881559c30 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -455,6 +455,7 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >  	 * page-aligns the rest of the writes for large writes
> >  	 * spanning multiple pages.
> >  	 */
> > +again:
> >  	head = pipe->head;
> >  	was_empty = pipe_empty(head, pipe->tail);
> >  	chars = total_len & (PAGE_SIZE-1);
> > @@ -559,8 +560,8 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> >  		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> >  		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
> >  		mutex_lock(&pipe->mutex);
> > -		was_empty = pipe_empty(pipe->head, pipe->tail);
> >  		wake_next_writer = true;
> > +		goto again;
> >  	}
> >  out:
> >  	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> >
>
> I think this is buggy.
>
> You get wakeups also when the last reader goes away. The for loop you
> are jumping out of makes sure to check for the condition, same for the
> first mutex acquire. With this goto you can get a successful write
> instead of getting SIGPIPE. iow this should goto few lines higher.

Yes, yes, and then we need to remove another pipe->readers check in
the main loop.

> I am not sure about the return value. The for loop bumps ret with each
> write, but the section you are jumping to overwrites it.

Ah, yes, thanks, I missed that.

OK, I'll make another one tomorrow, I need to run away. Until then,
it would be nice to test this patch with hackbench anyway.

> However, I do think something may be going on with the "split" ops,
> which is why I suggested going from 100 bytes where the bug was
> encountered to 128 for testing purposes. If that cleared it, that
> would be nice for sure. :>

Yes, but note that the same scenario can happen with 128 bytes as well.
It doesn't really matter how many bytes < PAGE_SIZE the sleeping writer
needs to write, another writer can steal the buffer released by the
last reader in any case.

Thanks!

Oleg.
end of thread, other threads:[~2025-03-11 11:54 UTC | newest]
Thread overview: 109+ messages
2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
2025-01-02 16:20 ` WangYuli
2025-01-02 16:46 ` Oleg Nesterov
2025-01-04 8:42 ` Christian Brauner
2025-01-31 9:49 ` K Prateek Nayak
2025-01-31 13:23 ` Oleg Nesterov
2025-01-31 20:06 ` Linus Torvalds
2025-02-02 17:01 ` Oleg Nesterov
2025-02-02 18:39 ` Linus Torvalds
2025-02-02 19:32 ` Oleg Nesterov
2025-02-04 11:17 ` Christian Brauner
2025-02-03 9:05 ` K Prateek Nayak
2025-02-04 13:49 ` Oleg Nesterov
2025-02-24 9:26 ` Sapkal, Swapnil
2025-02-24 14:24 ` Oleg Nesterov
2025-02-24 18:36 ` Linus Torvalds
2025-02-25 14:26 ` Oleg Nesterov
2025-02-25 11:57 ` Oleg Nesterov
2025-02-26 5:55 ` Sapkal, Swapnil
2025-02-26 11:38 ` Oleg Nesterov
2025-02-26 17:56 ` Sapkal, Swapnil
2025-02-26 18:12 ` Oleg Nesterov
2025-03-03 13:00 ` Alexey Gladkov
2025-03-03 15:46 ` K Prateek Nayak
2025-03-03 17:18 ` Alexey Gladkov
2025-02-26 13:18 ` Mateusz Guzik
2025-02-26 13:21 ` Mateusz Guzik
2025-02-26 17:16 ` Oleg Nesterov
2025-02-27 16:18 ` Sapkal, Swapnil
2025-02-27 16:34 ` Mateusz Guzik
2025-02-27 21:12 ` Oleg Nesterov
2025-02-28 5:58 ` Sapkal, Swapnil
2025-02-28 14:30 ` Oleg Nesterov
2025-02-28 16:33 ` Oleg Nesterov
2025-03-03 9:46 ` Sapkal, Swapnil
2025-03-03 14:37 ` Mateusz Guzik
2025-03-03 14:51 ` Mateusz Guzik
2025-03-03 15:31 ` K Prateek Nayak
2025-03-03 17:54 ` Mateusz Guzik
2025-03-03 18:11 ` Linus Torvalds
2025-03-03 18:33 ` Mateusz Guzik
2025-03-03 18:55 ` Linus Torvalds
2025-03-03 19:06 ` Mateusz Guzik
2025-03-03 20:27 ` Oleg Nesterov
2025-03-03 20:46 ` Linus Torvalds
2025-03-04 5:31 ` K Prateek Nayak
2025-03-04 6:32 ` Linus Torvalds
2025-03-04 12:54 ` Oleg Nesterov
2025-03-04 13:25 ` Oleg Nesterov
2025-03-04 18:28 ` Linus Torvalds
2025-03-04 22:11 ` Oleg Nesterov
2025-03-05 4:40 ` K Prateek Nayak
2025-03-05 4:52 ` Linus Torvalds
2025-03-04 13:51 ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak
2025-03-04 18:36 ` Alexey Gladkov
2025-03-04 19:03 ` Linus Torvalds
2025-03-05 15:31 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes
2025-03-05 16:50 ` Linus Torvalds
2025-03-06 9:48 ` Rasmus Villemoes
2025-03-06 14:42 ` Rasmus Villemoes
2025-03-05 16:40 ` Linus Torvalds
2025-03-06 8:35 ` Rasmus Villemoes
2025-03-06 17:59 ` Linus Torvalds
2025-03-06 9:28 ` Rasmus Villemoes
2025-03-06 11:39 ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak
2025-03-06 11:39 ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak
2025-03-06 12:28 ` Oleg Nesterov
2025-03-06 15:26 ` K Prateek Nayak
2025-03-06 11:39 ` [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() K Prateek Nayak
2025-03-06 11:39 ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak
2025-03-06 12:32 ` Oleg Nesterov
2025-03-06 12:41 ` Oleg Nesterov
2025-03-06 15:33 ` K Prateek Nayak
2025-03-06 18:04 ` Linus Torvalds
2025-03-06 14:27 ` Rasmus Villemoes
2025-03-03 18:32 ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak
2025-03-04 5:22 ` K Prateek Nayak
2025-03-03 16:49 ` Oleg Nesterov
2025-03-04 5:06 ` Hillf Danton
2025-03-04 5:35 ` K Prateek Nayak
2025-03-04 10:29 ` Hillf Danton
2025-03-04 12:34 ` Oleg Nesterov
2025-03-04 23:35 ` Hillf Danton
2025-03-04 23:49 ` Oleg Nesterov
2025-03-05 4:56 ` Hillf Danton
2025-03-05 11:44 ` Oleg Nesterov
2025-03-05 22:46 ` Hillf Danton
2025-03-06 9:30 ` Oleg Nesterov
2025-03-07 6:08 ` Hillf Danton
2025-03-07 6:24 ` K Prateek Nayak
2025-03-07 10:46 ` Hillf Danton
2025-03-07 11:29 ` Oleg Nesterov
2025-03-07 12:34 ` Oleg Nesterov
2025-03-07 23:56 ` Hillf Danton
2025-03-09 14:01 ` K Prateek Nayak
2025-03-09 17:02 ` Oleg Nesterov
2025-03-10 10:49 ` Hillf Danton
2025-03-10 11:09 ` Oleg Nesterov
2025-03-10 11:37 ` Hillf Danton
2025-03-10 12:43 ` Oleg Nesterov
2025-03-10 23:33 ` Hillf Danton
2025-03-11 0:26 ` Linus Torvalds
2025-03-11 6:54 ` Oleg Nesterov
[not found] ` <20250311112922.3342-1-hdanton@sina.com>
2025-03-11 11:53 ` Oleg Nesterov
2025-03-07 11:26 ` Oleg Nesterov
2025-02-27 12:50 ` Oleg Nesterov
2025-02-27 13:52 ` Oleg Nesterov
2025-02-27 15:59 ` Mateusz Guzik
2025-02-27 16:28 ` Oleg Nesterov