All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexey Gladkov <legion@kernel.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Swapnil Sapkal <swapnil.sapkal@amd.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jan Kara <jack@suse.cz>, Mateusz Guzik <mjguzik@gmail.com>,
	Manfred Spraul <manfred@colorfullife.com>,
	David Howells <dhowells@redhat.com>,
	WangYuli <wangyuli@uniontech.com>,
	Hillf Danton <hdanton@sina.com>,
	"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Neeraj.Upadhyay@amd.com, Ananth.narayan@amd.com
Subject: Re: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex
Date: Tue, 4 Mar 2025 19:36:24 +0100	[thread overview]
Message-ID: <Z8dIKGCSRWqUqAEI@example.org> (raw)
In-Reply-To: <20250304135138.1278-1-kprateek.nayak@amd.com>

On Tue, Mar 04, 2025 at 01:51:38PM +0000, K Prateek Nayak wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
> and "pipe->tail" outside of "pipe->mutex" critical section. When the
> head and the tail are read individually in that order, there is a window
> for interruption between the two reads in which both the head and the
> tail can be updated by concurrent readers and writers.
> 
> One of the problematic scenarios observed with hackbench running
> multiple groups on a large server on a particular pipe inode is as
> follows:
> 
>     pipe->head = 36
>     pipe->tail = 36
> 
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wakes up: pipe not full*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next reader 118740*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next writer 118768*
> 
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: *writer wakes up*
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
>     ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
> 
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  *reader wakes up: pipe is not empty*
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  tail: 36 -> 37 [head = 37]
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *pipe is empty; wakeup writer 118768*
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *sleeps*
> 
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *New writer comes in*
>     hackbench-118766  [185] .....  1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *wakes up reader 118766*
> 
>     hackbench-118740  [185] .....  1029.550598: pipe_read:  *reader wakes up; pipe not empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  tail: 37 -> 38 [head: 38]
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *pipe is empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *reader sleeps; wakeup writer 118768*
> 
>     ... CPU 206 switches back to writer
>     hackbench-118768  [206] .....  1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
>     hackbench-118768  [206] .....  1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
>     hackbench-118768  [206] .....  1029.550601: pipe_write: *writer goes back to sleep*
> 
>     [ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
> 
> The unsigned arithmetic in pipe_occupancy() wraps around when
> "pipe->tail > pipe->head" leading to pipe_full() returning true despite
> the pipe being empty.
> 
> The case of genuine wraparound of "pipe->head" is handled since pipe
> buffer has data allowing readers to make progress until the pipe->tail
> wraps too after which the reader will wakeup a sleeping writer, however,
> mistaking the pipe to be full when it is in fact empty can lead to
> readers and writers waiting on each other indefinitely.
> 
> This issue became more problematic and surfaced as a hang in hackbench
> after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
> the writer if the pipe is still full") significantly reduced the number
> of spurious wakeups of writers that had previously helped mask the
> issue.
> 
> To avoid missing any updates between the reads of "pipe->head" and
> "pipe->write", unionize the two with a single unsigned long
> "pipe->head_tail" member that can be loaded atomically.
> 
> Using "pipe->head_tail" to read the head and the tail ensures the
> lockless checks do not miss any updates to the head or the tail and
> since those two are only updated under "pipe->mutex", it ensures that
> the head is always ahead of, or equal to the tail resulting in correct
> calculations.
> 
>   [ prateek: commit log, testing on x86 platforms. ]
> 
> Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
> Reported-by: Alexey Gladkov <legion@kernel.org>
> Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
> Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

With this patch, I'm also not reproducing the problem. Thanks!

> ---
> Changes are based on:
> 
>   git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe
> 
> at commit commit ee5eda8ea595 ("pipe: change pipe_write() to never add a
> zero-sized buffer") but also applies cleanly on top of v6.14-rc5.
> 
> The diff from Linus is kept as is except for removing the whitespaces in
> front of the typedef that checkpatch complained about (the warning on
> usage of typedef itself has been ignored since I could not think of a
> better alternative other than #ifdef hackery in pipe_inode_info and the
> newly introduced pipe_index union.) and the suggestion from Oleg to
> explicitly initialize the "head_tail" with:
> 
>     union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }
> 
> I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for
> the ring, not cursor and length") for the "Fixes:" tag since pipe_poll()
> added:
> 
>     unsigned int head = READ_ONCE(pipe->head);
>     unsigned int tail = READ_ONCE(pipe->tail);
> 
>     poll_wait(filp, &pipe->wait, wait);
> 
>     BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
> 
> and the race described can trigger that BUG_ON() but as Linus pointed
> out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for
> basic pipe IO") is probably the one that can cause the writers to
> sleep on empty pipe since the pipe_wait() used prior did not drop the
> pipe lock until it called schedule() and prepare_to_wait() was called
> before pipe_unlock() ensuring no races.
> 
> [1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/
> 
> Please let me know if the patch requires any modifications and I'll jump
> right on it. The changes have been tested on both a 5th Generation AMD
> EPYC system and on a dual socket Intel Emerald Rapids system with
> multiple thousand iterations of hackbench for small and large loop
> counts. Thanks a ton to Swapnil for all the help.
> ---
>  fs/pipe.c                 | 19 ++++++++-----------
>  include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b0641f75b1ba..780990f307ab 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_readable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int writers = READ_ONCE(pipe->writers);
>  
> -	return !pipe_empty(head, tail) || !writers;
> +	return !pipe_empty(idx.head, idx.tail) || !writers;
>  }
>  
>  static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
> @@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>  
> -	return !pipe_full(head, tail, max_usage) ||
> +	return !pipe_full(idx.head, idx.tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);
>  }
>  
> @@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait)
>  {
>  	__poll_t mask;
>  	struct pipe_inode_info *pipe = filp->private_data;
> -	unsigned int head, tail;
> +	union pipe_index idx;
>  
>  	/* Epoll has some historical nasty semantics, this enables them */
>  	WRITE_ONCE(pipe->poll_usage, true);
> @@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait)
>  	 * if something changes and you got it wrong, the poll
>  	 * table entry will wake you up and fix it.
>  	 */
> -	head = READ_ONCE(pipe->head);
> -	tail = READ_ONCE(pipe->tail);
> +	idx.head_tail = READ_ONCE(pipe->head_tail);
>  
>  	mask = 0;
>  	if (filp->f_mode & FMODE_READ) {
> -		if (!pipe_empty(head, tail))
> +		if (!pipe_empty(idx.head, idx.tail))
>  			mask |= EPOLLIN | EPOLLRDNORM;
>  		if (!pipe->writers && filp->f_pipe != pipe->w_counter)
>  			mask |= EPOLLHUP;
>  	}
>  
>  	if (filp->f_mode & FMODE_WRITE) {
> -		if (!pipe_full(head, tail, pipe->max_usage))
> +		if (!pipe_full(idx.head, idx.tail, pipe->max_usage))
>  			mask |= EPOLLOUT | EPOLLWRNORM;
>  		/*
>  		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 8ff23bf5a819..3cc4f8eab853 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -31,6 +31,33 @@ struct pipe_buffer {
>  	unsigned long private;
>  };
>  
> +/*
> + * Really only alpha needs 32-bit fields, but
> + * might as well do it for 64-bit architectures
> + * since that's what we've historically done,
> + * and it makes 'head_tail' always be a simple
> + * 'unsigned long'.
> + */
> +#ifdef CONFIG_64BIT
> +typedef unsigned int pipe_index_t;
> +#else
> +typedef unsigned short pipe_index_t;
> +#endif
> +
> +/*
> + * We have to declare this outside 'struct pipe_inode_info',
> + * but then we can't use 'union pipe_index' for an anonymous
> + * union, so we end up having to duplicate this declaration
> + * below. Annoying.
> + */
> +union pipe_index {
> +	unsigned long head_tail;
> +	struct {
> +		pipe_index_t head;
> +		pipe_index_t tail;
> +	};
> +};
> +
>  /**
>   *	struct pipe_inode_info - a linux kernel pipe
>   *	@mutex: mutex protecting the whole thing
> @@ -58,8 +85,16 @@ struct pipe_buffer {
>  struct pipe_inode_info {
>  	struct mutex mutex;
>  	wait_queue_head_t rd_wait, wr_wait;
> -	unsigned int head;
> -	unsigned int tail;
> +
> +	/* This has to match the 'union pipe_index' above */
> +	union {
> +		unsigned long head_tail;
> +		struct {
> +			pipe_index_t head;
> +			pipe_index_t tail;
> +		};
> +	};
> +
>  	unsigned int max_usage;
>  	unsigned int ring_size;
>  	unsigned int nr_accounted;
> 
> base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd
> -- 
> 2.34.1
> 

-- 
Rgrds, legion


  reply	other threads:[~2025-03-04 18:36 UTC|newest]

Thread overview: 109+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
2025-01-02 16:20 ` WangYuli
2025-01-02 16:46   ` Oleg Nesterov
2025-01-04  8:42 ` Christian Brauner
2025-01-31  9:49 ` K Prateek Nayak
2025-01-31 13:23   ` Oleg Nesterov
2025-01-31 20:06   ` Linus Torvalds
2025-02-02 17:01     ` Oleg Nesterov
2025-02-02 18:39       ` Linus Torvalds
2025-02-02 19:32         ` Oleg Nesterov
2025-02-04 11:17         ` Christian Brauner
2025-02-03  9:05       ` K Prateek Nayak
2025-02-04 13:49         ` Oleg Nesterov
2025-02-24  9:26 ` Sapkal, Swapnil
2025-02-24 14:24   ` Oleg Nesterov
2025-02-24 18:36     ` Linus Torvalds
2025-02-25 14:26       ` Oleg Nesterov
2025-02-25 11:57     ` Oleg Nesterov
2025-02-26  5:55       ` Sapkal, Swapnil
2025-02-26 11:38         ` Oleg Nesterov
2025-02-26 17:56           ` Sapkal, Swapnil
2025-02-26 18:12             ` Oleg Nesterov
2025-03-03 13:00       ` Alexey Gladkov
2025-03-03 15:46         ` K Prateek Nayak
2025-03-03 17:18           ` Alexey Gladkov
2025-02-26 13:18     ` Mateusz Guzik
2025-02-26 13:21       ` Mateusz Guzik
2025-02-26 17:16         ` Oleg Nesterov
2025-02-27 16:18       ` Sapkal, Swapnil
2025-02-27 16:34         ` Mateusz Guzik
2025-02-27 21:12         ` Oleg Nesterov
2025-02-28  5:58           ` Sapkal, Swapnil
2025-02-28 14:30             ` Oleg Nesterov
2025-02-28 16:33               ` Oleg Nesterov
2025-03-03  9:46                 ` Sapkal, Swapnil
2025-03-03 14:37                   ` Mateusz Guzik
2025-03-03 14:51                     ` Mateusz Guzik
2025-03-03 15:31                       ` K Prateek Nayak
2025-03-03 17:54                         ` Mateusz Guzik
2025-03-03 18:11                           ` Linus Torvalds
2025-03-03 18:33                             ` Mateusz Guzik
2025-03-03 18:55                               ` Linus Torvalds
2025-03-03 19:06                                 ` Mateusz Guzik
2025-03-03 20:27                                 ` Oleg Nesterov
2025-03-03 20:46                                   ` Linus Torvalds
2025-03-04  5:31                                     ` K Prateek Nayak
2025-03-04  6:32                                       ` Linus Torvalds
2025-03-04 12:54                                     ` Oleg Nesterov
2025-03-04 13:25                                       ` Oleg Nesterov
2025-03-04 18:28                                       ` Linus Torvalds
2025-03-04 22:11                                         ` Oleg Nesterov
2025-03-05  4:40                                         ` K Prateek Nayak
2025-03-05  4:52                                           ` Linus Torvalds
2025-03-04 13:51                                     ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak
2025-03-04 18:36                                       ` Alexey Gladkov [this message]
2025-03-04 19:03                                       ` Linus Torvalds
2025-03-05 15:31                                     ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes
2025-03-05 16:50                                       ` Linus Torvalds
2025-03-06  9:48                                         ` Rasmus Villemoes
2025-03-06 14:42                                           ` Rasmus Villemoes
2025-03-05 16:40                                     ` Linus Torvalds
2025-03-06  8:35                                       ` Rasmus Villemoes
2025-03-06 17:59                                         ` Linus Torvalds
2025-03-06  9:28                                       ` Rasmus Villemoes
2025-03-06 11:39                                       ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak
2025-03-06 12:28                                           ` Oleg Nesterov
2025-03-06 15:26                                             ` K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak
2025-03-06 12:32                                           ` Oleg Nesterov
2025-03-06 12:41                                             ` Oleg Nesterov
2025-03-06 15:33                                               ` K Prateek Nayak
2025-03-06 18:04                                                 ` Linus Torvalds
2025-03-06 14:27                                             ` Rasmus Villemoes
2025-03-03 18:32                           ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak
2025-03-04  5:22                             ` K Prateek Nayak
2025-03-03 16:49                   ` Oleg Nesterov
2025-03-04  5:06                   ` Hillf Danton
2025-03-04  5:35                     ` K Prateek Nayak
2025-03-04 10:29                       ` Hillf Danton
2025-03-04 12:34                         ` Oleg Nesterov
2025-03-04 23:35                           ` Hillf Danton
2025-03-04 23:49                             ` Oleg Nesterov
2025-03-05  4:56                               ` Hillf Danton
2025-03-05 11:44                                 ` Oleg Nesterov
2025-03-05 22:46                                   ` Hillf Danton
2025-03-06  9:30                                     ` Oleg Nesterov
2025-03-07  6:08                                       ` Hillf Danton
2025-03-07  6:24                                         ` K Prateek Nayak
2025-03-07 10:46                                           ` Hillf Danton
2025-03-07 11:29                                             ` Oleg Nesterov
2025-03-07 12:34                                               ` Oleg Nesterov
2025-03-07 23:56                                                 ` Hillf Danton
2025-03-09 14:01                                                   ` K Prateek Nayak
2025-03-09 17:02                                                   ` Oleg Nesterov
2025-03-10 10:49                                                     ` Hillf Danton
2025-03-10 11:09                                                       ` Oleg Nesterov
2025-03-10 11:37                                                         ` Hillf Danton
2025-03-10 12:43                                                           ` Oleg Nesterov
2025-03-10 23:33                                                             ` Hillf Danton
2025-03-11  0:26                                                               ` Linus Torvalds
2025-03-11  6:54                                                               ` Oleg Nesterov
     [not found]                                                               ` <20250311112922.3342-1-hdanton@sina.com>
2025-03-11 11:53                                                                 ` Oleg Nesterov
2025-03-07 11:26                                         ` Oleg Nesterov
2025-02-27 12:50   ` Oleg Nesterov
2025-02-27 13:52     ` Oleg Nesterov
2025-02-27 15:59     ` Mateusz Guzik
2025-02-27 16:28       ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8dIKGCSRWqUqAEI@example.org \
    --to=legion@kernel.org \
    --cc=Ananth.narayan@amd.com \
    --cc=Neeraj.Upadhyay@amd.com \
    --cc=brauner@kernel.org \
    --cc=dhowells@redhat.com \
    --cc=gautham.shenoy@amd.com \
    --cc=hdanton@sina.com \
    --cc=jack@suse.cz \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=manfred@colorfullife.com \
    --cc=mjguzik@gmail.com \
    --cc=oleg@redhat.com \
    --cc=swapnil.sapkal@amd.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wangyuli@uniontech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.