linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alexey Gladkov <legion@kernel.org>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Swapnil Sapkal <swapnil.sapkal@amd.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jan Kara <jack@suse.cz>, Mateusz Guzik <mjguzik@gmail.com>,
	Manfred Spraul <manfred@colorfullife.com>,
	David Howells <dhowells@redhat.com>,
	WangYuli <wangyuli@uniontech.com>,
	Hillf Danton <hdanton@sina.com>,
	"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Neeraj.Upadhyay@amd.com, Ananth.narayan@amd.com
Subject: Re: [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex
Date: Tue, 4 Mar 2025 19:36:24 +0100	[thread overview]
Message-ID: <Z8dIKGCSRWqUqAEI@example.org> (raw)
In-Reply-To: <20250304135138.1278-1-kprateek.nayak@amd.com>

On Tue, Mar 04, 2025 at 01:51:38PM +0000, K Prateek Nayak wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
> and "pipe->tail" outside of "pipe->mutex" critical section. When the
> head and the tail are read individually in that order, there is a window
> for interruption between the two reads in which both the head and the
> tail can be updated by concurrent readers and writers.
> 
> One of the problematic scenarios observed with hackbench running
> multiple groups on a large server on a particular pipe inode is as
> follows:
> 
>     pipe->head = 36
>     pipe->tail = 36
> 
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wakes up: pipe not full*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next reader 118740*
>     hackbench-118762  [057] .....  1029.550548: pipe_write: *wake up next writer 118768*
> 
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: *writer wakes up*
>     hackbench-118768  [206] .....  1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
>     ... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
> 
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  *reader wakes up: pipe is not empty*
>     hackbench-118740  [057] .....  1029.550558: pipe_read:  tail: 36 -> 37 [head = 37]
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *pipe is empty; wakeup writer 118768*
>     hackbench-118740  [057] .....  1029.550559: pipe_read:  *sleeps*
> 
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *New writer comes in*
>     hackbench-118766  [185] .....  1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
>     hackbench-118766  [185] .....  1029.550592: pipe_write: *wakes up reader 118766*
> 
>     hackbench-118740  [185] .....  1029.550598: pipe_read:  *reader wakes up; pipe not empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  tail: 37 -> 38 [head: 38]
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *pipe is empty*
>     hackbench-118740  [185] .....  1029.550599: pipe_read:  *reader sleeps; wakeup writer 118768*
> 
>     ... CPU 206 switches back to writer
>     hackbench-118768  [206] .....  1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
>     hackbench-118768  [206] .....  1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
>     hackbench-118768  [206] .....  1029.550601: pipe_write: *writer goes back to sleep*
> 
>     [ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
> 
> The unsigned arithmetic in pipe_occupancy() wraps around when
> "pipe->tail > pipe->head" leading to pipe_full() returning true despite
> the pipe being empty.
> 
> The case of genuine wraparound of "pipe->head" is handled since pipe
> buffer has data allowing readers to make progress until the pipe->tail
> wraps too after which the reader will wakeup a sleeping writer, however,
> mistaking the pipe to be full when it is in fact empty can lead to
> readers and writers waiting on each other indefinitely.
> 
> This issue became more problematic and surfaced as a hang in hackbench
> after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
> the writer if the pipe is still full") significantly reduced the number
> of spurious wakeups of writers that had previously helped mask the
> issue.
> 
> To avoid missing any updates between the reads of "pipe->head" and
> "pipe->write", unionize the two with a single unsigned long
> "pipe->head_tail" member that can be loaded atomically.
> 
> Using "pipe->head_tail" to read the head and the tail ensures the
> lockless checks do not miss any updates to the head or the tail and
> since those two are only updated under "pipe->mutex", it ensures that
> the head is always ahead of, or equal to the tail resulting in correct
> calculations.
> 
>   [ prateek: commit log, testing on x86 platforms. ]
> 
> Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
> Reported-by: Alexey Gladkov <legion@kernel.org>
> Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
> Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>

With this patch, I'm also not reproducing the problem. Thanks!

> ---
> Changes are based on:
> 
>   git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs-6.15.pipe
> 
> at commit commit ee5eda8ea595 ("pipe: change pipe_write() to never add a
> zero-sized buffer") but also applies cleanly on top of v6.14-rc5.
> 
> The diff from Linus is kept as is except for removing the whitespaces in
> front of the typedef that checkpatch complained about (the warning on
> usage of typedef itself has been ignored since I could not think of a
> better alternative other than #ifdef hackery in pipe_inode_info and the
> newly introduced pipe_index union.) and the suggestion from Oleg to
> explicitly initialize the "head_tail" with:
> 
>     union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) }
> 
> I went with commit 8cefc107ca54 ("pipe: Use head and tail pointers for
> the ring, not cursor and length") for the "Fixes:" tag since pipe_poll()
> added:
> 
>     unsigned int head = READ_ONCE(pipe->head);
>     unsigned int tail = READ_ONCE(pipe->tail);
> 
>     poll_wait(filp, &pipe->wait, wait);
> 
>     BUG_ON(pipe_occupancy(head, tail) > pipe->ring_size);
> 
> and the race described can trigger that BUG_ON() but as Linus pointed
> out in [1] the commit 85190d15f4ea ("pipe: don't use 'pipe_wait() for
> basic pipe IO") is probably the one that can cause the writers to
> sleep on empty pipe since the pipe_wait() used prior did not drop the
> pipe lock until it called schedule() and prepare_to_wait() was called
> before pipe_unlock() ensuring no races.
> 
> [1] https://lore.kernel.org/all/CAHk-=wh804HX8H86VRUSKoJGVG0eBe8sPz8=E3d8LHftOCSqwQ@mail.gmail.com/
> 
> Please let me know if the patch requires any modifications and I'll jump
> right on it. The changes have been tested on both a 5th Generation AMD
> EPYC system and on a dual socket Intel Emerald Rapids system with
> multiple thousand iterations of hackbench for small and large loop
> counts. Thanks a ton to Swapnil for all the help.
> ---
>  fs/pipe.c                 | 19 ++++++++-----------
>  include/linux/pipe_fs_i.h | 39 +++++++++++++++++++++++++++++++++++++--
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index b0641f75b1ba..780990f307ab 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -210,11 +210,10 @@ static const struct pipe_buf_operations anon_pipe_buf_ops = {
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_readable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int writers = READ_ONCE(pipe->writers);
>  
> -	return !pipe_empty(head, tail) || !writers;
> +	return !pipe_empty(idx.head, idx.tail) || !writers;
>  }
>  
>  static inline unsigned int pipe_update_tail(struct pipe_inode_info *pipe,
> @@ -403,11 +402,10 @@ static inline int is_packetized(struct file *file)
>  /* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
>  static inline bool pipe_writable(const struct pipe_inode_info *pipe)
>  {
> -	unsigned int head = READ_ONCE(pipe->head);
> -	unsigned int tail = READ_ONCE(pipe->tail);
> +	union pipe_index idx = { .head_tail = READ_ONCE(pipe->head_tail) };
>  	unsigned int max_usage = READ_ONCE(pipe->max_usage);
>  
> -	return !pipe_full(head, tail, max_usage) ||
> +	return !pipe_full(idx.head, idx.tail, max_usage) ||
>  		!READ_ONCE(pipe->readers);
>  }
>  
> @@ -649,7 +647,7 @@ pipe_poll(struct file *filp, poll_table *wait)
>  {
>  	__poll_t mask;
>  	struct pipe_inode_info *pipe = filp->private_data;
> -	unsigned int head, tail;
> +	union pipe_index idx;
>  
>  	/* Epoll has some historical nasty semantics, this enables them */
>  	WRITE_ONCE(pipe->poll_usage, true);
> @@ -670,19 +668,18 @@ pipe_poll(struct file *filp, poll_table *wait)
>  	 * if something changes and you got it wrong, the poll
>  	 * table entry will wake you up and fix it.
>  	 */
> -	head = READ_ONCE(pipe->head);
> -	tail = READ_ONCE(pipe->tail);
> +	idx.head_tail = READ_ONCE(pipe->head_tail);
>  
>  	mask = 0;
>  	if (filp->f_mode & FMODE_READ) {
> -		if (!pipe_empty(head, tail))
> +		if (!pipe_empty(idx.head, idx.tail))
>  			mask |= EPOLLIN | EPOLLRDNORM;
>  		if (!pipe->writers && filp->f_pipe != pipe->w_counter)
>  			mask |= EPOLLHUP;
>  	}
>  
>  	if (filp->f_mode & FMODE_WRITE) {
> -		if (!pipe_full(head, tail, pipe->max_usage))
> +		if (!pipe_full(idx.head, idx.tail, pipe->max_usage))
>  			mask |= EPOLLOUT | EPOLLWRNORM;
>  		/*
>  		 * Most Unices do not set EPOLLERR for FIFOs but on Linux they
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index 8ff23bf5a819..3cc4f8eab853 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -31,6 +31,33 @@ struct pipe_buffer {
>  	unsigned long private;
>  };
>  
> +/*
> + * Really only alpha needs 32-bit fields, but
> + * might as well do it for 64-bit architectures
> + * since that's what we've historically done,
> + * and it makes 'head_tail' always be a simple
> + * 'unsigned long'.
> + */
> +#ifdef CONFIG_64BIT
> +typedef unsigned int pipe_index_t;
> +#else
> +typedef unsigned short pipe_index_t;
> +#endif
> +
> +/*
> + * We have to declare this outside 'struct pipe_inode_info',
> + * but then we can't use 'union pipe_index' for an anonymous
> + * union, so we end up having to duplicate this declaration
> + * below. Annoying.
> + */
> +union pipe_index {
> +	unsigned long head_tail;
> +	struct {
> +		pipe_index_t head;
> +		pipe_index_t tail;
> +	};
> +};
> +
>  /**
>   *	struct pipe_inode_info - a linux kernel pipe
>   *	@mutex: mutex protecting the whole thing
> @@ -58,8 +85,16 @@ struct pipe_buffer {
>  struct pipe_inode_info {
>  	struct mutex mutex;
>  	wait_queue_head_t rd_wait, wr_wait;
> -	unsigned int head;
> -	unsigned int tail;
> +
> +	/* This has to match the 'union pipe_index' above */
> +	union {
> +		unsigned long head_tail;
> +		struct {
> +			pipe_index_t head;
> +			pipe_index_t tail;
> +		};
> +	};
> +
>  	unsigned int max_usage;
>  	unsigned int ring_size;
>  	unsigned int nr_accounted;
> 
> base-commit: ee5eda8ea59546af2e8f192c060fbf29862d7cbd
> -- 
> 2.34.1
> 

-- 
Rgrds, legion


  reply	other threads:[~2025-03-04 18:36 UTC|newest]

Thread overview: 109+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-02 14:07 [PATCH] pipe_read: don't wake up the writer if the pipe is still full Oleg Nesterov
2025-01-02 16:20 ` WangYuli
2025-01-02 16:46   ` Oleg Nesterov
2025-01-04  8:42 ` Christian Brauner
2025-01-31  9:49 ` K Prateek Nayak
2025-01-31 13:23   ` Oleg Nesterov
2025-01-31 20:06   ` Linus Torvalds
2025-02-02 17:01     ` Oleg Nesterov
2025-02-02 18:39       ` Linus Torvalds
2025-02-02 19:32         ` Oleg Nesterov
2025-02-04 11:17         ` Christian Brauner
2025-02-03  9:05       ` K Prateek Nayak
2025-02-04 13:49         ` Oleg Nesterov
2025-02-24  9:26 ` Sapkal, Swapnil
2025-02-24 14:24   ` Oleg Nesterov
2025-02-24 18:36     ` Linus Torvalds
2025-02-25 14:26       ` Oleg Nesterov
2025-02-25 11:57     ` Oleg Nesterov
2025-02-26  5:55       ` Sapkal, Swapnil
2025-02-26 11:38         ` Oleg Nesterov
2025-02-26 17:56           ` Sapkal, Swapnil
2025-02-26 18:12             ` Oleg Nesterov
2025-03-03 13:00       ` Alexey Gladkov
2025-03-03 15:46         ` K Prateek Nayak
2025-03-03 17:18           ` Alexey Gladkov
2025-02-26 13:18     ` Mateusz Guzik
2025-02-26 13:21       ` Mateusz Guzik
2025-02-26 17:16         ` Oleg Nesterov
2025-02-27 16:18       ` Sapkal, Swapnil
2025-02-27 16:34         ` Mateusz Guzik
2025-02-27 21:12         ` Oleg Nesterov
2025-02-28  5:58           ` Sapkal, Swapnil
2025-02-28 14:30             ` Oleg Nesterov
2025-02-28 16:33               ` Oleg Nesterov
2025-03-03  9:46                 ` Sapkal, Swapnil
2025-03-03 14:37                   ` Mateusz Guzik
2025-03-03 14:51                     ` Mateusz Guzik
2025-03-03 15:31                       ` K Prateek Nayak
2025-03-03 17:54                         ` Mateusz Guzik
2025-03-03 18:11                           ` Linus Torvalds
2025-03-03 18:33                             ` Mateusz Guzik
2025-03-03 18:55                               ` Linus Torvalds
2025-03-03 19:06                                 ` Mateusz Guzik
2025-03-03 20:27                                 ` Oleg Nesterov
2025-03-03 20:46                                   ` Linus Torvalds
2025-03-04  5:31                                     ` K Prateek Nayak
2025-03-04  6:32                                       ` Linus Torvalds
2025-03-04 12:54                                     ` Oleg Nesterov
2025-03-04 13:25                                       ` Oleg Nesterov
2025-03-04 18:28                                       ` Linus Torvalds
2025-03-04 22:11                                         ` Oleg Nesterov
2025-03-05  4:40                                         ` K Prateek Nayak
2025-03-05  4:52                                           ` Linus Torvalds
2025-03-04 13:51                                     ` [PATCH] fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex K Prateek Nayak
2025-03-04 18:36                                       ` Alexey Gladkov [this message]
2025-03-04 19:03                                       ` Linus Torvalds
2025-03-05 15:31                                     ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full Rasmus Villemoes
2025-03-05 16:50                                       ` Linus Torvalds
2025-03-06  9:48                                         ` Rasmus Villemoes
2025-03-06 14:42                                           ` Rasmus Villemoes
2025-03-05 16:40                                     ` Linus Torvalds
2025-03-06  8:35                                       ` Rasmus Villemoes
2025-03-06 17:59                                         ` Linus Torvalds
2025-03-06  9:28                                       ` Rasmus Villemoes
2025-03-06 11:39                                       ` [RFC PATCH 0/3] pipe: Convert pipe->{head,tail} to unsigned short K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 1/3] fs/pipe: Limit the slots in pipe_resize_ring() K Prateek Nayak
2025-03-06 12:28                                           ` Oleg Nesterov
2025-03-06 15:26                                             ` K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 2/3] fs/splice: Atomically read pipe->{head,tail} in opipe_prep() K Prateek Nayak
2025-03-06 11:39                                         ` [RFC PATCH 3/3] treewide: pipe: Convert all references to pipe->{head,tail,max_usage,ring_size} to unsigned short K Prateek Nayak
2025-03-06 12:32                                           ` Oleg Nesterov
2025-03-06 12:41                                             ` Oleg Nesterov
2025-03-06 15:33                                               ` K Prateek Nayak
2025-03-06 18:04                                                 ` Linus Torvalds
2025-03-06 14:27                                             ` Rasmus Villemoes
2025-03-03 18:32                           ` [PATCH] pipe_read: don't wake up the writer if the pipe is still full K Prateek Nayak
2025-03-04  5:22                             ` K Prateek Nayak
2025-03-03 16:49                   ` Oleg Nesterov
2025-03-04  5:06                   ` Hillf Danton
2025-03-04  5:35                     ` K Prateek Nayak
2025-03-04 10:29                       ` Hillf Danton
2025-03-04 12:34                         ` Oleg Nesterov
2025-03-04 23:35                           ` Hillf Danton
2025-03-04 23:49                             ` Oleg Nesterov
2025-03-05  4:56                               ` Hillf Danton
2025-03-05 11:44                                 ` Oleg Nesterov
2025-03-05 22:46                                   ` Hillf Danton
2025-03-06  9:30                                     ` Oleg Nesterov
2025-03-07  6:08                                       ` Hillf Danton
2025-03-07  6:24                                         ` K Prateek Nayak
2025-03-07 10:46                                           ` Hillf Danton
2025-03-07 11:29                                             ` Oleg Nesterov
2025-03-07 12:34                                               ` Oleg Nesterov
2025-03-07 23:56                                                 ` Hillf Danton
2025-03-09 14:01                                                   ` K Prateek Nayak
2025-03-09 17:02                                                   ` Oleg Nesterov
2025-03-10 10:49                                                     ` Hillf Danton
2025-03-10 11:09                                                       ` Oleg Nesterov
2025-03-10 11:37                                                         ` Hillf Danton
2025-03-10 12:43                                                           ` Oleg Nesterov
2025-03-10 23:33                                                             ` Hillf Danton
2025-03-11  0:26                                                               ` Linus Torvalds
2025-03-11  6:54                                                               ` Oleg Nesterov
     [not found]                                                               ` <20250311112922.3342-1-hdanton@sina.com>
2025-03-11 11:53                                                                 ` Oleg Nesterov
2025-03-07 11:26                                         ` Oleg Nesterov
2025-02-27 12:50   ` Oleg Nesterov
2025-02-27 13:52     ` Oleg Nesterov
2025-02-27 15:59     ` Mateusz Guzik
2025-02-27 16:28       ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z8dIKGCSRWqUqAEI@example.org \
    --to=legion@kernel.org \
    --cc=Ananth.narayan@amd.com \
    --cc=Neeraj.Upadhyay@amd.com \
    --cc=brauner@kernel.org \
    --cc=dhowells@redhat.com \
    --cc=gautham.shenoy@amd.com \
    --cc=hdanton@sina.com \
    --cc=jack@suse.cz \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=manfred@colorfullife.com \
    --cc=mjguzik@gmail.com \
    --cc=oleg@redhat.com \
    --cc=swapnil.sapkal@amd.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wangyuli@uniontech.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).