RFC: buffered I/O completion change

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* RFC: buffered I/O completion change
@ 2017-08-11  9:59 Christoph Hellwig
  2017-08-11  9:59 ` [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2017-08-11  9:59 UTC (permalink / raw)
  To: linux-xfs

I'm currently debugging a customer issue where we run into a softlockup
in xfs_finish_writeback, and consequently a lot of stuck buffered
writeback.   I'm still not sure what the root cause it as it is very
hard to reproduce, but one thing that sprang to mind again is the
hairy nature of end_buffer_async_write, especially combined with our
loop that calls it.  This patch avoids these pitfalls by opencoding
end_buffer_async_write.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback
  2017-08-11  9:59 RFC: buffered I/O completion change Christoph Hellwig
@ 2017-08-11  9:59 ` Christoph Hellwig
  2017-08-11 16:20   ` Darrick J. Wong
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2017-08-11  9:59 UTC (permalink / raw)
  To: linux-xfs

Our loop in xfs_finish_page_writeback, which iterates over all buffer
heads in a page and then calls end_buffer_async_write, which also
iterates over all buffers in the page to check if any I/O is in flight
is not only inefficient, but also potentially dangerous as
end_buffer_async_write can cause the page and all buffers to be freed.

Replace it with a single loop that does the work of end_buffer_async_write
on a per-page basis.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_aops.c | 62 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 38 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6bf120bb1a17..0ddda41ab428 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -84,12 +84,6 @@ xfs_find_bdev_for_inode(
  * We're now finished for good with this page.  Update the page state via the
  * associated buffer_heads, paying attention to the start and end offsets that
  * we need to process on the page.
- *
- * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last
- * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or
- * the page at all, as we may be racing with memory reclaim and it can free both
- * the bufferhead chain and the page as it will see the page as clean and
- * unused.
  */
 static void
 xfs_finish_page_writeback(
@@ -97,29 +91,42 @@ xfs_finish_page_writeback(
 	struct bio_vec		*bvec,
 	int			error)
 {
-	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
-	struct buffer_head	*head, *bh, *next;
+	struct buffer_head	*head = page_buffers(bvec->bv_page), *bh = head;
+	bool			busy = false;
 	unsigned int		off = 0;
-	unsigned int		bsize;
+	unsigned long		flags;
 
 	ASSERT(bvec->bv_offset < PAGE_SIZE);
 	ASSERT((bvec->bv_offset & (i_blocksize(inode) - 1)) == 0);
-	ASSERT(end < PAGE_SIZE);
+	ASSERT(bvec->bv_offset + bvec->bv_len <= PAGE_SIZE);
 	ASSERT((bvec->bv_len & (i_blocksize(inode) - 1)) == 0);
 
-	bh = head = page_buffers(bvec->bv_page);
-
-	bsize = bh->b_size;
+	local_irq_save(flags);
+	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
 	do {
-		if (off > end)
-			break;
-		next = bh->b_this_page;
-		if (off < bvec->bv_offset)
-			goto next_bh;
-		bh->b_end_io(bh, !error);
-next_bh:
-		off += bsize;
-	} while ((bh = next) != head);
+		if (off >= bvec->bv_offset &&
+		    off < bvec->bv_offset + bvec->bv_len) {
+			BUG_ON(!buffer_async_write(bh));
+			if (error) {
+				mark_buffer_write_io_error(bh);
+				clear_buffer_uptodate(bh);
+				SetPageError(bvec->bv_page);
+			} else {
+				set_buffer_uptodate(bh);
+			}
+			clear_buffer_async_write(bh);
+			unlock_buffer(bh);
+		} else if (buffer_async_write(bh)) {
+			BUG_ON(!buffer_locked(bh));
+			busy = true;
+		}
+		off += bh->b_size;
+	} while ((bh = bh->b_this_page) != head);
+	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
+	local_irq_restore(flags);
+
+	if (!busy)
+		end_page_writeback(bvec->bv_page);
 }
 
 /*
@@ -133,8 +140,10 @@ xfs_destroy_ioend(
 	int			error)
 {
 	struct inode		*inode = ioend->io_inode;
-	struct bio		*last = ioend->io_bio;
-	struct bio		*bio, *next;
+	struct bio		*bio = &ioend->io_inline_bio;
+	struct bio		*last = ioend->io_bio, *next;
+	u64			start = bio->bi_iter.bi_sector;
+	bool			quiet = bio_flagged(bio, BIO_QUIET);
 
 	for (bio = &ioend->io_inline_bio; bio; bio = next) {
 		struct bio_vec	*bvec;
@@ -155,6 +164,11 @@ xfs_destroy_ioend(
 
 		bio_put(bio);
 	}
+
+	if (unlikely(error && !quiet)) {
+		xfs_err_ratelimited(XFS_I(inode)->i_mount,
+			"read error on sector %llu", start);
+	}
 }
 
 /*
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback
  2017-08-11  9:59 ` [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback Christoph Hellwig
@ 2017-08-11 16:20   ` Darrick J. Wong
  2017-08-11 17:09     ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Darrick J. Wong @ 2017-08-11 16:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Fri, Aug 11, 2017 at 11:59:05AM +0200, Christoph Hellwig wrote:
> Our loop in xfs_finish_page_writeback, which iterates over all buffer
> heads in a page and then calls end_buffer_async_write, which also
> iterates over all buffers in the page to check if any I/O is in flight
> is not only inefficient, but also potentially dangerous as
> end_buffer_async_write can cause the page and all buffers to be freed.
> 
> Replace it with a single loop that does the work of end_buffer_async_write
> on a per-page basis.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_aops.c | 62 ++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 38 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 6bf120bb1a17..0ddda41ab428 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -84,12 +84,6 @@ xfs_find_bdev_for_inode(
>   * We're now finished for good with this page.  Update the page state via the
>   * associated buffer_heads, paying attention to the start and end offsets that
>   * we need to process on the page.
> - *
> - * Landmine Warning: bh->b_end_io() will call end_page_writeback() on the last
> - * buffer in the IO. Once it does this, it is unsafe to access the bufferhead or
> - * the page at all, as we may be racing with memory reclaim and it can free both
> - * the bufferhead chain and the page as it will see the page as clean and
> - * unused.

If we're going to open-code end_buffer_async_write, could we have a
comment here explaining that we've done so, and why?

>   */
>  static void
>  xfs_finish_page_writeback(
> @@ -97,29 +91,42 @@ xfs_finish_page_writeback(
>  	struct bio_vec		*bvec,
>  	int			error)
>  {
> -	unsigned int		end = bvec->bv_offset + bvec->bv_len - 1;
> -	struct buffer_head	*head, *bh, *next;
> +	struct buffer_head	*head = page_buffers(bvec->bv_page), *bh = head;
> +	bool			busy = false;
>  	unsigned int		off = 0;
> -	unsigned int		bsize;
> +	unsigned long		flags;
>  
>  	ASSERT(bvec->bv_offset < PAGE_SIZE);
>  	ASSERT((bvec->bv_offset & (i_blocksize(inode) - 1)) == 0);
> -	ASSERT(end < PAGE_SIZE);
> +	ASSERT(bvec->bv_offset + bvec->bv_len <= PAGE_SIZE);
>  	ASSERT((bvec->bv_len & (i_blocksize(inode) - 1)) == 0);

ASSERT(bh->b_end_io == end_buffer_async_write), just in case we ever get
passed a buffer head with an endio function that isn't what we're
open-coding?

>  
> -	bh = head = page_buffers(bvec->bv_page);
> -
> -	bsize = bh->b_size;
> +	local_irq_save(flags);
> +	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
>  	do {
> -		if (off > end)
> -			break;
> -		next = bh->b_this_page;
> -		if (off < bvec->bv_offset)
> -			goto next_bh;
> -		bh->b_end_io(bh, !error);
> -next_bh:
> -		off += bsize;
> -	} while ((bh = next) != head);
> +		if (off >= bvec->bv_offset &&
> +		    off < bvec->bv_offset + bvec->bv_len) {
> +			BUG_ON(!buffer_async_write(bh));
> +			if (error) {
> +				mark_buffer_write_io_error(bh);
> +				clear_buffer_uptodate(bh);
> +				SetPageError(bvec->bv_page);
> +			} else {
> +				set_buffer_uptodate(bh);
> +			}
> +			clear_buffer_async_write(bh);
> +			unlock_buffer(bh);
> +		} else if (buffer_async_write(bh)) {
> +			BUG_ON(!buffer_locked(bh));
> +			busy = true;
> +		}
> +		off += bh->b_size;
> +	} while ((bh = bh->b_this_page) != head);
> +	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> +	local_irq_restore(flags);
> +
> +	if (!busy)
> +		end_page_writeback(bvec->bv_page);
>  }
>  
>  /*
> @@ -133,8 +140,10 @@ xfs_destroy_ioend(
>  	int			error)
>  {
>  	struct inode		*inode = ioend->io_inode;
> -	struct bio		*last = ioend->io_bio;
> -	struct bio		*bio, *next;
> +	struct bio		*bio = &ioend->io_inline_bio;
> +	struct bio		*last = ioend->io_bio, *next;
> +	u64			start = bio->bi_iter.bi_sector;
> +	bool			quiet = bio_flagged(bio, BIO_QUIET);
>  
>  	for (bio = &ioend->io_inline_bio; bio; bio = next) {
>  		struct bio_vec	*bvec;
> @@ -155,6 +164,11 @@ xfs_destroy_ioend(
>  
>  		bio_put(bio);
>  	}
> +
> +	if (unlikely(error && !quiet)) {
> +		xfs_err_ratelimited(XFS_I(inode)->i_mount,
> +			"read error on sector %llu", start);

Read error?  Isn't this writeback?

I also wonder about the wisdom of the text deviating from "lost async
page write", since scripts/admins have been trained to look for that as
evidence of writeback problems for years.

(That said, at least now it's a lot less spammy.)

--D

> +	}
>  }
>  
>  /*
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback
  2017-08-11 16:20   ` Darrick J. Wong
@ 2017-08-11 17:09     ` Christoph Hellwig
  0 siblings, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2017-08-11 17:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs

On Fri, Aug 11, 2017 at 09:20:38AM -0700, Darrick J. Wong wrote:
> If we're going to open-code end_buffer_async_write, could we have a
> comment here explaining that we've done so, and why?

Sure.

> ASSERT(bh->b_end_io == end_buffer_async_write), just in case we ever get
> passed a buffer head with an endio function that isn't what we're
> open-coding?

Ok.

> > +	if (unlikely(error && !quiet)) {
> > +		xfs_err_ratelimited(XFS_I(inode)->i_mount,
> > +			"read error on sector %llu", start);
> 
> Read error?  Isn't this writeback?

Ooops, it is.

> I also wonder about the wisdom of the text deviating from "lost async
> page write", since scripts/admins have been trained to look for that as
> evidence of writeback problems for years.

We could keep the text, but I think it's rather confusing and doesn't
fit the other XFS warnings.  And I don't think admins should rely on
the sorts of warnings.  E.g. for btrfs they would not get these
warnings either.

> (That said, at least now it's a lot less spammy.)

Yes - but we could still print the warning once per bio instead of
once per bh if the message remains the same.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-08-11 17:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-08-11  9:59 RFC: buffered I/O completion change Christoph Hellwig
2017-08-11  9:59 ` [PATCH] xfs: open code end_buffer_async_write in xfs_finish_page_writeback Christoph Hellwig
2017-08-11 16:20   ` Darrick J. Wong
2017-08-11 17:09     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox