Re: [PATCH v2] btrfs: unify the error pathes in __extent_writepage() for subpage and regular sectorsize

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: unify the error pathes in __extent_writepage() for subpage and regular sectorsize
Date: Mon, 2 Aug 2021 18:27:39 +0800	[thread overview]
Message-ID: <3c7d2da3-5038-3737-bf9f-c31106ca7767@suse.com> (raw)
In-Reply-To: <20210723060657.198883-1-wqu@suse.com>



On 2021/7/23 下午2:06, Qu Wenruo wrote:
> [BUG]
> When running btrfs/160 in a loop for subpage with experimental
> compression support, it has a high chance to crash (~20%):
> 
>   BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
>   ------------[ cut here ]------------
>   kernel BUG at fs/btrfs/ordered-data.c:238!
>   Internal error: Oops - BUG: 0 [#1] SMP
>   pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>   lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>   Call trace:
>    __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
>    btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
>    run_delalloc_nocow+0x81c/0x8fc [btrfs]
>    btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
>    writepage_delalloc+0xc0/0x1ac [btrfs]
>    __extent_writepage+0xf4/0x370 [btrfs]
>    extent_write_cache_pages+0x288/0x4f4 [btrfs]
>    extent_writepages+0x58/0xe0 [btrfs]
>    btrfs_writepages+0x1c/0x30 [btrfs]
>    do_writepages+0x60/0x110
>    __filemap_fdatawrite_range+0x108/0x170
>    filemap_fdatawrite_range+0x20/0x30
>    btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
>    __btrfs_write_out_cache+0x34c/0x480 [btrfs]
>    btrfs_write_out_cache+0x144/0x220 [btrfs]
>    btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
>    btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
>    btrfs_sync_fs+0x64/0x1cc [btrfs]
>    sync_fs_one_sb+0x3c/0x50
>    iterate_supers+0xcc/0x1d4
>    ksys_sync+0x6c/0xd0
>    __arm64_sys_sync+0x1c/0x30
>    invoke_syscall+0x50/0x120
>    el0_svc_common.constprop.0+0x4c/0xd4
>    do_el0_svc+0x30/0x9c
>    el0_svc+0x2c/0x54
>    el0_sync_handler+0x1a8/0x1b0
>    el0_sync+0x198/0x1c0
>   ---[ end trace 336f67369ae6e0af ]---
> 
> [CAUSE]
> For subpage case, we can have multiple sectors inside a page, this makes
> it possible for __extent_writepage() to have part of its page submitted
> before returning.
> 
> In btrfs/160, we are using dm-dust to emulate write error, this means
> for certain pages, we could have everything running fine, but at the end
> of __extent_writepage(), one of the submitted bios fails due to dm-dust.
> 
> Then the page is marked Error, and we change @ret from 0 to -EIO.
> 
> This makes the caller extent_write_cache_pages() to error out, without
> submitting the remaining pages.
> 
> Furthermore, since we're erroring out for free space cache, it doesn't
> really care about the error and will update the inode and retry the
> writeback.
> 
> Then we re-run the delalloc range, and will try to insert the same
> delalloc range while previous delalloc range is still hanging there,
> triggering the above error.
> 
> [FIX]
> The proper fix is to make btrfs handle errors from __extent_writepage()
> properly, by ending the remaining ordered extent.
> 
> But that fix needs the following refactors:
> 
> - Know at eaxctly which sector the error happened
>    Currently __extent_writepage_io() works for the full page, can't
>    return at which sector we hit the error.
> 
> - Grab the ordered extent covering the failed sector
> 
> As a hotfix for subpage case, here we unify the error pathes in
> __extent_writepage().
> 
> In fact, the "if (PageError(page))" branch never get executed if @ret is
> still 0 for non-subpage cases.
> 
> As for non-subpage case, we never submit current page in
> __extent_writepage(), but only add current page into bio.
> The bio can only get submitted in next page.
> 
> Thus we never get PageError() set due to IO failure, thus when we hit
> the branch, @ret is never 0.
> 
> By simplying removing that @ret assignment, we let subpage case to
> ignore the IO failure, thus only error out for fatal errors just like
> regular sectorsize.
> 
> So that IO error won't be treated as fatal error not trigger the hanging
> OE problem.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> For the proper fix, I pupose to change how we run delalloc range.
> 
> Currently we call writepage_delalloc() for every page we're going to
> write, but it only really get executed for the first page of a contig
> range.
> 
> I hope to change the workflow to something more like async cow:
> - Run delalloc range
>    But keeps all the page still locked
> 
> - Submit pages for the delalloc range
>    Set each sector writeback and unlock the sector after adding it to
>    bio.
> 
> So that we can know exactly the delalloc range is, and where the error
> happens when submitting bio.
> 
> The only down side is btrfs_invalidatepage().
> Previously btrfs_invalidatepage() can invalidate a page which is covered
> by ordered extent, but not yet being added to a bio.
> 
> Now we have to wait for the writeback to finish before doing
> invalidating.
> 
> Changelog:
> v2:
> - Make error handling in end_extent_writepage() to be subpage compatible
> - Update the commit message and add new comment for why it's safe
>    to remove the @ret assignment
> ---
>   fs/btrfs/extent_io.c | 51 ++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 44 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 248f22267665..543f87ea372e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2789,8 +2789,14 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
>   	btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
>   
>   	if (!uptodate) {
> -		ClearPageUptodate(page);
> -		SetPageError(page);
> +		const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +		u32 len;
> +
> +		ASSERT(end + 1 - start <= U32_MAX);
> +		len = end + 1 - start;
> +
> +		btrfs_page_clear_uptodate(fs_info, page, start, len);
> +		btrfs_page_set_error(fs_info, page, start, len);
>   		ret = err < 0 ? err : -EIO;
>   		mapping_set_error(page->mapping, ret);
>   	}
> @@ -3795,7 +3801,8 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>   		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>   				delalloc_end, &page_started, nr_written, wbc);
>   		if (ret) {
> -			SetPageError(page);
> +			btrfs_page_set_error(inode->root->fs_info, page,
> +					     page_offset(page), PAGE_SIZE);
>   			/*
>   			 * btrfs_run_delalloc_range should return < 0 for error
>   			 * but just in case, we use > 0 here meaning the IO is
> @@ -4071,7 +4078,8 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>   
>   	WARN_ON(!PageLocked(page));
>   
> -	ClearPageError(page);
> +	btrfs_page_clear_error(btrfs_sb(inode->i_sb), page,
> +			       page_offset(page), PAGE_SIZE);
>   
>   	pg_offset = offset_in_page(i_size);
>   	if (page->index > end_index ||
> @@ -4112,10 +4120,39 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>   		set_page_writeback(page);
>   		end_page_writeback(page);
>   	}
> -	if (PageError(page)) {
> -		ret = ret < 0 ? ret : -EIO;
> +	/*
> +	 * Here we used to have a check for PageError() and then set @ret and
> +	 * call end_extent_writepage().
> +	 *
> +	 * But in fact setting @ret here will cause different error paths
> +	 * between subpage and regualr sectorsize.
> +	 *
> +	 * For regular page size, we never submit current page, but only add
> +	 * current page to current bio.
> +	 * The bio submission can only happen in next page.
> +	 * Thus if we hit the PageError() branch, @ret is already set to
> +	 * non-zero value and will not get updated for regular sectorsize.
> +	 *
> +	 * But for subpage case, it's possible we submit part of current page,
> +	 * thus can get PageError() set by submitted bio of the same page,
> +	 * while our @ret is still 0.
> +	 *
> +	 * So here we unify the behavior and don't set @ret.
> +	 * Error can still be properly passed to higher layer as page will
> +	 * be set error, here we just don't handle the IO failure.
> +	 *
> +	 * NOTE: This is just a hotfix for subpage.
> +	 * The root fix will be properly ending ordered extent when we hit
> +	 * an error during writeback.
> +	 *
> +	 * But that needs a much bigger refactor, as we not only need to
> +	 * grab the submitted OE, but also needs to know exactly at which
> +	 * bytenr we hit an error.
> +	 * Currently the full page based __extent_writepage_io() is not
> +	 * capable for that.
> +	 */
> +	if (PageError(page))
>   		end_extent_writepage(page, ret, start, page_end);

Well, it turns out, this end_extent_writepage() is causing more problem 
than I thought.


If we have some page layout like this:

0		32K		64K
|/////|		|////|
\- Extent A	\- Extent B.

When Extent A get submitted and IO failed, the page will have PageError set.

Then Extent B get submitted without problem.

But since PageError is set, we will call end_extent_writepage() for the 
whole page.

Which will call btrfs_mark_ordered_io_finished() to cleanup the ordered 
extent.

If btrfs_mark_ordered_io_finished() happens before the endio of Extent 
B, it will finish the Ordered IO for extent B, now ordered extent B will 
be removed from the io extent tree.

Then the bio get submitted, we will call btrfs_csum_one_bio(), but since 
the ordered extent B has been removed, we trigger a BUG_ON() inside 
btrfs_csum_one_bio.


This means, we should not mix the IO error with submission error.

In __extent_writepage() what we're really doing is just bio assembly and 
submission, we should not bother if the IO failed.

I will craft a proper patch mostly for the commit message and error analyse.

But this really show one thing: immature error handling is more buggy 
than we thought.

Thanks,
Qu
> -	}
>   	unlock_page(page);
>   	ASSERT(ret <= 0);
>   	return ret;
>

     prev parent reply	other threads:[~2021-08-02 10:27 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-23  6:06 [PATCH v2] btrfs: unify the error pathes in __extent_writepage() for subpage and regular sectorsize Qu Wenruo
2021-08-02 10:27 ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3c7d2da3-5038-3737-bf9f-c31106ca7767@suse.com \
    --to=wqu@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).