From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2] btrfs: unify the error pathes in __extent_writepage() for subpage and regular sectorsize
Date: Mon, 2 Aug 2021 18:27:39 +0800 [thread overview]
Message-ID: <3c7d2da3-5038-3737-bf9f-c31106ca7767@suse.com> (raw)
In-Reply-To: <20210723060657.198883-1-wqu@suse.com>
On 2021/7/23 下午2:06, Qu Wenruo wrote:
> [BUG]
> When running btrfs/160 in a loop for subpage with experimental
> compression support, it has a high chance to crash (~20%):
>
> BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
> ------------[ cut here ]------------
> kernel BUG at fs/btrfs/ordered-data.c:238!
> Internal error: Oops - BUG: 0 [#1] SMP
> pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
> lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
> Call trace:
> __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
> btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
> run_delalloc_nocow+0x81c/0x8fc [btrfs]
> btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
> writepage_delalloc+0xc0/0x1ac [btrfs]
> __extent_writepage+0xf4/0x370 [btrfs]
> extent_write_cache_pages+0x288/0x4f4 [btrfs]
> extent_writepages+0x58/0xe0 [btrfs]
> btrfs_writepages+0x1c/0x30 [btrfs]
> do_writepages+0x60/0x110
> __filemap_fdatawrite_range+0x108/0x170
> filemap_fdatawrite_range+0x20/0x30
> btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
> __btrfs_write_out_cache+0x34c/0x480 [btrfs]
> btrfs_write_out_cache+0x144/0x220 [btrfs]
> btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
> btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
> btrfs_sync_fs+0x64/0x1cc [btrfs]
> sync_fs_one_sb+0x3c/0x50
> iterate_supers+0xcc/0x1d4
> ksys_sync+0x6c/0xd0
> __arm64_sys_sync+0x1c/0x30
> invoke_syscall+0x50/0x120
> el0_svc_common.constprop.0+0x4c/0xd4
> do_el0_svc+0x30/0x9c
> el0_svc+0x2c/0x54
> el0_sync_handler+0x1a8/0x1b0
> el0_sync+0x198/0x1c0
> ---[ end trace 336f67369ae6e0af ]---
>
> [CAUSE]
> For subpage case, we can have multiple sectors inside a page, this makes
> it possible for __extent_writepage() to have part of its page submitted
> before returning.
>
> In btrfs/160, we are using dm-dust to emulate write error, this means
> for certain pages, we could have everything running fine, but at the end
> of __extent_writepage(), one of the submitted bios fails due to dm-dust.
>
> Then the page is marked Error, and we change @ret from 0 to -EIO.
>
> This makes the caller extent_write_cache_pages() to error out, without
> submitting the remaining pages.
>
> Furthermore, since we're erroring out for free space cache, it doesn't
> really care about the error and will update the inode and retry the
> writeback.
>
> Then we re-run the delalloc range, and will try to insert the same
> delalloc range while previous delalloc range is still hanging there,
> triggering the above error.
>
> [FIX]
> The proper fix is to make btrfs handle errors from __extent_writepage()
> properly, by ending the remaining ordered extent.
>
> But that fix needs the following refactors:
>
> - Know at eaxctly which sector the error happened
> Currently __extent_writepage_io() works for the full page, can't
> return at which sector we hit the error.
>
> - Grab the ordered extent covering the failed sector
>
> As a hotfix for subpage case, here we unify the error pathes in
> __extent_writepage().
>
> In fact, the "if (PageError(page))" branch never get executed if @ret is
> still 0 for non-subpage cases.
>
> As for non-subpage case, we never submit current page in
> __extent_writepage(), but only add current page into bio.
> The bio can only get submitted in next page.
>
> Thus we never get PageError() set due to IO failure, thus when we hit
> the branch, @ret is never 0.
>
> By simplying removing that @ret assignment, we let subpage case to
> ignore the IO failure, thus only error out for fatal errors just like
> regular sectorsize.
>
> So that IO error won't be treated as fatal error not trigger the hanging
> OE problem.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> For the proper fix, I pupose to change how we run delalloc range.
>
> Currently we call writepage_delalloc() for every page we're going to
> write, but it only really get executed for the first page of a contig
> range.
>
> I hope to change the workflow to something more like async cow:
> - Run delalloc range
> But keeps all the page still locked
>
> - Submit pages for the delalloc range
> Set each sector writeback and unlock the sector after adding it to
> bio.
>
> So that we can know exactly the delalloc range is, and where the error
> happens when submitting bio.
>
> The only down side is btrfs_invalidatepage().
> Previously btrfs_invalidatepage() can invalidate a page which is covered
> by ordered extent, but not yet being added to a bio.
>
> Now we have to wait for the writeback to finish before doing
> invalidating.
>
> Changelog:
> v2:
> - Make error handling in end_extent_writepage() to be subpage compatible
> - Update the commit message and add new comment for why it's safe
> to remove the @ret assignment
> ---
> fs/btrfs/extent_io.c | 51 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 248f22267665..543f87ea372e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2789,8 +2789,14 @@ void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
> btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
>
> if (!uptodate) {
> - ClearPageUptodate(page);
> - SetPageError(page);
> + const struct btrfs_fs_info *fs_info = inode->root->fs_info;
> + u32 len;
> +
> + ASSERT(end + 1 - start <= U32_MAX);
> + len = end + 1 - start;
> +
> + btrfs_page_clear_uptodate(fs_info, page, start, len);
> + btrfs_page_set_error(fs_info, page, start, len);
> ret = err < 0 ? err : -EIO;
> mapping_set_error(page->mapping, ret);
> }
> @@ -3795,7 +3801,8 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> delalloc_end, &page_started, nr_written, wbc);
> if (ret) {
> - SetPageError(page);
> + btrfs_page_set_error(inode->root->fs_info, page,
> + page_offset(page), PAGE_SIZE);
> /*
> * btrfs_run_delalloc_range should return < 0 for error
> * but just in case, we use > 0 here meaning the IO is
> @@ -4071,7 +4078,8 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
>
> WARN_ON(!PageLocked(page));
>
> - ClearPageError(page);
> + btrfs_page_clear_error(btrfs_sb(inode->i_sb), page,
> + page_offset(page), PAGE_SIZE);
>
> pg_offset = offset_in_page(i_size);
> if (page->index > end_index ||
> @@ -4112,10 +4120,39 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
> set_page_writeback(page);
> end_page_writeback(page);
> }
> - if (PageError(page)) {
> - ret = ret < 0 ? ret : -EIO;
> + /*
> + * Here we used to have a check for PageError() and then set @ret and
> + * call end_extent_writepage().
> + *
> + * But in fact setting @ret here will cause different error paths
> + * between subpage and regualr sectorsize.
> + *
> + * For regular page size, we never submit current page, but only add
> + * current page to current bio.
> + * The bio submission can only happen in next page.
> + * Thus if we hit the PageError() branch, @ret is already set to
> + * non-zero value and will not get updated for regular sectorsize.
> + *
> + * But for subpage case, it's possible we submit part of current page,
> + * thus can get PageError() set by submitted bio of the same page,
> + * while our @ret is still 0.
> + *
> + * So here we unify the behavior and don't set @ret.
> + * Error can still be properly passed to higher layer as page will
> + * be set error, here we just don't handle the IO failure.
> + *
> + * NOTE: This is just a hotfix for subpage.
> + * The root fix will be properly ending ordered extent when we hit
> + * an error during writeback.
> + *
> + * But that needs a much bigger refactor, as we not only need to
> + * grab the submitted OE, but also needs to know exactly at which
> + * bytenr we hit an error.
> + * Currently the full page based __extent_writepage_io() is not
> + * capable for that.
> + */
> + if (PageError(page))
> end_extent_writepage(page, ret, start, page_end);
Well, it turns out, this end_extent_writepage() is causing more problem
than I thought.
If we have some page layout like this:
0 32K 64K
|/////| |////|
\- Extent A \- Extent B.
When Extent A get submitted and IO failed, the page will have PageError set.
Then Extent B get submitted without problem.
But since PageError is set, we will call end_extent_writepage() for the
whole page.
Which will call btrfs_mark_ordered_io_finished() to cleanup the ordered
extent.
If btrfs_mark_ordered_io_finished() happens before the endio of Extent
B, it will finish the Ordered IO for extent B, now ordered extent B will
be removed from the io extent tree.
Then the bio get submitted, we will call btrfs_csum_one_bio(), but since
the ordered extent B has been removed, we trigger a BUG_ON() inside
btrfs_csum_one_bio.
This means, we should not mix the IO error with submission error.
In __extent_writepage() what we're really doing is just bio assembly and
submission, we should not bother if the IO failed.
I will craft a proper patch mostly for the commit message and error analyse.
But this really show one thing: immature error handling is more buggy
than we thought.
Thanks,
Qu
> - }
> unlock_page(page);
> ASSERT(ret <= 0);
> return ret;
>
prev parent reply other threads:[~2021-08-02 10:27 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-23 6:06 [PATCH v2] btrfs: unify the error pathes in __extent_writepage() for subpage and regular sectorsize Qu Wenruo
2021-08-02 10:27 ` Qu Wenruo [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3c7d2da3-5038-3737-bf9f-c31106ca7767@suse.com \
--to=wqu@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).