From: Liu Bo <bo.li.liu@oracle.com>
To: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: linux-btrfs@vger.kernel.org, fdmanana@kernel.org,
Filipe Manana <fdmanana@suse.com>
Subject: Re: [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang
Date: Thu, 9 Mar 2017 09:37:28 -0800 [thread overview]
Message-ID: <20170309173727.GC28780@lim.localdomain> (raw)
In-Reply-To: <20170308022552.14686-2-quwenruo@cn.fujitsu.com>
On Wed, Mar 08, 2017 at 10:25:52AM +0800, Qu Wenruo wrote:
> [BUG]
> If run_delalloc_range() returns error and there is already some ordered
> extents created, btrfs will be hanged with the following backtrace:
>
> Call Trace:
> __schedule+0x2d4/0xae0
> schedule+0x3d/0x90
> btrfs_start_ordered_extent+0x160/0x200 [btrfs]
> ? wake_atomic_t_function+0x60/0x60
> btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
> btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
> btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
> process_one_work+0x2af/0x720
> ? process_one_work+0x22b/0x720
> worker_thread+0x4b/0x4f0
> kthread+0x10f/0x150
> ? process_one_work+0x720/0x720
> ? kthread_create_on_node+0x40/0x40
> ret_from_fork+0x2e/0x40
>
> [CAUSE]
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>| |<---------- cleanup range --------->|
> ||
> \_=> First page handled by end_extent_writepage() in __extent_writepage()
>
> The problem is caused by error handler of run_delalloc_range(), which
> doesn't handle any created ordered extents, leaving them waiting on
> btrfs_finish_ordered_io() to finish.
>
> However after run_delalloc_range() returns error, __extent_writepage()
> won't submit bio, so btrfs_writepage_end_io_hook() won't be triggered
> except the first page, and btrfs_finish_ordered_io() won't be triggered
> for created ordered extents either.
>
> So OE 2~n will hang forever, and if OE 1 is larger than one page, it
> will also hang.
>
> [FIX]
> Introduce btrfs_cleanup_ordered_extents() function to cleanup created
> ordered extents and finish them manually.
>
> The function is based on existing
> btrfs_endio_direct_write_update_ordered() function, and modify it to
> act just like btrfs_writepage_endio_hook() but handles specified range
> other than one page.
>
> After fix, delalloc error will be handled like:
>
> |<------------------ delalloc range --------------------------->|
> | OE 1 | OE 2 | ... | OE n |
> |<>|<-------- ----------->|<------ old error handler --------->|
> || ||
> || \_=> Cleaned up by cleanup_ordered_extents()
> \_=> First page handled by end_extent_writepage() in __extent_writepage()
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Thanks,
-liubo
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
> v2:
> Add BTRFS_ORDERED_SKIP_METADATA flag to avoid double reducing
> outstanding extents, which is already done by
> extent_clear_unlock_delalloc() with EXTENT_DO_ACCOUNT control bit
> v3:
> Skip first page to avoid underflow ordered->bytes_left.
> Fix range passed in cow_file_range() which doesn't cover the whole
> extent.
> Expend extent_clear_unlock_delalloc() range to allow them to handle
> metadata release.
> v4:
> Don't use extra bit to skip metadata freeing for ordered extent,
> but only handle btrfs_reloc_clone_csums() error just before processing
> next extent.
> This makes error handle much easier for run_delalloc_nocow().
> v5:
> Variant gramma and comment fixes suggested by Filipe Manana
> Enhanced commit message to focus on the generic error handler bug,
> pointed out by Filipe Manana
> Adding missing wq/func user in __endio_write_update_ordered(), pointed
> out by Filipe Manana.
> Enhanced commit message with ascii art to explain the bug more easily.
> Fix a bug which can leads to corrupted extent allocation, exposed by
> Filipe Manana.
> v6:
> Split the metadata underflow fix to another patch.
> Use better comment and btrfs_cleanup_ordered_extent() implementation
> from Filipe Manana.
> v7:
> Add back const prefix
> ---
> fs/btrfs/inode.c | 63 ++++++++++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 48 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index fe588bf30d02..d3bc68bbe0e7 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -115,6 +115,31 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
> u64 ram_bytes, int compress_type,
> int type);
>
> +static void __endio_write_update_ordered(struct inode *inode,
> + const u64 offset, const u64 bytes,
> + const bool uptodate);
> +
> +/*
> + * Cleanup all submitted ordered extents in specified range to handle errors
> + * from the fill_dellaloc() callback.
> + *
> + * NOTE: caller must ensure that when an error happens, it can not call
> + * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
> + * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
> + * to be released, which we want to happen only when finishing the ordered
> + * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> + * fill_delalloc() callback already does proper cleanup for the first page of
> + * the range, that is, it invokes the callback writepage_end_io_hook() for the
> + * range of the first page.
> + */
> +static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> + const u64 offset,
> + const u64 bytes)
> +{
> + return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> + bytes - PAGE_SIZE, false);
> +}
> +
> static int btrfs_dirty_inode(struct inode *inode);
>
> #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> @@ -1536,6 +1561,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
> ret = cow_file_range_async(inode, locked_page, start, end,
> page_started, nr_written);
> }
> + if (ret)
> + btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
> return ret;
> }
>
> @@ -8142,17 +8169,26 @@ static void btrfs_endio_direct_read(struct bio *bio)
> bio_put(bio);
> }
>
> -static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> - const u64 offset,
> - const u64 bytes,
> - const int uptodate)
> +static void __endio_write_update_ordered(struct inode *inode,
> + const u64 offset, const u64 bytes,
> + const bool uptodate)
> {
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> struct btrfs_ordered_extent *ordered = NULL;
> + struct btrfs_workqueue *wq;
> + btrfs_work_func_t func;
> u64 ordered_offset = offset;
> u64 ordered_bytes = bytes;
> int ret;
>
> + if (btrfs_is_free_space_inode(BTRFS_I(inode))) {
> + wq = fs_info->endio_freespace_worker;
> + func = btrfs_freespace_write_helper;
> + } else {
> + wq = fs_info->endio_write_workers;
> + func = btrfs_endio_write_helper;
> + }
> +
> again:
> ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
> &ordered_offset,
> @@ -8161,9 +8197,8 @@ static void btrfs_endio_direct_write_update_ordered(struct inode *inode,
> if (!ret)
> goto out_test;
>
> - btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
> - finish_ordered_fn, NULL, NULL);
> - btrfs_queue_work(fs_info->endio_write_workers, &ordered->work);
> + btrfs_init_work(&ordered->work, func, finish_ordered_fn, NULL, NULL);
> + btrfs_queue_work(wq, &ordered->work);
> out_test:
> /*
> * our bio might span multiple ordered extents. If we haven't
> @@ -8181,10 +8216,8 @@ static void btrfs_endio_direct_write(struct bio *bio)
> struct btrfs_dio_private *dip = bio->bi_private;
> struct bio *dio_bio = dip->dio_bio;
>
> - btrfs_endio_direct_write_update_ordered(dip->inode,
> - dip->logical_offset,
> - dip->bytes,
> - !bio->bi_error);
> + __endio_write_update_ordered(dip->inode, dip->logical_offset,
> + dip->bytes, !bio->bi_error);
>
> kfree(dip);
>
> @@ -8545,10 +8578,10 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
> io_bio = NULL;
> } else {
> if (write)
> - btrfs_endio_direct_write_update_ordered(inode,
> + __endio_write_update_ordered(inode,
> file_offset,
> dio_bio->bi_iter.bi_size,
> - 0);
> + false);
> else
> unlock_extent(&BTRFS_I(inode)->io_tree, file_offset,
> file_offset + dio_bio->bi_iter.bi_size - 1);
> @@ -8683,11 +8716,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> */
> if (dio_data.unsubmitted_oe_range_start <
> dio_data.unsubmitted_oe_range_end)
> - btrfs_endio_direct_write_update_ordered(inode,
> + __endio_write_update_ordered(inode,
> dio_data.unsubmitted_oe_range_start,
> dio_data.unsubmitted_oe_range_end -
> dio_data.unsubmitted_oe_range_start,
> - 0);
> + false);
> } else if (ret >= 0 && (size_t)ret < count)
> btrfs_delalloc_release_space(inode, offset,
> count - (size_t)ret);
> --
> 2.12.0
>
>
>
next prev parent reply other threads:[~2017-03-09 17:38 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-08 2:25 [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Qu Wenruo
2017-03-08 2:25 ` [PATCH v7 2/2] btrfs: Handle delalloc error correctly to avoid ordered extent hang Qu Wenruo
2017-03-08 10:19 ` Filipe Manana
2017-03-09 17:37 ` Liu Bo [this message]
2017-03-08 10:18 ` [PATCH v7 1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error Filipe Manana
2017-03-09 17:35 ` Liu Bo
2017-03-12 20:49 ` Stefan Priebe - Profihost AG
2017-03-13 1:16 ` Qu Wenruo
2017-03-13 7:26 ` Stefan Priebe - Profihost AG
2017-03-13 7:39 ` Qu Wenruo
2017-03-13 13:26 ` Stefan Priebe - Profihost AG
2017-03-14 0:30 ` Qu Wenruo
2017-03-14 2:50 ` Qu Wenruo
2017-03-14 9:06 ` Stefan Priebe - Profihost AG
2017-03-14 9:09 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170309173727.GC28780@lim.localdomain \
--to=bo.li.liu@oracle.com \
--cc=fdmanana@kernel.org \
--cc=fdmanana@suse.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo@cn.fujitsu.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).