From: Filipe Manana <fdmanana@kernel.org>
To: Naohiro Aota <naohiro.aota@wdc.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure
Date: Fri, 17 Jun 2022 11:11:49 +0100 [thread overview]
Message-ID: <20220617101149.GA4041436@falcondesktop> (raw)
In-Reply-To: <318b80987f74e1cf6bf4ab09aed2399538fc4f9e.1655391633.git.naohiro.aota@wdc.com>
On Fri, Jun 17, 2022 at 12:45:39AM +0900, Naohiro Aota wrote:
> There is a hung_task report on zoned btrfs like below.
>
> https://github.com/naota/linux/issues/59
>
> [ 726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
> [ 726.329839] Not tainted 5.16.0-rc1+ #1
> [ 726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000
> [ 726.331608] Call Trace:
> [ 726.331611] <TASK>
> [ 726.331614] __schedule+0x2e5/0x9d0
> [ 726.331622] schedule+0x58/0xd0
> [ 726.331626] io_schedule+0x3f/0x70
> [ 726.331629] __folio_lock+0x125/0x200
> [ 726.331634] ? find_get_entries+0x1bc/0x240
> [ 726.331638] ? filemap_invalidate_unlock_two+0x40/0x40
> [ 726.331642] truncate_inode_pages_range+0x5b2/0x770
> [ 726.331649] truncate_inode_pages_final+0x44/0x50
> [ 726.331653] btrfs_evict_inode+0x67/0x480
> [ 726.331658] evict+0xd0/0x180
> [ 726.331661] iput+0x13f/0x200
> [ 726.331664] do_unlinkat+0x1c0/0x2b0
> [ 726.331668] __x64_sys_unlink+0x23/0x30
> [ 726.331670] do_syscall_64+0x3b/0xc0
> [ 726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 726.331677] RIP: 0033:0x7fb9490a171b
> [ 726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
> [ 726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
> [ 726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
> [ 726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
> [ 726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
> [ 726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
> [ 726.331693] </TASK>
>
> While we debug the issue, we found running fstests generic/551 on 5GB
> non-zoned null_blk device in the emulated zoned mode also had a
> similar hung issue.
>
> Also, we can reproduce the same symptom with an error injected
> cow_file_range() setup.
>
> The hang occurs when cow_file_range() fails in the middle of
> allocation. cow_file_range() called from do_allocation_zoned() can
> split the give region ([start, end]) for allocation depending on
> current block group usages. When btrfs can allocate bytes for one part
> of the split regions but fails for the other region (e.g. because of
> -ENOSPC), we return the error leaving the pages in the succeeded regions
> locked. Technically, this occurs only when @unlock == 0. Otherwise, we
> unlock the pages in an allocated region after creating an ordered
> extent.
>
> Considering the callers of cow_file_range(unlock=0) won't write out
> the pages, we can unlock the pages on error exit from
> cow_file_range(). So, we can ensure all the pages except @locked_page
> are unlocked on error case.
>
> In summary, cow_file_range now behaves like this:
>
> - page_started == 1 (return value)
> - All the pages are unlocked. IO is started.
> - unlock == 1
> - All the pages except @locked_page are unlocked in any case
> - unlock == 0
> - On success, all the pages are locked for writing out them
> - On failure, all the pages except @locked_page are unlocked
>
> Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
> CC: stable@vger.kernel.org # 5.12+
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
> fs/btrfs/inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1247690e7021..0c3d9998470f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1134,6 +1134,28 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> * *page_started is set to one if we unlock locked_page and do everything
> * required to start IO on it. It may be clean and already done with
> * IO when we return.
> + *
> + * When unlock == 1, we unlock the pages in successfully allocated regions.
> + * When unlock == 0, we leave them locked for writing them out.
> + *
> + * However, we unlock all the pages except @locked_page in case of failure.
> + *
> + * In summary, page locking state will be as follow:
> + *
> + * - page_started == 1 (return value)
> + * - All the pages are unlocked. IO is started.
> + * - Note that this can happen only on success
> + * - unlock == 1
> + * - All the pages except @locked_page are unlocked in any case
> + * - unlock == 0
> + * - On success, all the pages are locked for writing out them
> + * - On failure, all the pages except @locked_page are unlocked
> + *
> + * When a failure happens in the second or later iteration of the
> + * while-loop, the ordered extents created in previous iterations are kept
> + * intact. So, the caller must clean them up by calling
> + * btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
> + * example.
> */
> static noinline int cow_file_range(struct btrfs_inode *inode,
> struct page *locked_page,
> @@ -1143,6 +1165,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> struct btrfs_root *root = inode->root;
> struct btrfs_fs_info *fs_info = root->fs_info;
> u64 alloc_hint = 0;
> + u64 orig_start = start;
> u64 num_bytes;
> unsigned long ram_size;
> u64 cur_alloc_size = 0;
> @@ -1336,18 +1359,44 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> out_unlock:
> + /*
> + * Now, we have three regions to clean up, as shown below.
> + *
> + * |-------(1)----|---(2)---|-------------(3)----------|
> + * `- orig_start `- start `- start + cur_alloc_size `- end
> + *
> + * We process each region below.
> + */
> +
> clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
> EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
> page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> +
> /*
> - * If we reserved an extent for our delalloc range (or a subrange) and
> - * failed to create the respective ordered extent, then it means that
> - * when we reserved the extent we decremented the extent's size from
> - * the data space_info's bytes_may_use counter and incremented the
> - * space_info's bytes_reserved counter by the same amount. We must make
> - * sure extent_clear_unlock_delalloc() does not try to decrement again
> - * the data space_info's bytes_may_use counter, therefore we do not pass
> - * it the flag EXTENT_CLEAR_DATA_RESV.
> + * For the range (1). We have already instantiated the ordered extents
> + * for this region. They are cleaned up by
> + * btrfs_cleanup_ordered_extents() in e.g,
> + * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
> + * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
> + * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
> + * function.
> + *
> + * However, in case of unlock == 0, we still need to unlock the pages
> + * (except @locked_page) to ensure all the pages are unlocked.
> + */
> + if (!unlock && orig_start < start)
> + extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> + locked_page, 0, page_ops);
> +
> + /*
> + * For the range (2). If we reserved an extent for our delalloc range
> + * (or a subrange) and failed to create the respective ordered extent,
> + * then it means that when we reserved the extent we decremented the
> + * extent's size from the data space_info's bytes_may_use counter and
> + * incremented the space_info's bytes_reserved counter by the same
> + * amount. We must make sure extent_clear_unlock_delalloc() does not try
> + * to decrement again the data space_info's bytes_may_use counter,
> + * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
> */
> if (extent_reserved) {
> extent_clear_unlock_delalloc(inode, start,
> @@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> if (start >= end)
> goto out;
> }
> +
> + /*
> + * For the range (3). We never touched the region. In addition to the
> + * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
> + * space_info's bytes_may_use counter, reserved in e.g,
> + * btrfs_check_data_free_space().
This says "in .e.g.", but in fact there's only one place where we increment the
bytes_may_use counter for data, which is at btrfs_check_data_free_space(). So
just remove the "in e.g." part because it gives the idea that the increment/reservation
can happen in more than one place.
Other than that, it looks good, thanks.
> + */
> extent_clear_unlock_delalloc(inode, start, end, locked_page,
> clear_bits | EXTENT_CLEAR_DATA_RESV,
> page_ops);
> --
> 2.35.1
>
next prev parent reply other threads:[~2022-06-17 10:11 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
2022-06-17 10:11 ` Filipe Manana [this message]
2022-06-20 2:22 ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page Naohiro Aota
2022-06-17 10:15 ` Filipe Manana
2022-06-20 2:28 ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write Naohiro Aota
2022-06-17 10:21 ` Filipe Manana
2022-06-20 2:26 ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 4/4] btrfs: replace unnecessary goto with direct return Naohiro Aota
2022-06-17 10:13 ` Filipe Manana
2022-06-20 2:22 ` Naohiro Aota
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220617101149.GA4041436@falcondesktop \
--to=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=naohiro.aota@wdc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.