From: Boris Burkov <boris@bur.io>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion
Date: Mon, 6 Jun 2022 14:36:36 -0700 [thread overview]
Message-ID: <Yp5zZBrdi1iM2Hjo@zen> (raw)
In-Reply-To: <e4fedf38e6197e82ab53913a5c7e2fcc0d41e3d0.1654508104.git.fdmanana@suse.com>
On Mon, Jun 06, 2022 at 10:41:17AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> While doing a reflink operation, if an ordered extent for a file range
> that does not overlap with the source and destination ranges of the
> reflink operation happens, we can end up having a failure in the reflink
> operation and return -EINVAL to user space.
>
> The following sequence of steps explains how this can happen:
>
> 1) We have the page at file offset 315392 dirty (under delalloc);
>
> 2) A reflink operation for this file starts, using the same file as both
> source and destination, the source range is [372736, 409600[ (length of
> 36864 bytes) and the destination range is [208896, 245760[;
>
> 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
> and destination ranges, and wait for any ordered extents in those range
> to complete;
>
> 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
> the inode, but we neither wait for it to complete nor any ordered
> extents to complete. This results in starting delalloc for the page at
> file offset 315392 and creating an ordered extent for that single page
> range;
>
> 5) We then move to btrfs_clone() and enter the loop to find file extent
> items to copy from the source range to destination range;
>
> 6) In the first iteration we end up at last file extent item stored in
> leaf A:
>
> (...)
> item 131 key (143616 108 315392) itemoff 5101 itemsize 53
> extent data disk bytenr 1903988736 nr 73728
> extent data offset 12288 nr 61440 ram 73728
>
> This represents the file range [315392, 376832[, which overlaps with
> the source range to clone.
>
> @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
> is therefore set to 376832 (315392 + 61440).
>
> @off (372736) is > key.offset (315392), so @new_key.offset is set to
> the value of @destoff (208896).
>
> @new_key.offset == @last_dest_end (208896) so @drop_start is set to
> 208896 (@new_key.offset).
>
> @datal is adjusted to 4096, as @off is > @key.offset.
>
> So in this iteration we call btrfs_replace_file_extents() for the range
> [208896, 212991] (a single page, which is
> [@drop_start, @new_key.offset + @datal - 1]).
>
> @last_dest_end is set to 212992 (@new_key.offset + @datal =
> 208896 + 4096 = 212992).
>
> Before the next iteration of the loop, @key.offset is set to the value
> 376832, which is @next_key_min_offset;
>
> 7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
> but this time pointing beyond the last slot of leaf A, as that's where
> a key with offset 376832 should be at if it existed. So end up calling
> btrfs_next_leaf();
>
> 8) btrfs_next_leaf() releases the path, but before it searches again the
> tree for the next key/leaf, the ordered extent for the single page
> range at file offset 315392 completes. That results in trimming the
> file extent item we processed before, adjusting its key offset from
> 315392 to 319488, reducing its length from 61440 to 57344 and inserting
> a new file extent item for that single page range, with a key offset of
> 315392 and a length of 4096.
>
> Leaf A now looks like:
>
> (...)
> item 132 key (143616 108 315392) itemoff 4995 itemsize 53
> extent data disk bytenr 1801666560 nr 4096
> extent data offset 0 nr 4096 ram 4096
> item 133 key (143616 108 319488) itemoff 4942 itemsize 53
> extent data disk bytenr 1903988736 nr 73728
> extent data offset 16384 nr 57344 ram 73728
>
> 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
> at slot 133, since it's the first key that follows what was the last
> key we saw (143616 108 315392). In fact it's the same item we processed
> before, but its key offset was changed, so it counts as a new key;
>
> 10) So now we have:
>
> @key.offset == 319488
> @datal == 57344
>
> @off (372736) is > key.offset (319488), so @new_key.offset is set to
> 208896 (@destoff value).
>
> @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
> is set to 212992 (@last_dest_end value).
>
> @datal is adjusted to 4096 because @off > @key.offset.
>
> So in this iteration we call btrfs_replace_file_extents() for the
> invalid range of [212992, 212991] (which is
> [@drop_start, @new_key.offset + @datal - 1]).
>
> This range is empty, the end offset is smaller than the start offset
> so btrfs_replace_file_extents() returns -EINVAL, which we end up
> returning to user space and fail the reflink operation.
>
> This all happens because the range of this file extent item was
> already processed in the previous iteration.
>
> This scenario can be triggered very sporadically by fsx from fstests, for
> example with test case generic/522.
>
> So fix this by having btrfs_clone() skip file extent items that cover a
> file range that we have already processed.
>
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Error description and fix make sense, and work on my setup.
I sort of wish it were possible to wait for any io touching extents that
overlap the range in prep, not just for the live ordered_extents that
overlap the range, but I can't think of how to do it, and skipping
redundant extents is a reasonable fix anyway.
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/reflink.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index c0f1456be998..7e3b0aa318c1 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -345,6 +345,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> int ret;
> const u64 len = olen_aligned;
> u64 last_dest_end = destoff;
> + u64 prev_extent_end = off;
>
> ret = -ENOMEM;
> buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
> @@ -364,7 +365,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> key.offset = off;
>
> while (1) {
> - u64 next_key_min_offset = key.offset + 1;
> struct btrfs_file_extent_item *extent;
> u64 extent_gen;
> int type;
> @@ -432,14 +432,21 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> * The first search might have left us at an extent item that
> * ends before our target range's start, can happen if we have
> * holes and NO_HOLES feature enabled.
> + *
> + * Subsequent searches may leave us on a file range we have
> + * processed before - this happens due to a race with ordered
> + * extent completion for a file range that is outside our source
> + * range, but that range was part of a file extent item that
> + * also covered a leading part of our source range.
> */
> - if (key.offset + datal <= off) {
> + if (key.offset + datal <= prev_extent_end) {
> path->slots[0]++;
> goto process_slot;
> } else if (key.offset >= off + len) {
> break;
> }
> - next_key_min_offset = key.offset + datal;
> +
> + prev_extent_end = key.offset + datal;
> size = btrfs_item_size(leaf, slot);
> read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot),
> size);
> @@ -551,7 +558,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
> break;
>
> btrfs_release_path(path);
> - key.offset = next_key_min_offset;
> + key.offset = prev_extent_end;
>
> if (fatal_signal_pending(current)) {
> ret = -EINTR;
> --
> 2.35.1
>
next prev parent reply other threads:[~2022-06-06 21:36 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-06 9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
2022-06-06 9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
2022-06-06 21:36 ` Boris Burkov [this message]
2022-06-06 9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
2022-06-06 22:11 ` Boris Burkov
2022-06-07 9:31 ` Filipe Manana
2022-06-07 16:41 ` Boris Burkov
2022-06-06 9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
2022-06-07 16:44 ` Boris Burkov
2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yp5zZBrdi1iM2Hjo@zen \
--to=boris@bur.io \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox