Re: [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Boris Burkov <boris@bur.io>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion
Date: Mon, 6 Jun 2022 14:36:36 -0700	[thread overview]
Message-ID: <Yp5zZBrdi1iM2Hjo@zen> (raw)
In-Reply-To: <e4fedf38e6197e82ab53913a5c7e2fcc0d41e3d0.1654508104.git.fdmanana@suse.com>

On Mon, Jun 06, 2022 at 10:41:17AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> While doing a reflink operation, if an ordered extent for a file range
> that does not overlap with the source and destination ranges of the
> reflink operation happens, we can end up having a failure in the reflink
> operation and return -EINVAL to user space.
> 
> The following sequence of steps explains how this can happen:
> 
> 1) We have the page at file offset 315392 dirty (under delalloc);
> 
> 2) A reflink operation for this file starts, using the same file as both
>    source and destination, the source range is [372736, 409600[ (length of
>    36864 bytes) and the destination range is [208896, 245760[;
> 
> 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
>    and destination ranges, and wait for any ordered extents in those range
>    to complete;
> 
> 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
>    the inode, but we neither wait for it to complete nor any ordered
>    extents to complete. This results in starting delalloc for the page at
>    file offset 315392 and creating an ordered extent for that single page
>    range;
> 
> 5) We then move to btrfs_clone() and enter the loop to find file extent
>    items to copy from the source range to destination range;
> 
> 6) In the first iteration we end up at last file extent item stored in
>    leaf A:
> 
>    (...)
>    item 131 key (143616 108 315392) itemoff 5101 itemsize 53
>             extent data disk bytenr 1903988736 nr 73728
>             extent data offset 12288 nr 61440 ram 73728
> 
>    This represents the file range [315392, 376832[, which overlaps with
>    the source range to clone.
> 
>    @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
>    is therefore set to 376832 (315392 + 61440).
> 
>    @off (372736) is > key.offset (315392), so @new_key.offset is set to
>    the value of @destoff (208896).
> 
>    @new_key.offset == @last_dest_end (208896) so @drop_start is set to
>    208896 (@new_key.offset).
> 
>    @datal is adjusted to 4096, as @off is > @key.offset.
> 
>    So in this iteration we call btrfs_replace_file_extents() for the range
>    [208896, 212991] (a single page, which is
>    [@drop_start, @new_key.offset + @datal - 1]).
> 
>    @last_dest_end is set to 212992 (@new_key.offset + @datal =
>    208896 + 4096 = 212992).
> 
>    Before the next iteration of the loop, @key.offset is set to the value
>    376832, which is @next_key_min_offset;
> 
> 7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
>    but this time pointing beyond the last slot of leaf A, as that's where
>    a key with offset 376832 should be at if it existed. So end up calling
>    btrfs_next_leaf();
> 
> 8) btrfs_next_leaf() releases the path, but before it searches again the
>    tree for the next key/leaf, the ordered extent for the single page
>    range at file offset 315392 completes. That results in trimming the
>    file extent item we processed before, adjusting its key offset from
>    315392 to 319488, reducing its length from 61440 to 57344 and inserting
>    a new file extent item for that single page range, with a key offset of
>    315392 and a length of 4096.
> 
>    Leaf A now looks like:
> 
>      (...)
>      item 132 key (143616 108 315392) itemoff 4995 itemsize 53
>               extent data disk bytenr 1801666560 nr 4096
>               extent data offset 0 nr 4096 ram 4096
>      item 133 key (143616 108 319488) itemoff 4942 itemsize 53
>               extent data disk bytenr 1903988736 nr 73728
>               extent data offset 16384 nr 57344 ram 73728
> 
> 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
>    at slot 133, since it's the first key that follows what was the last
>    key we saw (143616 108 315392). In fact it's the same item we processed
>    before, but its key offset was changed, so it counts as a new key;
> 
> 10) So now we have:
> 
>     @key.offset == 319488
>     @datal == 57344
> 
>     @off (372736) is > key.offset (319488), so @new_key.offset is set to
>     208896 (@destoff value).
> 
>     @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
>     is set to 212992 (@last_dest_end value).
> 
>     @datal is adjusted to 4096 because @off > @key.offset.
> 
>     So in this iteration we call btrfs_replace_file_extents() for the
>     invalid range of [212992, 212991] (which is
>     [@drop_start, @new_key.offset + @datal - 1]).
> 
>     This range is empty, the end offset is smaller than the start offset
>     so btrfs_replace_file_extents() returns -EINVAL, which we end up
>     returning to user space and fail the reflink operation.
> 
>     This all happens because the range of this file extent item was
>     already processed in the previous iteration.
> 
> This scenario can be triggered very sporadically by fsx from fstests, for
> example with test case generic/522.
> 
> So fix this by having btrfs_clone() skip file extent items that cover a
> file range that we have already processed.
> 
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Error description and fix make sense, and work on my setup.

I sort of wish it were possible to wait for any io touching extents that
overlap the range in prep, not just for the live ordered_extents that 
overlap the range, but I can't think of how to do it, and skipping
redundant extents is a reasonable fix anyway.

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/reflink.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index c0f1456be998..7e3b0aa318c1 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -345,6 +345,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	int ret;
>  	const u64 len = olen_aligned;
>  	u64 last_dest_end = destoff;
> +	u64 prev_extent_end = off;
>  
>  	ret = -ENOMEM;
>  	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
> @@ -364,7 +365,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	key.offset = off;
>  
>  	while (1) {
> -		u64 next_key_min_offset = key.offset + 1;
>  		struct btrfs_file_extent_item *extent;
>  		u64 extent_gen;
>  		int type;
> @@ -432,14 +432,21 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  		 * The first search might have left us at an extent item that
>  		 * ends before our target range's start, can happen if we have
>  		 * holes and NO_HOLES feature enabled.
> +		 *
> +		 * Subsequent searches may leave us on a file range we have
> +		 * processed before - this happens due to a race with ordered
> +		 * extent completion for a file range that is outside our source
> +		 * range, but that range was part of a file extent item that
> +		 * also covered a leading part of our source range.
>  		 */
> -		if (key.offset + datal <= off) {
> +		if (key.offset + datal <= prev_extent_end) {
>  			path->slots[0]++;
>  			goto process_slot;
>  		} else if (key.offset >= off + len) {
>  			break;
>  		}
> -		next_key_min_offset = key.offset + datal;
> +
> +		prev_extent_end = key.offset + datal;
>  		size = btrfs_item_size(leaf, slot);
>  		read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot),
>  				   size);
> @@ -551,7 +558,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  			break;
>  
>  		btrfs_release_path(path);
> -		key.offset = next_key_min_offset;
> +		key.offset = prev_extent_end;
>  
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
> -- 
> 2.35.1
>

next prev parent reply	other threads:[~2022-06-06 21:36 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
2022-06-06 21:36   ` Boris Burkov [this message]
2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
2022-06-06 22:11   ` Boris Burkov
2022-06-07  9:31     ` Filipe Manana
2022-06-07 16:41       ` Boris Burkov
2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
2022-06-07 16:44   ` Boris Burkov
2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yp5zZBrdi1iM2Hjo@zen \
    --to=boris@bur.io \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.