public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion
Date: Mon, 6 Jun 2022 14:36:36 -0700	[thread overview]
Message-ID: <Yp5zZBrdi1iM2Hjo@zen> (raw)
In-Reply-To: <e4fedf38e6197e82ab53913a5c7e2fcc0d41e3d0.1654508104.git.fdmanana@suse.com>

On Mon, Jun 06, 2022 at 10:41:17AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> While doing a reflink operation, if an ordered extent for a file range
> that does not overlap with the source and destination ranges of the
> reflink operation happens, we can end up having a failure in the reflink
> operation and return -EINVAL to user space.
> 
> The following sequence of steps explains how this can happen:
> 
> 1) We have the page at file offset 315392 dirty (under delalloc);
> 
> 2) A reflink operation for this file starts, using the same file as both
>    source and destination, the source range is [372736, 409600[ (length of
>    36864 bytes) and the destination range is [208896, 245760[;
> 
> 3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
>    and destination ranges, and wait for any ordered extents in those range
>    to complete;
> 
> 4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
>    the inode, but we neither wait for it to complete nor any ordered
>    extents to complete. This results in starting delalloc for the page at
>    file offset 315392 and creating an ordered extent for that single page
>    range;
> 
> 5) We then move to btrfs_clone() and enter the loop to find file extent
>    items to copy from the source range to destination range;
> 
> 6) In the first iteration we end up at last file extent item stored in
>    leaf A:
> 
>    (...)
>    item 131 key (143616 108 315392) itemoff 5101 itemsize 53
>             extent data disk bytenr 1903988736 nr 73728
>             extent data offset 12288 nr 61440 ram 73728
> 
>    This represents the file range [315392, 376832[, which overlaps with
>    the source range to clone.
> 
>    @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
>    is therefore set to 376832 (315392 + 61440).
> 
>    @off (372736) is > key.offset (315392), so @new_key.offset is set to
>    the value of @destoff (208896).
> 
>    @new_key.offset == @last_dest_end (208896) so @drop_start is set to
>    208896 (@new_key.offset).
> 
>    @datal is adjusted to 4096, as @off is > @key.offset.
> 
>    So in this iteration we call btrfs_replace_file_extents() for the range
>    [208896, 212991] (a single page, which is
>    [@drop_start, @new_key.offset + @datal - 1]).
> 
>    @last_dest_end is set to 212992 (@new_key.offset + @datal =
>    208896 + 4096 = 212992).
> 
>    Before the next iteration of the loop, @key.offset is set to the value
>    376832, which is @next_key_min_offset;
> 
> 7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
>    but this time pointing beyond the last slot of leaf A, as that's where
>    a key with offset 376832 should be at if it existed. So end up calling
>    btrfs_next_leaf();
> 
> 8) btrfs_next_leaf() releases the path, but before it searches again the
>    tree for the next key/leaf, the ordered extent for the single page
>    range at file offset 315392 completes. That results in trimming the
>    file extent item we processed before, adjusting its key offset from
>    315392 to 319488, reducing its length from 61440 to 57344 and inserting
>    a new file extent item for that single page range, with a key offset of
>    315392 and a length of 4096.
> 
>    Leaf A now looks like:
> 
>      (...)
>      item 132 key (143616 108 315392) itemoff 4995 itemsize 53
>               extent data disk bytenr 1801666560 nr 4096
>               extent data offset 0 nr 4096 ram 4096
>      item 133 key (143616 108 319488) itemoff 4942 itemsize 53
>               extent data disk bytenr 1903988736 nr 73728
>               extent data offset 16384 nr 57344 ram 73728
> 
> 9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
>    at slot 133, since it's the first key that follows what was the last
>    key we saw (143616 108 315392). In fact it's the same item we processed
>    before, but its key offset was changed, so it counts as a new key;
> 
> 10) So now we have:
> 
>     @key.offset == 319488
>     @datal == 57344
> 
>     @off (372736) is > key.offset (319488), so @new_key.offset is set to
>     208896 (@destoff value).
> 
>     @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
>     is set to 212992 (@last_dest_end value).
> 
>     @datal is adjusted to 4096 because @off > @key.offset.
> 
>     So in this iteration we call btrfs_replace_file_extents() for the
>     invalid range of [212992, 212991] (which is
>     [@drop_start, @new_key.offset + @datal - 1]).
> 
>     This range is empty, the end offset is smaller than the start offset
>     so btrfs_replace_file_extents() returns -EINVAL, which we end up
>     returning to user space and fail the reflink operation.
> 
>     This all happens because the range of this file extent item was
>     already processed in the previous iteration.
> 
> This scenario can be triggered very sporadically by fsx from fstests, for
> example with test case generic/522.
> 
> So fix this by having btrfs_clone() skip file extent items that cover a
> file range that we have already processed.
> 
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Error description and fix make sense, and work on my setup.

I sort of wish it were possible to wait for any io touching extents that
overlap the range in prep, not just for the live ordered_extents that 
overlap the range, but I can't think of how to do it, and skipping
redundant extents is a reasonable fix anyway.

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/reflink.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
> index c0f1456be998..7e3b0aa318c1 100644
> --- a/fs/btrfs/reflink.c
> +++ b/fs/btrfs/reflink.c
> @@ -345,6 +345,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	int ret;
>  	const u64 len = olen_aligned;
>  	u64 last_dest_end = destoff;
> +	u64 prev_extent_end = off;
>  
>  	ret = -ENOMEM;
>  	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
> @@ -364,7 +365,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  	key.offset = off;
>  
>  	while (1) {
> -		u64 next_key_min_offset = key.offset + 1;
>  		struct btrfs_file_extent_item *extent;
>  		u64 extent_gen;
>  		int type;
> @@ -432,14 +432,21 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  		 * The first search might have left us at an extent item that
>  		 * ends before our target range's start, can happen if we have
>  		 * holes and NO_HOLES feature enabled.
> +		 *
> +		 * Subsequent searches may leave us on a file range we have
> +		 * processed before - this happens due to a race with ordered
> +		 * extent completion for a file range that is outside our source
> +		 * range, but that range was part of a file extent item that
> +		 * also covered a leading part of our source range.
>  		 */
> -		if (key.offset + datal <= off) {
> +		if (key.offset + datal <= prev_extent_end) {
>  			path->slots[0]++;
>  			goto process_slot;
>  		} else if (key.offset >= off + len) {
>  			break;
>  		}
> -		next_key_min_offset = key.offset + datal;
> +
> +		prev_extent_end = key.offset + datal;
>  		size = btrfs_item_size(leaf, slot);
>  		read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf, slot),
>  				   size);
> @@ -551,7 +558,7 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>  			break;
>  
>  		btrfs_release_path(path);
> -		key.offset = next_key_min_offset;
> +		key.offset = prev_extent_end;
>  
>  		if (fatal_signal_pending(current)) {
>  			ret = -EINTR;
> -- 
> 2.35.1
> 

  reply	other threads:[~2022-06-06 21:36 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-06  9:41 [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate fdmanana
2022-06-06  9:41 ` [PATCH 1/3] btrfs: fix race between reflinking and ordered extent completion fdmanana
2022-06-06 21:36   ` Boris Burkov [this message]
2022-06-06  9:41 ` [PATCH 2/3] btrfs: add missing inode updates on each iteration when replacing extents fdmanana
2022-06-06 22:11   ` Boris Burkov
2022-06-07  9:31     ` Filipe Manana
2022-06-07 16:41       ` Boris Burkov
2022-06-06  9:41 ` [PATCH 3/3] btrfs: do not BUG_ON() on failure to migrate space " fdmanana
2022-06-07 16:44   ` Boris Burkov
2022-06-06 20:45 ` [PATCH 0/3] btrfs: a couple bug fixes around reflinks and fallocate David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yp5zZBrdi1iM2Hjo@zen \
    --to=boris@bur.io \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox