All of lore.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Su Yue <l@damenly.su>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Thu, 17 Feb 2022 11:14:29 +0000	[thread overview]
Message-ID: <Yg4uFd40/Y5pQWt2@debian9.Home> (raw)
In-Reply-To: <f0d4a789788e8e63041dc6d91408f40234daa329.1645091180.git.wqu@suse.com>

On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
> 
> 		ASSERT(cache->start == chunk_offset);
> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 				  dev_extent_len);
> 
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
> 
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
> 
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.

This is not correct, what happens is:

1) btrfs_remove_chunk() will remove the device extent items and the chunk
   item;

2) It then calls btrfs_remove_block_group(). This will not remove the
   extent map if the block group was frozen (due to trim, and scrub
   itself). But it will remove the block group (struct btrfs_block_group)
   from the red black tree of block groups, and mark the block group
   as deleted (set struct btrfs_block_group::removed to 1).

   If the extent map was not removed, meaning the block group is frozen,
   then no one will be able to create a new block group with the same logical
   address before the block group is unfrozen (by someone who is holding
   a reference on it). So a lookup on the red black tree will return NULL
   for the logical address until the block group is unfrozen and its
   logical address is reused for a new block group.

So the ordering of what you are saying is reversed - the device extent
and chunk items are removed before marking the block group as deleted.

> 
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().

This is also confusing because it gives a wrong idea of the ordering.
The device extents and chunk items are removed by btrfs_remove_chunk(),
which happens before marking a block group as deleted.

> 
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().

I would like to see more details on this. The sequence of steps between
two tasks that result in this assertion being triggered.

> 
> [FIX]
> 
> - Remove the ASSERT()
> 
> - Add a quick exit if our found bg doesn't match @chunk_offset

This shouldn't happen. I would like to know the sequence of steps
between 2 (or more) tasks that leads to that.

We are getting the block group with btrfs_lookup_block_group() at
scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
with the "contains" argument set to 1, meaning it can return a
block group that contains the given bytenr but does not start at
that bytenr necessarily.

This gives me the idea a small block group was deleted and then a
new one was allocated which starts at a lower logical offset and
includes "chunk_offset" (ends beyond that).

We should probably be using btrfs_lookup_first_block_group() at
scrub_enumerate_chunks(), so it looks for a block group that starts
exactly at the given logical address.

But anyway, as I read this and see the patch's diff, this feels a lot
like a "bail out if something unexpected happens but we don't know
exactly why/how that is possible".

> 
> - Add a comment on the existing checks in scrub_chunk()
>   This is the ultimate safenet, as it will iterate through all the stripes
>   of the found chunk.
>   And only scrub the stripe if it matches both device and physical
>   offset.
> 
>   So even by some how we got a dev extent which is no longer owned
>   by the target chunk, we will just skip this chunk completely, without
>   any consequence.
> 
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add a new quick exit with extra comments
> 
> - Add a new comment in the existing safenet in scrub_chunk()
> ---
>  fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 11089568b287..1c3ed4720ddd 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>  		goto out;
>  
>  	map = em->map_lookup;
> +
> +	/*
> +	 * Due to the delayed modification of device tree, this chunk
> +	 * may not own this dev extent.

This is confusing. What delayed modification, what do you mean by it
exactly? Same below, with more details.

> +	 *
> +	 * So we need to iterate through all stripes, and only scrub
> +	 * the stripe which matches both device and physical offset.
> +	 */
>  	for (i = 0; i < map->num_stripes; ++i) {
>  		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>  		    map->stripes[i].physical == dev_offset) {
> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		if (!cache)
>  			goto skip;
>  
> +		/*
> +		 * Since dev extents modification is delayed, it's possible
> +		 * we got a bg which doesn't really own this dev extent.

Same as before. Too confusing, what is delayed dev extent modification?

Once added, device extents can only be deleted, they are never modified.
I think you are referring to the deletions and the fact we use the commit
roots to find extents, but that should be a more clear comment, without
such level of ambiguity.

Thanks.

> +		 *
> +		 * Here just do a quick skip, scrub_chunk() has a more
> +		 * comprehensive check in it.
> +		 */
> +		if (cache->start != chunk_offset) {
> +			btrfs_put_block_group(cache);
> +			goto skip;
> +		}
> +
>  		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>  			spin_lock(&cache->lock);
>  			if (!cache->to_copy) {
> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		dev_replace->item_needs_writeback = 1;
>  		up_write(&dev_replace->rwsem);
>  
> -		ASSERT(cache->start == chunk_offset);
>  		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>  				  dev_extent_len);
>  
> -- 
> 2.35.1
> 

  reply	other threads:[~2022-02-17 11:14 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana [this message]
2022-02-17 11:24   ` Qu Wenruo
2022-02-17 12:32     ` Filipe Manana
2022-02-17 13:00       ` Qu Wenruo
2022-02-17 13:02         ` Filipe Manana
2022-02-17 13:25           ` Qu Wenruo
2022-02-18  2:29           ` Su Yue
2022-04-19 20:34 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yg4uFd40/Y5pQWt2@debian9.Home \
    --to=fdmanana@kernel.org \
    --cc=l@damenly.su \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.