public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Su Yue <l@damenly.su>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Thu, 17 Feb 2022 11:14:29 +0000	[thread overview]
Message-ID: <Yg4uFd40/Y5pQWt2@debian9.Home> (raw)
In-Reply-To: <f0d4a789788e8e63041dc6d91408f40234daa329.1645091180.git.wqu@suse.com>

On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
> 
> 		ASSERT(cache->start == chunk_offset);
> 		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> 				  dev_extent_len);
> 
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
> 
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
> 
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.

This is not correct, what happens is:

1) btrfs_remove_chunk() will remove the device extent items and the chunk
   item;

2) It then calls btrfs_remove_block_group(). This will not remove the
   extent map if the block group was frozen (due to trim, and scrub
   itself). But it will remove the block group (struct btrfs_block_group)
   from the red black tree of block groups, and mark the block group
   as deleted (set struct btrfs_block_group::removed to 1).

   If the extent map was not removed, meaning the block group is frozen,
   then no one will be able to create a new block group with the same logical
   address before the block group is unfrozen (by someone who is holding
   a reference on it). So a lookup on the red black tree will return NULL
   for the logical address until the block group is unfrozen and its
   logical address is reused for a new block group.

So the ordering of what you are saying is reversed - the device extent
and chunk items are removed before marking the block group as deleted.

> 
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().

This is also confusing because it gives a wrong idea of the ordering.
The device extents and chunk items are removed by btrfs_remove_chunk(),
which happens before marking a block group as deleted.

> 
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().

I would like to see more details on this. The sequence of steps between
two tasks that result in this assertion being triggered.

> 
> [FIX]
> 
> - Remove the ASSERT()
> 
> - Add a quick exit if our found bg doesn't match @chunk_offset

This shouldn't happen. I would like to know the sequence of steps
between 2 (or more) tasks that leads to that.

We are getting the block group with btrfs_lookup_block_group() at
scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
with the "contains" argument set to 1, meaning it can return a
block group that contains the given bytenr but does not start at
that bytenr necessarily.

This gives me the idea a small block group was deleted and then a
new one was allocated which starts at a lower logical offset and
includes "chunk_offset" (ends beyond that).

We should probably be using btrfs_lookup_first_block_group() at
scrub_enumerate_chunks(), so it looks for a block group that starts
exactly at the given logical address.

But anyway, as I read this and see the patch's diff, this feels a lot
like a "bail out if something unexpected happens but we don't know
exactly why/how that is possible".

> 
> - Add a comment on the existing checks in scrub_chunk()
>   This is the ultimate safenet, as it will iterate through all the stripes
>   of the found chunk.
>   And only scrub the stripe if it matches both device and physical
>   offset.
> 
>   So even by some how we got a dev extent which is no longer owned
>   by the target chunk, we will just skip this chunk completely, without
>   any consequence.
> 
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add a new quick exit with extra comments
> 
> - Add a new comment in the existing safenet in scrub_chunk()
> ---
>  fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 11089568b287..1c3ed4720ddd 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>  		goto out;
>  
>  	map = em->map_lookup;
> +
> +	/*
> +	 * Due to the delayed modification of device tree, this chunk
> +	 * may not own this dev extent.

This is confusing. What delayed modification, what do you mean by it
exactly? Same below, with more details.

> +	 *
> +	 * So we need to iterate through all stripes, and only scrub
> +	 * the stripe which matches both device and physical offset.
> +	 */
>  	for (i = 0; i < map->num_stripes; ++i) {
>  		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>  		    map->stripes[i].physical == dev_offset) {
> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		if (!cache)
>  			goto skip;
>  
> +		/*
> +		 * Since dev extents modification is delayed, it's possible
> +		 * we got a bg which doesn't really own this dev extent.

Same as before. Too confusing, what is delayed dev extent modification?

Once added, device extents can only be deleted, they are never modified.
I think you are referring to the deletions and the fact we use the commit
roots to find extents, but that should be a more clear comment, without
such level of ambiguity.

Thanks.

> +		 *
> +		 * Here just do a quick skip, scrub_chunk() has a more
> +		 * comprehensive check in it.
> +		 */
> +		if (cache->start != chunk_offset) {
> +			btrfs_put_block_group(cache);
> +			goto skip;
> +		}
> +
>  		if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>  			spin_lock(&cache->lock);
>  			if (!cache->to_copy) {
> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>  		dev_replace->item_needs_writeback = 1;
>  		up_write(&dev_replace->rwsem);
>  
> -		ASSERT(cache->start == chunk_offset);
>  		ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>  				  dev_extent_len);
>  
> -- 
> 2.35.1
> 

  reply	other threads:[~2022-02-17 11:14 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana [this message]
2022-02-17 11:24   ` Qu Wenruo
2022-02-17 12:32     ` Filipe Manana
2022-02-17 13:00       ` Qu Wenruo
2022-02-17 13:02         ` Filipe Manana
2022-02-17 13:25           ` Qu Wenruo
2022-02-18  2:29           ` Su Yue
2022-04-19 20:34 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yg4uFd40/Y5pQWt2@debian9.Home \
    --to=fdmanana@kernel.org \
    --cc=l@damenly.su \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox