From: Filipe Manana <fdmanana@kernel.org>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, Su Yue <l@damenly.su>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Thu, 17 Feb 2022 11:14:29 +0000 [thread overview]
Message-ID: <Yg4uFd40/Y5pQWt2@debian9.Home> (raw)
In-Reply-To: <f0d4a789788e8e63041dc6d91408f40234daa329.1645091180.git.wqu@suse.com>
On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
> [BUG]
> Su reported that with his VM running on an M1 mac, it has a high chance
> to trigger the following ASSERT() with scrub and balance test cases:
>
> ASSERT(cache->start == chunk_offset);
> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> dev_extent_len);
>
> [CAUSE]
> The ASSERT() is making sure that the block group cache (@cache) and the
> chunk offset from the device extent match.
>
> But the truth is, block group caches and dev extent items are deleted at
> different timing.
>
> For block group caches and items, after btrfs_remove_block_group() they
> will be marked deleted, but the device extent and the chunk item is
> still in dev tree and chunk tree respectively.
This is not correct, what happens is:
1) btrfs_remove_chunk() will remove the device extent items and the chunk
item;
2) It then calls btrfs_remove_block_group(). This will not remove the
extent map if the block group was frozen (due to trim, and scrub
itself). But it will remove the block group (struct btrfs_block_group)
from the red black tree of block groups, and mark the block group
as deleted (set struct btrfs_block_group::removed to 1).
If the extent map was not removed, meaning the block group is frozen,
then no one will be able to create a new block group with the same logical
address before the block group is unfrozen (by someone who is holding
a reference on it). So a lookup on the red black tree will return NULL
for the logical address until the block group is unfrozen and its
logical address is reused for a new block group.
So the ordering of what you are saying is reversed - the device extent
and chunk items are removed before marking the block group as deleted.
>
> The device extents and chunk items are only deleted in
> btrfs_remove_chunk(), which happens either a btrfs_delete_unused_bgs()
> or btrfs_relocate_chunk().
This is also confusing because it gives a wrong idea of the ordering.
The device extents and chunk items are removed by btrfs_remove_chunk(),
which happens before marking a block group as deleted.
>
> This timing difference leaves a window where we can have a mismatch
> between block group cache and device extent tree, and triggering the
> ASSERT().
I would like to see more details on this. The sequence of steps between
two tasks that result in this assertion being triggered.
>
> [FIX]
>
> - Remove the ASSERT()
>
> - Add a quick exit if our found bg doesn't match @chunk_offset
This shouldn't happen. I would like to know the sequence of steps
between 2 (or more) tasks that leads to that.
We are getting the block group with btrfs_lookup_block_group() at
scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
with the "contains" argument set to 1, meaning it can return a
block group that contains the given bytenr but does not start at
that bytenr necessarily.
This gives me the idea a small block group was deleted and then a
new one was allocated which starts at a lower logical offset and
includes "chunk_offset" (ends beyond that).
We should probably be using btrfs_lookup_first_block_group() at
scrub_enumerate_chunks(), so it looks for a block group that starts
exactly at the given logical address.
But anyway, as I read this and see the patch's diff, this feels a lot
like a "bail out if something unexpected happens but we don't know
exactly why/how that is possible".
>
> - Add a comment on the existing checks in scrub_chunk()
> This is the ultimate safenet, as it will iterate through all the stripes
> of the found chunk.
> And only scrub the stripe if it matches both device and physical
> offset.
>
> So even by some how we got a dev extent which is no longer owned
> by the target chunk, we will just skip this chunk completely, without
> any consequence.
>
> Reported-by: Su Yue <l@damenly.su>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add a new quick exit with extra comments
>
> - Add a new comment in the existing safenet in scrub_chunk()
> ---
> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 11089568b287..1c3ed4720ddd 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
> goto out;
>
> map = em->map_lookup;
> +
> + /*
> + * Due to the delayed modification of device tree, this chunk
> + * may not own this dev extent.
This is confusing. What delayed modification, what do you mean by it
exactly? Same below, with more details.
> + *
> + * So we need to iterate through all stripes, and only scrub
> + * the stripe which matches both device and physical offset.
> + */
> for (i = 0; i < map->num_stripes; ++i) {
> if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
> map->stripes[i].physical == dev_offset) {
> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> if (!cache)
> goto skip;
>
> + /*
> + * Since dev extents modification is delayed, it's possible
> + * we got a bg which doesn't really own this dev extent.
Same as before. Too confusing, what is delayed dev extent modification?
Once added, device extents can only be deleted, they are never modified.
I think you are referring to the deletions and the fact we use the commit
roots to find extents, but that should be a more clear comment, without
such level of ambiguity.
Thanks.
> + *
> + * Here just do a quick skip, scrub_chunk() has a more
> + * comprehensive check in it.
> + */
> + if (cache->start != chunk_offset) {
> + btrfs_put_block_group(cache);
> + goto skip;
> + }
> +
> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
> spin_lock(&cache->lock);
> if (!cache->to_copy) {
> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
> dev_replace->item_needs_writeback = 1;
> up_write(&dev_replace->rwsem);
>
> - ASSERT(cache->start == chunk_offset);
> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
> dev_extent_len);
>
> --
> 2.35.1
>
next prev parent reply other threads:[~2022-02-17 11:14 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana [this message]
2022-02-17 11:24 ` Qu Wenruo
2022-02-17 12:32 ` Filipe Manana
2022-02-17 13:00 ` Qu Wenruo
2022-02-17 13:02 ` Filipe Manana
2022-02-17 13:25 ` Qu Wenruo
2022-02-18 2:29 ` Su Yue
2022-04-19 20:34 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yg4uFd40/Y5pQWt2@debian9.Home \
--to=fdmanana@kernel.org \
--cc=l@damenly.su \
--cc=linux-btrfs@vger.kernel.org \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox