From: Su Yue <l@damenly.su>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Fri, 18 Feb 2022 10:29:48 +0800 [thread overview]
Message-ID: <5ypcrj51.fsf@damenly.su> (raw)
In-Reply-To: <CAL3q7H6bZLPqGboFN9KDs717puFBXZ9Q-UknWxJ-sD9NZ9TpGw@mail.gmail.com>
On Thu 17 Feb 2022 at 13:02, Filipe Manana <fdmanana@kernel.org>
wrote:
> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo
> <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo
>> > <quwenruo.btrfs@gmx.com> wrote:
>> >>
>> >>
>> >>
>> >> On 2022/2/17 19:14, Filipe Manana wrote:
>> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> >>>> [BUG]
>> >>>> Su reported that with his VM running on an M1 mac, it has
>> >>>> a high chance
>> >>>> to trigger the following ASSERT() with scrub and balance
>> >>>> test cases:
>> >>>>
>> >>>> ASSERT(cache->start == chunk_offset);
>> >>>> ret = scrub_chunk(sctx, cache, scrub_dev,
>> >>>> found_key.offset,
>> >>>> dev_extent_len);
>> >>>>
>> >>>> [CAUSE]
>> >>>> The ASSERT() is making sure that the block group cache
>> >>>> (@cache) and the
>> >>>> chunk offset from the device extent match.
>> >>>>
>> >>>> But the truth is, block group caches and dev extent items
>> >>>> are deleted at
>> >>>> different timing.
>> >>>>
>> >>>> For block group caches and items, after
>> >>>> btrfs_remove_block_group() they
>> >>>> will be marked deleted, but the device extent and the
>> >>>> chunk item is
>> >>>> still in dev tree and chunk tree respectively.
>> >>>
>> >>> This is not correct, what happens is:
>> >>>
>> >>> 1) btrfs_remove_chunk() will remove the device extent items
>> >>> and the chunk
>> >>> item;
>> >>>
>> >>> 2) It then calls btrfs_remove_block_group(). This will not
>> >>> remove the
>> >>> extent map if the block group was frozen (due to trim,
>> >>> and scrub
>> >>> itself). But it will remove the block group (struct
>> >>> btrfs_block_group)
>> >>> from the red black tree of block groups, and mark the
>> >>> block group
>> >>> as deleted (set struct btrfs_block_group::removed to
>> >>> 1).
>> >>>
>> >>> If the extent map was not removed, meaning the block
>> >>> group is frozen,
>> >>> then no one will be able to create a new block group
>> >>> with the same logical
>> >>> address before the block group is unfrozen (by someone
>> >>> who is holding
>> >>> a reference on it). So a lookup on the red black tree
>> >>> will return NULL
>> >>> for the logical address until the block group is
>> >>> unfrozen and its
>> >>> logical address is reused for a new block group.
>> >>>
>> >>> So the ordering of what you are saying is reversed - the
>> >>> device extent
>> >>> and chunk items are removed before marking the block group
>> >>> as deleted.
>> >>>
>> >>>>
>> >>>> The device extents and chunk items are only deleted in
>> >>>> btrfs_remove_chunk(), which happens either a
>> >>>> btrfs_delete_unused_bgs()
>> >>>> or btrfs_relocate_chunk().
>> >>>
>> >>> This is also confusing because it gives a wrong idea of the
>> >>> ordering.
>> >>> The device extents and chunk items are removed by
>> >>> btrfs_remove_chunk(),
>> >>> which happens before marking a block group as deleted.
>> >>>
>> >>>>
>> >>>> This timing difference leaves a window where we can have a
>> >>>> mismatch
>> >>>> between block group cache and device extent tree, and
>> >>>> triggering the
>> >>>> ASSERT().
>> >>>
>> >>> I would like to see more details on this. The sequence of
>> >>> steps between
>> >>> two tasks that result in this assertion being triggered.
>> >>>
>> >>>>
>> >>>> [FIX]
>> >>>>
>> >>>> - Remove the ASSERT()
>> >>>>
>> >>>> - Add a quick exit if our found bg doesn't match
>> >>>> @chunk_offset
>> >>>
>> >>> This shouldn't happen. I would like to know the sequence of
>> >>> steps
>> >>> between 2 (or more) tasks that leads to that.
>> >>>
>> >>> We are getting the block group with
>> >>> btrfs_lookup_block_group() at
>> >>> scrub_enumerate_chunks(), that calls
>> >>> block_group_cache_tree_search()
>> >>> with the "contains" argument set to 1, meaning it can
>> >>> return a
>> >>> block group that contains the given bytenr but does not
>> >>> start at
>> >>> that bytenr necessarily.
>> >>>
>> >>> This gives me the idea a small block group was deleted and
>> >>> then a
>> >>> new one was allocated which starts at a lower logical
>> >>> offset and
>> >>> includes "chunk_offset" (ends beyond that).
>> >>>
>> >>> We should probably be using
>> >>> btrfs_lookup_first_block_group() at
>> >>> scrub_enumerate_chunks(), so it looks for a block group
>> >>> that starts
>> >>> exactly at the given logical address.
>> >>>
>> >>> But anyway, as I read this and see the patch's diff, this
>> >>> feels a lot
>> >>> like a "bail out if something unexpected happens but we
>> >>> don't know
>> >>> exactly why/how that is possible".
>> >>
>> >> You're completely correct.
>> >>
>> >> Unfortunately I have no way to reproduce the bug on x86_64
>> >> VMs, and all
>> >> my ARM VMs are too slow to reproduce.
>> >>
>> >> Furthermore, due to some unrelated bugs, v5.17-rc based
>> >> kernels all
>> >> failed to boot inside the VMs on Apple M1.
>> >>
>> >> So unfortunately I don't have extra details for now.
>> >>
>> >> Will find a way to reproduce first, then update the patch
>> >> with better
>> >> comments.
>> >
>> > Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
>
It is btrfs/074. It was found during early 16k subpage support
development
of Qu. Unfortunately dmesg is lost and it can't be reproduced from
old
branch fetched from `git reflog`.
I'd report back if the issue occurs again.
--
Su
>>
>> But that was on an older branch (some older misc-next, which
>> has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM vm at
>> all, and no
>> early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>> >
>> >>
>> >> Thanks,
>> >> Qu
>> >>
>> >>>
>> >>>>
>> >>>> - Add a comment on the existing checks in scrub_chunk()
>> >>>> This is the ultimate safenet, as it will iterate
>> >>>> through all the stripes
>> >>>> of the found chunk.
>> >>>> And only scrub the stripe if it matches both device
>> >>>> and physical
>> >>>> offset.
>> >>>>
>> >>>> So even by some how we got a dev extent which is no
>> >>>> longer owned
>> >>>> by the target chunk, we will just skip this chunk
>> >>>> completely, without
>> >>>> any consequence.
>> >>>>
>> >>>> Reported-by: Su Yue <l@damenly.su>
>> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> >>>> ---
>> >>>> Changelog:
>> >>>> v2:
>> >>>> - Add a new quick exit with extra comments
>> >>>>
>> >>>> - Add a new comment in the existing safenet in
>> >>>> scrub_chunk()
>> >>>> ---
>> >>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>> >>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> >>>>
>> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> >>>> index 11089568b287..1c3ed4720ddd 100644
>> >>>> --- a/fs/btrfs/scrub.c
>> >>>> +++ b/fs/btrfs/scrub.c
>> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int
>> >>>> scrub_chunk(struct scrub_ctx *sctx,
>> >>>> goto out;
>> >>>>
>> >>>> map = em->map_lookup;
>> >>>> +
>> >>>> + /*
>> >>>> + * Due to the delayed modification of device tree,
>> >>>> this chunk
>> >>>> + * may not own this dev extent.
>> >>>
>> >>> This is confusing. What delayed modification, what do you
>> >>> mean by it
>> >>> exactly? Same below, with more details.
>> >>>
>> >>>> + *
>> >>>> + * So we need to iterate through all stripes, and
>> >>>> only scrub
>> >>>> + * the stripe which matches both device and physical
>> >>>> offset.
>> >>>> + */
>> >>>> for (i = 0; i < map->num_stripes; ++i) {
>> >>>> if (map->stripes[i].dev->bdev ==
>> >>>> scrub_dev->bdev &&
>> >>>> map->stripes[i].physical == dev_offset)
>> >>>> {
>> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct
>> >>>> scrub_ctx *sctx,
>> >>>> if (!cache)
>> >>>> goto skip;
>> >>>>
>> >>>> + /*
>> >>>> + * Since dev extents modification is delayed,
>> >>>> it's possible
>> >>>> + * we got a bg which doesn't really own this
>> >>>> dev extent.
>> >>>
>> >>> Same as before. Too confusing, what is delayed dev extent
>> >>> modification?
>> >>>
>> >>> Once added, device extents can only be deleted, they are
>> >>> never modified.
>> >>> I think you are referring to the deletions and the fact we
>> >>> use the commit
>> >>> roots to find extents, but that should be a more clear
>> >>> comment, without
>> >>> such level of ambiguity.
>> >>>
>> >>> Thanks.
>> >>>
>> >>>> + *
>> >>>> + * Here just do a quick skip, scrub_chunk()
>> >>>> has a more
>> >>>> + * comprehensive check in it.
>> >>>> + */
>> >>>> + if (cache->start != chunk_offset) {
>> >>>> + btrfs_put_block_group(cache);
>> >>>> + goto skip;
>> >>>> + }
>> >>>> +
>> >>>> if (sctx->is_dev_replace &&
>> >>>> btrfs_is_zoned(fs_info)) {
>> >>>> spin_lock(&cache->lock);
>> >>>> if (!cache->to_copy) {
>> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct
>> >>>> scrub_ctx *sctx,
>> >>>> dev_replace->item_needs_writeback = 1;
>> >>>> up_write(&dev_replace->rwsem);
>> >>>>
>> >>>> - ASSERT(cache->start == chunk_offset);
>> >>>> ret = scrub_chunk(sctx, cache, scrub_dev,
>> >>>> found_key.offset,
>> >>>> dev_extent_len);
>> >>>>
>> >>>> --
>> >>>> 2.35.1
>> >>>>
next prev parent reply other threads:[~2022-02-18 2:41 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24 ` Qu Wenruo
2022-02-17 12:32 ` Filipe Manana
2022-02-17 13:00 ` Qu Wenruo
2022-02-17 13:02 ` Filipe Manana
2022-02-17 13:25 ` Qu Wenruo
2022-02-18 2:29 ` Su Yue [this message]
2022-04-19 20:34 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5ypcrj51.fsf@damenly.su \
--to=l@damenly.su \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.