From: Su Yue <l@damenly.su>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.com>,
linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Fri, 18 Feb 2022 10:29:48 +0800 [thread overview]
Message-ID: <5ypcrj51.fsf@damenly.su> (raw)
In-Reply-To: <CAL3q7H6bZLPqGboFN9KDs717puFBXZ9Q-UknWxJ-sD9NZ9TpGw@mail.gmail.com>
On Thu 17 Feb 2022 at 13:02, Filipe Manana <fdmanana@kernel.org>
wrote:
> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo
> <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo
>> > <quwenruo.btrfs@gmx.com> wrote:
>> >>
>> >>
>> >>
>> >> On 2022/2/17 19:14, Filipe Manana wrote:
>> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> >>>> [BUG]
>> >>>> Su reported that with his VM running on an M1 mac, it has
>> >>>> a high chance
>> >>>> to trigger the following ASSERT() with scrub and balance
>> >>>> test cases:
>> >>>>
>> >>>> ASSERT(cache->start == chunk_offset);
>> >>>> ret = scrub_chunk(sctx, cache, scrub_dev,
>> >>>> found_key.offset,
>> >>>> dev_extent_len);
>> >>>>
>> >>>> [CAUSE]
>> >>>> The ASSERT() is making sure that the block group cache
>> >>>> (@cache) and the
>> >>>> chunk offset from the device extent match.
>> >>>>
>> >>>> But the truth is, block group caches and dev extent items
>> >>>> are deleted at
>> >>>> different timing.
>> >>>>
>> >>>> For block group caches and items, after
>> >>>> btrfs_remove_block_group() they
>> >>>> will be marked deleted, but the device extent and the
>> >>>> chunk item is
>> >>>> still in dev tree and chunk tree respectively.
>> >>>
>> >>> This is not correct, what happens is:
>> >>>
>> >>> 1) btrfs_remove_chunk() will remove the device extent items
>> >>> and the chunk
>> >>> item;
>> >>>
>> >>> 2) It then calls btrfs_remove_block_group(). This will not
>> >>> remove the
>> >>> extent map if the block group was frozen (due to trim,
>> >>> and scrub
>> >>> itself). But it will remove the block group (struct
>> >>> btrfs_block_group)
>> >>> from the red black tree of block groups, and mark the
>> >>> block group
>> >>> as deleted (set struct btrfs_block_group::removed to
>> >>> 1).
>> >>>
>> >>> If the extent map was not removed, meaning the block
>> >>> group is frozen,
>> >>> then no one will be able to create a new block group
>> >>> with the same logical
>> >>> address before the block group is unfrozen (by someone
>> >>> who is holding
>> >>> a reference on it). So a lookup on the red black tree
>> >>> will return NULL
>> >>> for the logical address until the block group is
>> >>> unfrozen and its
>> >>> logical address is reused for a new block group.
>> >>>
>> >>> So the ordering of what you are saying is reversed - the
>> >>> device extent
>> >>> and chunk items are removed before marking the block group
>> >>> as deleted.
>> >>>
>> >>>>
>> >>>> The device extents and chunk items are only deleted in
>> >>>> btrfs_remove_chunk(), which happens either a
>> >>>> btrfs_delete_unused_bgs()
>> >>>> or btrfs_relocate_chunk().
>> >>>
>> >>> This is also confusing because it gives a wrong idea of the
>> >>> ordering.
>> >>> The device extents and chunk items are removed by
>> >>> btrfs_remove_chunk(),
>> >>> which happens before marking a block group as deleted.
>> >>>
>> >>>>
>> >>>> This timing difference leaves a window where we can have a
>> >>>> mismatch
>> >>>> between block group cache and device extent tree, and
>> >>>> triggering the
>> >>>> ASSERT().
>> >>>
>> >>> I would like to see more details on this. The sequence of
>> >>> steps between
>> >>> two tasks that result in this assertion being triggered.
>> >>>
>> >>>>
>> >>>> [FIX]
>> >>>>
>> >>>> - Remove the ASSERT()
>> >>>>
>> >>>> - Add a quick exit if our found bg doesn't match
>> >>>> @chunk_offset
>> >>>
>> >>> This shouldn't happen. I would like to know the sequence of
>> >>> steps
>> >>> between 2 (or more) tasks that leads to that.
>> >>>
>> >>> We are getting the block group with
>> >>> btrfs_lookup_block_group() at
>> >>> scrub_enumerate_chunks(), that calls
>> >>> block_group_cache_tree_search()
>> >>> with the "contains" argument set to 1, meaning it can
>> >>> return a
>> >>> block group that contains the given bytenr but does not
>> >>> start at
>> >>> that bytenr necessarily.
>> >>>
>> >>> This gives me the idea a small block group was deleted and
>> >>> then a
>> >>> new one was allocated which starts at a lower logical
>> >>> offset and
>> >>> includes "chunk_offset" (ends beyond that).
>> >>>
>> >>> We should probably be using
>> >>> btrfs_lookup_first_block_group() at
>> >>> scrub_enumerate_chunks(), so it looks for a block group
>> >>> that starts
>> >>> exactly at the given logical address.
>> >>>
>> >>> But anyway, as I read this and see the patch's diff, this
>> >>> feels a lot
>> >>> like a "bail out if something unexpected happens but we
>> >>> don't know
>> >>> exactly why/how that is possible".
>> >>
>> >> You're completely correct.
>> >>
>> >> Unfortunately I have no way to reproduce the bug on x86_64
>> >> VMs, and all
>> >> my ARM VMs are too slow to reproduce.
>> >>
>> >> Furthermore, due to some unrelated bugs, v5.17-rc based
>> >> kernels all
>> >> failed to boot inside the VMs on Apple M1.
>> >>
>> >> So unfortunately I don't have extra details for now.
>> >>
>> >> Will find a way to reproduce first, then update the patch
>> >> with better
>> >> comments.
>> >
>> > Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
>
It is btrfs/074. It was found during early 16k subpage support
development
of Qu. Unfortunately dmesg is lost and it can't be reproduced from
old
branch fetched from `git reflog`.
I'd report back if the issue occurs again.
--
Su
>>
>> But that was on an older branch (some older misc-next, which
>> has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM vm at
>> all, and no
>> early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>> >
>> >>
>> >> Thanks,
>> >> Qu
>> >>
>> >>>
>> >>>>
>> >>>> - Add a comment on the existing checks in scrub_chunk()
>> >>>> This is the ultimate safenet, as it will iterate
>> >>>> through all the stripes
>> >>>> of the found chunk.
>> >>>> And only scrub the stripe if it matches both device
>> >>>> and physical
>> >>>> offset.
>> >>>>
>> >>>> So even by some how we got a dev extent which is no
>> >>>> longer owned
>> >>>> by the target chunk, we will just skip this chunk
>> >>>> completely, without
>> >>>> any consequence.
>> >>>>
>> >>>> Reported-by: Su Yue <l@damenly.su>
>> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> >>>> ---
>> >>>> Changelog:
>> >>>> v2:
>> >>>> - Add a new quick exit with extra comments
>> >>>>
>> >>>> - Add a new comment in the existing safenet in
>> >>>> scrub_chunk()
>> >>>> ---
>> >>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>> >>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>> >>>>
>> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> >>>> index 11089568b287..1c3ed4720ddd 100644
>> >>>> --- a/fs/btrfs/scrub.c
>> >>>> +++ b/fs/btrfs/scrub.c
>> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int
>> >>>> scrub_chunk(struct scrub_ctx *sctx,
>> >>>> goto out;
>> >>>>
>> >>>> map = em->map_lookup;
>> >>>> +
>> >>>> + /*
>> >>>> + * Due to the delayed modification of device tree,
>> >>>> this chunk
>> >>>> + * may not own this dev extent.
>> >>>
>> >>> This is confusing. What delayed modification, what do you
>> >>> mean by it
>> >>> exactly? Same below, with more details.
>> >>>
>> >>>> + *
>> >>>> + * So we need to iterate through all stripes, and
>> >>>> only scrub
>> >>>> + * the stripe which matches both device and physical
>> >>>> offset.
>> >>>> + */
>> >>>> for (i = 0; i < map->num_stripes; ++i) {
>> >>>> if (map->stripes[i].dev->bdev ==
>> >>>> scrub_dev->bdev &&
>> >>>> map->stripes[i].physical == dev_offset)
>> >>>> {
>> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct
>> >>>> scrub_ctx *sctx,
>> >>>> if (!cache)
>> >>>> goto skip;
>> >>>>
>> >>>> + /*
>> >>>> + * Since dev extents modification is delayed,
>> >>>> it's possible
>> >>>> + * we got a bg which doesn't really own this
>> >>>> dev extent.
>> >>>
>> >>> Same as before. Too confusing, what is delayed dev extent
>> >>> modification?
>> >>>
>> >>> Once added, device extents can only be deleted, they are
>> >>> never modified.
>> >>> I think you are referring to the deletions and the fact we
>> >>> use the commit
>> >>> roots to find extents, but that should be a more clear
>> >>> comment, without
>> >>> such level of ambiguity.
>> >>>
>> >>> Thanks.
>> >>>
>> >>>> + *
>> >>>> + * Here just do a quick skip, scrub_chunk()
>> >>>> has a more
>> >>>> + * comprehensive check in it.
>> >>>> + */
>> >>>> + if (cache->start != chunk_offset) {
>> >>>> + btrfs_put_block_group(cache);
>> >>>> + goto skip;
>> >>>> + }
>> >>>> +
>> >>>> if (sctx->is_dev_replace &&
>> >>>> btrfs_is_zoned(fs_info)) {
>> >>>> spin_lock(&cache->lock);
>> >>>> if (!cache->to_copy) {
>> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct
>> >>>> scrub_ctx *sctx,
>> >>>> dev_replace->item_needs_writeback = 1;
>> >>>> up_write(&dev_replace->rwsem);
>> >>>>
>> >>>> - ASSERT(cache->start == chunk_offset);
>> >>>> ret = scrub_chunk(sctx, cache, scrub_dev,
>> >>>> found_key.offset,
>> >>>> dev_extent_len);
>> >>>>
>> >>>> --
>> >>>> 2.35.1
>> >>>>
next prev parent reply other threads:[~2022-02-18 2:41 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24 ` Qu Wenruo
2022-02-17 12:32 ` Filipe Manana
2022-02-17 13:00 ` Qu Wenruo
2022-02-17 13:02 ` Filipe Manana
2022-02-17 13:25 ` Qu Wenruo
2022-02-18 2:29 ` Su Yue [this message]
2022-04-19 20:34 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5ypcrj51.fsf@damenly.su \
--to=l@damenly.su \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
--cc=wqu@suse.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox