From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>, Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, Su Yue <l@damenly.su>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Thu, 17 Feb 2022 21:25:19 +0800
Message-ID: <cb3b778b-b619-23cf-7b07-3e7aa1c02caf@suse.com>
In-Reply-To: <CAL3q7H6bZLPqGboFN9KDs717puFBXZ9Q-UknWxJ-sD9NZ9TpGw@mail.gmail.com>
On 2022/2/17 21:02, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>>> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2022/2/17 19:14, Filipe Manana wrote:
>>>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>>>>>> [BUG]
>>>>>> Su reported that with his VM running on an M1 mac, it has a high chance
>>>>>> to trigger the following ASSERT() with scrub and balance test cases:
>>>>>>
>>>>>> ASSERT(cache->start == chunk_offset);
>>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>> dev_extent_len);
>>>>>>
>>>>>> [CAUSE]
>>>>>> The ASSERT() is making sure that the block group cache (@cache) and the
>>>>>> chunk offset from the device extent match.
>>>>>>
>>>>>> But the truth is, block group caches and dev extent items are deleted at
>>>>>> different times.
>>>>>>
>>>>>> For block group caches and items, after btrfs_remove_block_group() they
>>>>>> will be marked deleted, but the device extent and the chunk item are
>>>>>> still in the dev tree and chunk tree respectively.
>>>>>
>>>>> This is not correct, what happens is:
>>>>>
>>>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>>>>> item;
>>>>>
>>>>> 2) It then calls btrfs_remove_block_group(). This will not remove the
>>>>> extent map if the block group was frozen (due to trim, and scrub
>>>>> itself). But it will remove the block group (struct btrfs_block_group)
>>>>> from the red black tree of block groups, and mark the block group
>>>>> as deleted (set struct btrfs_block_group::removed to 1).
>>>>>
>>>>> If the extent map was not removed, meaning the block group is frozen,
>>>>> then no one will be able to create a new block group with the same logical
>>>>> address before the block group is unfrozen (by someone who is holding
>>>>> a reference on it). So a lookup on the red black tree will return NULL
>>>>> for the logical address until the block group is unfrozen and its
>>>>> logical address is reused for a new block group.
>>>>>
>>>>> So the ordering of what you are saying is reversed - the device extent
>>>>> and chunk items are removed before marking the block group as deleted.
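As a sketch of the two-step ordering described above, here is a minimal
user-space C model. All the names below (bg_model, model_remove_chunk(),
model_remove_block_group()) are hypothetical stand-ins for illustration,
not kernel APIs; they only mirror the sequence of state changes:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long u64;

/* Hypothetical model of one block group's relevant state. */
struct bg_model {
	u64 start;
	bool frozen;		/* frozen by trim/scrub */
	bool removed;		/* struct btrfs_block_group::removed analogue */
	bool in_tree;		/* linked in the block group red-black tree */
	bool has_extent_map;
	bool has_chunk_item;	/* item in the chunk tree */
	bool has_dev_extent;	/* item(s) in the device tree */
};

/* Step 2: unlink the bg and mark it removed; if the bg is frozen,
 * the extent map is kept so the logical range cannot be reused yet. */
static void model_remove_block_group(struct bg_model *bg)
{
	bg->in_tree = false;
	bg->removed = true;
	if (!bg->frozen)
		bg->has_extent_map = false;
}

/* Step 1: btrfs_remove_chunk() deletes the dev extent and chunk items
 * first, and only then goes through block group removal. */
static void model_remove_chunk(struct bg_model *bg)
{
	bg->has_dev_extent = false;
	bg->has_chunk_item = false;
	model_remove_block_group(bg);
}
```

After model_remove_chunk() on a frozen block group, the on-disk items are
gone and the bg is unlinked and marked removed, yet the extent map survives,
which is the point of the ordering argument above.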
>>>>>
>>>>>>
>>>>>> The device extents and chunk items are only deleted in
>>>>>> btrfs_remove_chunk(), which happens in either btrfs_delete_unused_bgs()
>>>>>> or btrfs_relocate_chunk().
>>>>>
>>>>> This is also confusing because it gives a wrong idea of the ordering.
>>>>> The device extents and chunk items are removed by btrfs_remove_chunk(),
>>>>> which happens before marking a block group as deleted.
>>>>>
>>>>>>
>>>>>> This timing difference leaves a window where we can have a mismatch
>>>>>> between the block group cache and the device extent tree, triggering the
>>>>>> ASSERT().
>>>>>
>>>>> I would like to see more details on this. The sequence of steps between
>>>>> two tasks that result in this assertion being triggered.
>>>>>
>>>>>>
>>>>>> [FIX]
>>>>>>
>>>>>> - Remove the ASSERT()
>>>>>>
>>>>>> - Add a quick exit if our found bg doesn't match @chunk_offset
>>>>>
>>>>> This shouldn't happen. I would like to know the sequence of steps
>>>>> between 2 (or more) tasks that leads to that.
>>>>>
>>>>> We are getting the block group with btrfs_lookup_block_group() at
>>>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
>>>>> with the "contains" argument set to 1, meaning it can return a
>>>>> block group that contains the given bytenr but does not start at
>>>>> that bytenr necessarily.
>>>>>
>>>>> This gives me the idea a small block group was deleted and then a
>>>>> new one was allocated which starts at a lower logical offset and
>>>>> includes "chunk_offset" (ends beyond that).
>>>>>
>>>>> We should probably be using btrfs_lookup_first_block_group() at
>>>>> scrub_enumerate_chunks(), so it looks for a block group that starts
>>>>> exactly at the given logical address.
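The difference between the two lookup flavours discussed above can be
sketched in plain C. These are illustrative stand-ins only (the real kernel
code walks a red-black tree of block groups, not an array, and the function
names here are made up):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct bg { u64 start; u64 length; };

/* "contains" semantics (what btrfs_lookup_block_group() uses): return a
 * block group whose range covers @bytenr, even if it starts below it. */
static const struct bg *lookup_contains(const struct bg *bgs, size_t n,
					u64 bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (bgs[i].start <= bytenr &&
		    bytenr < bgs[i].start + bgs[i].length)
			return &bgs[i];
	return NULL;
}

/* exact-start semantics: only accept a block group starting at @bytenr. */
static const struct bg *lookup_exact(const struct bg *bgs, size_t n,
				     u64 bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (bgs[i].start == bytenr)
			return &bgs[i];
	return NULL;
}
```

With a newer, larger block group covering a stale chunk_offset, the
"contains" lookup returns a bg whose start differs from chunk_offset,
which is exactly the mismatch hypothesised above; an exact-start lookup
would return NULL instead.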
>>>>>
>>>>> But anyway, as I read this and see the patch's diff, this feels a lot
>>>>> like a "bail out if something unexpected happens but we don't know
>>>>> exactly why/how that is possible".
>>>>
>>>> You're completely correct.
>>>>
>>>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
>>>> my ARM VMs are too slow to reproduce.
>>>>
>>>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
>>>> failed to boot inside the VMs on Apple M1.
>>>>
>>>> So unfortunately I don't have extra details for now.
>>>>
>>>> Will find a way to reproduce first, then update the patch with better
>>>> comments.
>>>
>>> Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
Su didn't mention which, but I guess it's one of the tests in the scrub +
replace group.
>
>>
>> But that was on an older branch (some older misc-next, which has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM VM at all, and there
>> are no early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>>
>>>>>> - Add a comment on the existing checks in scrub_chunk()
>>>>>> This is the ultimate safety net, as it iterates through all the stripes
>>>>>> of the found chunk, and only scrubs a stripe if it matches both the
>>>>>> device and the physical offset.
>>>>>>
>>>>>> So even if we somehow got a dev extent which is no longer owned
>>>>>> by the target chunk, we will just skip this chunk completely, without
>>>>>> any consequence.
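The safety net described above can be modelled with a short C sketch.
count_matching_stripes() is a hypothetical stand-in for the loop in
scrub_chunk(), not kernel code, and the device IDs and offsets are
made-up example values:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct stripe { int devid; u64 physical; };

/* Among the chunk's stripes, count (in the kernel: scrub) only the one
 * matching both the scrub target device and the dev extent's physical
 * offset. A stale dev extent matches nothing, so the chunk is skipped. */
static int count_matching_stripes(const struct stripe *stripes, size_t n,
				  int scrub_devid, u64 dev_offset)
{
	int matched = 0;

	for (size_t i = 0; i < n; i++)
		if (stripes[i].devid == scrub_devid &&
		    stripes[i].physical == dev_offset)
			matched++;	/* real code: scrub this stripe */
	return matched;
}
```

A dev extent that genuinely belongs to the chunk matches exactly one
stripe; a stale one matches none, so nothing is scrubbed and there is
no harm, matching the commit message's claim.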
>>>>>>
>>>>>> Reported-by: Su Yue <l@damenly.su>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> Changelog:
>>>>>> v2:
>>>>>> - Add a new quick exit with extra comments
>>>>>>
>>>>>> - Add a new comment in the existing safety net in scrub_chunk()
>>>>>> ---
>>>>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>>>> index 11089568b287..1c3ed4720ddd 100644
>>>>>> --- a/fs/btrfs/scrub.c
>>>>>> +++ b/fs/btrfs/scrub.c
>>>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>>>>> goto out;
>>>>>>
>>>>>> map = em->map_lookup;
>>>>>> +
>>>>>> + /*
>>>>>> + * Due to the delayed modification of device tree, this chunk
>>>>>> + * may not own this dev extent.
>>>>>
>>>>> This is confusing. What delayed modification, what do you mean by it
>>>>> exactly? Same below, with more details.
>>>>>
>>>>>> + *
>>>>>> + * So we need to iterate through all stripes, and only scrub
>>>>>> + * the stripe which matches both device and physical offset.
>>>>>> + */
>>>>>> for (i = 0; i < map->num_stripes; ++i) {
>>>>>> if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>>>>> map->stripes[i].physical == dev_offset) {
>>>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>> if (!cache)
>>>>>> goto skip;
>>>>>>
>>>>>> + /*
>>>>>> + * Since dev extents modification is delayed, it's possible
>>>>>> + * we got a bg which doesn't really own this dev extent.
>>>>>
>>>>> Same as before. Too confusing, what is delayed dev extent modification?
>>>>>
>>>>> Once added, device extents can only be deleted, they are never modified.
>>>>> I think you are referring to the deletions and the fact we use the commit
>>>>> roots to find extents, but that should be a more clear comment, without
>>>>> such level of ambiguity.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> + *
>>>>>> + * Here just do a quick skip, scrub_chunk() has a more
>>>>>> + * comprehensive check in it.
>>>>>> + */
>>>>>> + if (cache->start != chunk_offset) {
>>>>>> + btrfs_put_block_group(cache);
>>>>>> + goto skip;
>>>>>> + }
>>>>>> +
>>>>>> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>>>>> spin_lock(&cache->lock);
>>>>>> if (!cache->to_copy) {
>>>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>> dev_replace->item_needs_writeback = 1;
>>>>>> up_write(&dev_replace->rwsem);
>>>>>>
>>>>>> - ASSERT(cache->start == chunk_offset);
>>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>> dev_extent_len);
>>>>>>
>>>>>> --
>>>>>> 2.35.1
>>>>>>
>
Thread overview: 9+ messages
2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24 ` Qu Wenruo
2022-02-17 12:32 ` Filipe Manana
2022-02-17 13:00 ` Qu Wenruo
2022-02-17 13:02 ` Filipe Manana
2022-02-17 13:25 ` Qu Wenruo [this message]
2022-02-18 2:29 ` Su Yue
2022-04-19 20:34 ` David Sterba