From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>, Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, Su Yue <l@damenly.su>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Thu, 17 Feb 2022 21:25:19 +0800
Message-ID: <cb3b778b-b619-23cf-7b07-3e7aa1c02caf@suse.com>
In-Reply-To: <CAL3q7H6bZLPqGboFN9KDs717puFBXZ9Q-UknWxJ-sD9NZ9TpGw@mail.gmail.com>
On 2022/2/17 21:02, Filipe Manana wrote:
> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>>> On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2022/2/17 19:14, Filipe Manana wrote:
>>>>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>>>>>> [BUG]
>>>>>> Su reported that with his VM running on an M1 mac, it has a high chance
>>>>>> to trigger the following ASSERT() with scrub and balance test cases:
>>>>>>
>>>>>> ASSERT(cache->start == chunk_offset);
>>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>> dev_extent_len);
>>>>>>
>>>>>> [CAUSE]
>>>>>> The ASSERT() is making sure that the block group cache (@cache) and the
>>>>>> chunk offset from the device extent match.
>>>>>>
>>>>>> But the truth is, block group caches and dev extent items are deleted at
>>>>>> different times.
>>>>>>
>>>>>> For block group caches and items, after btrfs_remove_block_group() they
>>>>>> will be marked deleted, but the device extent and the chunk item are
>>>>>> still in the dev tree and chunk tree respectively.
>>>>>
>>>>> This is not correct, what happens is:
>>>>>
>>>>> 1) btrfs_remove_chunk() will remove the device extent items and the chunk
>>>>> item;
>>>>>
>>>>> 2) It then calls btrfs_remove_block_group(). This will not remove the
>>>>> extent map if the block group was frozen (due to trim, and scrub
>>>>> itself). But it will remove the block group (struct btrfs_block_group)
>>>>> from the red black tree of block groups, and mark the block group
>>>>> as deleted (set struct btrfs_block_group::removed to 1).
>>>>>
>>>>> If the extent map was not removed, meaning the block group is frozen,
>>>>> then no one will be able to create a new block group with the same logical
>>>>> address before the block group is unfrozen (by someone who is holding
>>>>> a reference on it). So a lookup on the red black tree will return NULL
>>>>> for the logical address until the block group is unfrozen and its
>>>>> logical address is reused for a new block group.
>>>>>
>>>>> So the ordering of what you are saying is reversed - the device extent
>>>>> and chunk items are removed before marking the block group as deleted.
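As a sketch of the two-step ordering described above, here is a minimal
user-space C model. All the names below (bg_model, model_remove_chunk(),
model_remove_block_group()) are hypothetical stand-ins for illustration,
not kernel APIs; they only mirror the sequence of state changes:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long long u64;

/* Hypothetical model of one block group's relevant state. */
struct bg_model {
	u64 start;
	bool frozen;		/* frozen by trim/scrub */
	bool removed;		/* struct btrfs_block_group::removed analogue */
	bool in_tree;		/* linked in the block group red-black tree */
	bool has_extent_map;
	bool has_chunk_item;	/* item in the chunk tree */
	bool has_dev_extent;	/* item(s) in the device tree */
};

/* Step 2: unlink the bg and mark it removed; if the bg is frozen,
 * the extent map is kept so the logical range cannot be reused yet. */
static void model_remove_block_group(struct bg_model *bg)
{
	bg->in_tree = false;
	bg->removed = true;
	if (!bg->frozen)
		bg->has_extent_map = false;
}

/* Step 1: btrfs_remove_chunk() deletes the dev extent and chunk items
 * first, and only then goes through block group removal. */
static void model_remove_chunk(struct bg_model *bg)
{
	bg->has_dev_extent = false;
	bg->has_chunk_item = false;
	model_remove_block_group(bg);
}
```

After model_remove_chunk() on a frozen block group, the on-disk items are
gone and the bg is unlinked and marked removed, yet the extent map survives,
which is the point of the ordering argument above.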
>>>>>
>>>>>>
>>>>>> The device extents and chunk items are only deleted in
>>>>>> btrfs_remove_chunk(), which happens in either btrfs_delete_unused_bgs()
>>>>>> or btrfs_relocate_chunk().
>>>>>
>>>>> This is also confusing because it gives a wrong idea of the ordering.
>>>>> The device extents and chunk items are removed by btrfs_remove_chunk(),
>>>>> which happens before marking a block group as deleted.
>>>>>
>>>>>>
>>>>>> This timing difference leaves a window where we can have a mismatch
>>>>>> between the block group cache and the device extent tree, triggering the
>>>>>> ASSERT().
>>>>>
>>>>> I would like to see more details on this. The sequence of steps between
>>>>> two tasks that result in this assertion being triggered.
>>>>>
>>>>>>
>>>>>> [FIX]
>>>>>>
>>>>>> - Remove the ASSERT()
>>>>>>
>>>>>> - Add a quick exit if our found bg doesn't match @chunk_offset
>>>>>
>>>>> This shouldn't happen. I would like to know the sequence of steps
>>>>> between 2 (or more) tasks that leads to that.
>>>>>
>>>>> We are getting the block group with btrfs_lookup_block_group() at
>>>>> scrub_enumerate_chunks(), that calls block_group_cache_tree_search()
>>>>> with the "contains" argument set to 1, meaning it can return a
>>>>> block group that contains the given bytenr but does not start at
>>>>> that bytenr necessarily.
>>>>>
>>>>> This gives me the idea a small block group was deleted and then a
>>>>> new one was allocated which starts at a lower logical offset and
>>>>> includes "chunk_offset" (ends beyond that).
>>>>>
>>>>> We should probably be using btrfs_lookup_first_block_group() at
>>>>> scrub_enumerate_chunks(), so it looks for a block group that starts
>>>>> exactly at the given logical address.
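The difference between the two lookup flavours discussed above can be
sketched in plain C. These are illustrative stand-ins only (the real kernel
code walks a red-black tree of block groups, not an array, and the function
names here are made up):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct bg { u64 start; u64 length; };

/* "contains" semantics (what btrfs_lookup_block_group() uses): return a
 * block group whose range covers @bytenr, even if it starts below it. */
static const struct bg *lookup_contains(const struct bg *bgs, size_t n,
					u64 bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (bgs[i].start <= bytenr &&
		    bytenr < bgs[i].start + bgs[i].length)
			return &bgs[i];
	return NULL;
}

/* exact-start semantics: only accept a block group starting at @bytenr. */
static const struct bg *lookup_exact(const struct bg *bgs, size_t n,
				     u64 bytenr)
{
	for (size_t i = 0; i < n; i++)
		if (bgs[i].start == bytenr)
			return &bgs[i];
	return NULL;
}
```

With a newer, larger block group covering a stale chunk_offset, the
"contains" lookup returns a bg whose start differs from chunk_offset,
which is exactly the mismatch hypothesised above; an exact-start lookup
would return NULL instead.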
>>>>>
>>>>> But anyway, as I read this and see the patch's diff, this feels a lot
>>>>> like a "bail out if something unexpected happens but we don't know
>>>>> exactly why/how that is possible".
>>>>
>>>> You're completely correct.
>>>>
>>>> Unfortunately I have no way to reproduce the bug on x86_64 VMs, and all
>>>> my ARM VMs are too slow to reproduce.
>>>>
>>>> Furthermore, due to some unrelated bugs, v5.17-rc based kernels all
>>>> failed to boot inside the VMs on Apple M1.
>>>>
>>>> So unfortunately I don't have extra details for now.
>>>>
>>>> Will find a way to reproduce first, then update the patch with better
>>>> comments.
>>>
>>> Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
Su didn't mention which, but I guess it's one of the tests in the scrub +
replace group.
>
>>
>> But that was on an older branch (some older misc-next, which has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM VM at all, and there
>> are no early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>>>
>>>>>> - Add a comment on the existing checks in scrub_chunk()
>>>>>> This is the ultimate safety net, as it iterates through all the stripes
>>>>>> of the found chunk, and only scrubs a stripe if it matches both the
>>>>>> device and the physical offset.
>>>>>>
>>>>>> So even if we somehow got a dev extent which is no longer owned
>>>>>> by the target chunk, we will just skip this chunk completely, without
>>>>>> any consequence.
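The safety net described above can be modelled with a short C sketch.
count_matching_stripes() is a hypothetical stand-in for the loop in
scrub_chunk(), not kernel code, and the device IDs and offsets are
made-up example values:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long u64;

struct stripe { int devid; u64 physical; };

/* Among the chunk's stripes, count (in the kernel: scrub) only the one
 * matching both the scrub target device and the dev extent's physical
 * offset. A stale dev extent matches nothing, so the chunk is skipped. */
static int count_matching_stripes(const struct stripe *stripes, size_t n,
				  int scrub_devid, u64 dev_offset)
{
	int matched = 0;

	for (size_t i = 0; i < n; i++)
		if (stripes[i].devid == scrub_devid &&
		    stripes[i].physical == dev_offset)
			matched++;	/* real code: scrub this stripe */
	return matched;
}
```

A dev extent that genuinely belongs to the chunk matches exactly one
stripe; a stale one matches none, so nothing is scrubbed and there is
no harm, matching the commit message's claim.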
>>>>>>
>>>>>> Reported-by: Su Yue <l@damenly.su>
>>>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>>>> ---
>>>>>> Changelog:
>>>>>> v2:
>>>>>> - Add a new quick exit with extra comments
>>>>>>
>>>>>> - Add a new comment in the existing safety net in scrub_chunk()
>>>>>> ---
>>>>>> fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>>>>>> 1 file changed, 20 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>>>> index 11089568b287..1c3ed4720ddd 100644
>>>>>> --- a/fs/btrfs/scrub.c
>>>>>> +++ b/fs/btrfs/scrub.c
>>>>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
>>>>>> goto out;
>>>>>>
>>>>>> map = em->map_lookup;
>>>>>> +
>>>>>> + /*
>>>>>> + * Due to the delayed modification of device tree, this chunk
>>>>>> + * may not own this dev extent.
>>>>>
>>>>> This is confusing. What delayed modification, what do you mean by it
>>>>> exactly? Same below, with more details.
>>>>>
>>>>>> + *
>>>>>> + * So we need to iterate through all stripes, and only scrub
>>>>>> + * the stripe which matches both device and physical offset.
>>>>>> + */
>>>>>> for (i = 0; i < map->num_stripes; ++i) {
>>>>>> if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
>>>>>> map->stripes[i].physical == dev_offset) {
>>>>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>> if (!cache)
>>>>>> goto skip;
>>>>>>
>>>>>> + /*
>>>>>> + * Since dev extents modification is delayed, it's possible
>>>>>> + * we got a bg which doesn't really own this dev extent.
>>>>>
>>>>> Same as before. Too confusing, what is delayed dev extent modification?
>>>>>
>>>>> Once added, device extents can only be deleted, they are never modified.
>>>>> I think you are referring to the deletions and the fact we use the commit
>>>>> roots to find extents, but that should be a more clear comment, without
>>>>> such level of ambiguity.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> + *
>>>>>> + * Here just do a quick skip, scrub_chunk() has a more
>>>>>> + * comprehensive check in it.
>>>>>> + */
>>>>>> + if (cache->start != chunk_offset) {
>>>>>> + btrfs_put_block_group(cache);
>>>>>> + goto skip;
>>>>>> + }
>>>>>> +
>>>>>> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>>>>> spin_lock(&cache->lock);
>>>>>> if (!cache->to_copy) {
>>>>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>>>>> dev_replace->item_needs_writeback = 1;
>>>>>> up_write(&dev_replace->rwsem);
>>>>>>
>>>>>> - ASSERT(cache->start == chunk_offset);
>>>>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>>>>> dev_extent_len);
>>>>>>
>>>>>> --
>>>>>> 2.35.1
>>>>>>
>
Thread overview: 9+ messages
2022-02-17 9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24 ` Qu Wenruo
2022-02-17 12:32 ` Filipe Manana
2022-02-17 13:00 ` Qu Wenruo
2022-02-17 13:02 ` Filipe Manana
2022-02-17 13:25 ` Qu Wenruo [this message]
2022-02-18 2:29 ` Su Yue
2022-04-19 20:34 ` David Sterba