public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Su Yue <l@damenly.su>
To: Filipe Manana <fdmanana@kernel.org>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH v2] btrfs: remove an ASSERT() which can cause false alert
Date: Fri, 18 Feb 2022 10:29:48 +0800	[thread overview]
Message-ID: <5ypcrj51.fsf@damenly.su> (raw)
In-Reply-To: <CAL3q7H6bZLPqGboFN9KDs717puFBXZ9Q-UknWxJ-sD9NZ9TpGw@mail.gmail.com>


On Thu 17 Feb 2022 at 13:02, Filipe Manana <fdmanana@kernel.org> 
wrote:

> On Thu, Feb 17, 2022 at 1:00 PM Qu Wenruo 
> <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/2/17 20:32, Filipe Manana wrote:
>> > On Thu, Feb 17, 2022 at 11:24 AM Qu Wenruo 
>> > <quwenruo.btrfs@gmx.com> wrote:
>> >>
>> >>
>> >>
>> >> On 2022/2/17 19:14, Filipe Manana wrote:
>> >>> On Thu, Feb 17, 2022 at 05:49:34PM +0800, Qu Wenruo wrote:
>> >>>> [BUG]
>> >>>> Su reported that with his VM running on an M1 mac, it has 
>> >>>> a high chance
>> >>>> to trigger the following ASSERT() with scrub and balance 
>> >>>> test cases:
>> >>>>
>> >>>>               ASSERT(cache->start == chunk_offset);
>> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, 
>> >>>>               found_key.offset,
>> >>>>                                 dev_extent_len);
>> >>>>
>> >>>> [CAUSE]
>> >>>> The ASSERT() is making sure that the block group cache 
>> >>>> (@cache) and the
>> >>>> chunk offset from the device extent match.
>> >>>>
>> >>>> But the truth is, block group caches and dev extent items 
>> >>>> are deleted at
>> >>>> different timing.
>> >>>>
>> >>>> For block group caches and items, after 
>> >>>> btrfs_remove_block_group() they
>> >>>> will be marked deleted, but the device extent and the 
>> >>>> chunk item is
>> >>>> still in dev tree and chunk tree respectively.
>> >>>
>> >>> This is not correct, what happens is:
>> >>>
>> >>> 1) btrfs_remove_chunk() will remove the device extent items 
>> >>> and the chunk
>> >>>      item;
>> >>>
>> >>> 2) It then calls btrfs_remove_block_group(). This will not 
>> >>> remove the
>> >>>      extent map if the block group was frozen (due to trim, 
>> >>>      and scrub
>> >>>      itself). But it will remove the block group (struct 
>> >>>      btrfs_block_group)
>> >>>      from the red black tree of block groups, and mark the 
>> >>>      block group
>> >>>      as deleted (set struct btrfs_block_group::removed to 
>> >>>      1).
>> >>>
>> >>>      If the extent map was not removed, meaning the block 
>> >>>      group is frozen,
>> >>>      then no one will be able to create a new block group 
>> >>>      with the same logical
>> >>>      address before the block group is unfrozen (by someone 
>> >>>      who is holding
>> >>>      a reference on it). So a lookup on the red black tree 
>> >>>      will return NULL
>> >>>      for the logical address until the block group is 
>> >>>      unfrozen and its
>> >>>      logical address is reused for a new block group.
>> >>>
>> >>> So the ordering of what you are saying is reversed - the 
>> >>> device extent
>> >>> and chunk items are removed before marking the block group 
>> >>> as deleted.
>> >>>
>> >>>>
>> >>>> The device extents and chunk items are only deleted in
>> >>>> btrfs_remove_chunk(), which happens either a 
>> >>>> btrfs_delete_unused_bgs()
>> >>>> or btrfs_relocate_chunk().
>> >>>
>> >>> This is also confusing because it gives a wrong idea of the 
>> >>> ordering.
>> >>> The device extents and chunk items are removed by 
>> >>> btrfs_remove_chunk(),
>> >>> which happens before marking a block group as deleted.
>> >>>
>> >>>>
>> >>>> This timing difference leaves a window where we can have a 
>> >>>> mismatch
>> >>>> between block group cache and device extent tree, and 
>> >>>> triggering the
>> >>>> ASSERT().
>> >>>
>> >>> I would like to see more details on this. The sequence of 
>> >>> steps between
>> >>> two tasks that result in this assertion being triggered.
>> >>>
>> >>>>
>> >>>> [FIX]
>> >>>>
>> >>>> - Remove the ASSERT()
>> >>>>
>> >>>> - Add a quick exit if our found bg doesn't match 
>> >>>> @chunk_offset
>> >>>
>> >>> This shouldn't happen. I would like to know the sequence of 
>> >>> steps
>> >>> between 2 (or more) tasks that leads to that.
>> >>>
>> >>> We are getting the block group with 
>> >>> btrfs_lookup_block_group() at
>> >>> scrub_enumerate_chunks(), that calls 
>> >>> block_group_cache_tree_search()
>> >>> with the "contains" argument set to 1, meaning it can 
>> >>> return a
>> >>> block group that contains the given bytenr but does not 
>> >>> start at
>> >>> that bytenr necessarily.
>> >>>
>> >>> This gives me the idea a small block group was deleted and 
>> >>> then a
>> >>> new one was allocated which starts at a lower logical 
>> >>> offset and
>> >>> includes "chunk_offset" (ends beyond that).
>> >>>
>> >>> We should probably be using 
>> >>> btrfs_lookup_first_block_group() at
>> >>> scrub_enumerate_chunks(), so it looks for a block group 
>> >>> that starts
>> >>> exactly at the given logical address.
>> >>>
>> >>> But anyway, as I read this and see the patch's diff, this 
>> >>> feels a lot
>> >>> like a "bail out if something unexpected happens but we 
>> >>> don't know
>> >>> exactly why/how that is possible".
>> >>
>> >> You're completely correct.
>> >>
>> >> Unfortunately I have no way to reproduce the bug on x86_64 
>> >> VMs, and all
>> >> my ARM VMs are too slow to reproduce.
>> >>
>> >> Furthermore, due to some unrelated bugs, v5.17-rc based 
>> >> kernels all
>> >> failed to boot inside the VMs on Apple M1.
>> >>
>> >> So unfortunately I don't have extra details for now.
>> >>
>> >> Will find a way to reproduce first, then update the patch 
>> >> with better
>> >> comments.
>> >
>> > Was it triggered with some specific test case from fstests?
>>
>> Yes, from a regular run on fstests.
>
> Ok, but which specific test(s)?
>

It is btrfs/074. It was found during early 16k subpage support 
development
of Qu. Unfortunately dmesg is lost and it can't be reproduced from 
old
branch fetched from `git reflog`.

I'd report back if the issue occurs again.

--
Su
>>
>> But that was on an older branch (some older misc-next, which 
>> has the
>> patch but still based on v5.16).
>>
>> The v5.17-rc based branch will not boot inside the ARM vm at 
>> all, and no
>> early boot messages after the EFI routine.
>>
>> Will attack the boot problem first.
>>
>> Thanks,
>> Qu
>> >
>> >>
>> >> Thanks,
>> >> Qu
>> >>
>> >>>
>> >>>>
>> >>>> - Add a comment on the existing checks in scrub_chunk()
>> >>>>     This is the ultimate safenet, as it will iterate 
>> >>>>     through all the stripes
>> >>>>     of the found chunk.
>> >>>>     And only scrub the stripe if it matches both device 
>> >>>>     and physical
>> >>>>     offset.
>> >>>>
>> >>>>     So even by some how we got a dev extent which is no 
>> >>>>     longer owned
>> >>>>     by the target chunk, we will just skip this chunk 
>> >>>>     completely, without
>> >>>>     any consequence.
>> >>>>
>> >>>> Reported-by: Su Yue <l@damenly.su>
>> >>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> >>>> ---
>> >>>> Changelog:
>> >>>> v2:
>> >>>> - Add a new quick exit with extra comments
>> >>>>
>> >>>> - Add a new comment in the existing safenet in 
>> >>>> scrub_chunk()
>> >>>> ---
>> >>>>    fs/btrfs/scrub.c | 21 ++++++++++++++++++++-
>> >>>>    1 file changed, 20 insertions(+), 1 deletion(-)
>> >>>>
>> >>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> >>>> index 11089568b287..1c3ed4720ddd 100644
>> >>>> --- a/fs/btrfs/scrub.c
>> >>>> +++ b/fs/btrfs/scrub.c
>> >>>> @@ -3578,6 +3578,14 @@ static noinline_for_stack int 
>> >>>> scrub_chunk(struct scrub_ctx *sctx,
>> >>>>               goto out;
>> >>>>
>> >>>>       map = em->map_lookup;
>> >>>> +
>> >>>> +    /*
>> >>>> +     * Due to the delayed modification of device tree, 
>> >>>> this chunk
>> >>>> +     * may not own this dev extent.
>> >>>
>> >>> This is confusing. What delayed modification, what do you 
>> >>> mean by it
>> >>> exactly? Same below, with more details.
>> >>>
>> >>>> +     *
>> >>>> +     * So we need to iterate through all stripes, and 
>> >>>> only scrub
>> >>>> +     * the stripe which matches both device and physical 
>> >>>> offset.
>> >>>> +     */
>> >>>>       for (i = 0; i < map->num_stripes; ++i) {
>> >>>>               if (map->stripes[i].dev->bdev == 
>> >>>>               scrub_dev->bdev &&
>> >>>>                   map->stripes[i].physical == dev_offset) 
>> >>>>                   {
>> >>>> @@ -3699,6 +3707,18 @@ int scrub_enumerate_chunks(struct 
>> >>>> scrub_ctx *sctx,
>> >>>>               if (!cache)
>> >>>>                       goto skip;
>> >>>>
>> >>>> +            /*
>> >>>> +             * Since dev extents modification is delayed, 
>> >>>> it's possible
>> >>>> +             * we got a bg which doesn't really own this 
>> >>>> dev extent.
>> >>>
>> >>> Same as before. Too confusing, what is delayed dev extent 
>> >>> modification?
>> >>>
>> >>> Once added, device extents can only be deleted, they are 
>> >>> never modified.
>> >>> I think you are referring to the deletions and the fact we 
>> >>> use the commit
>> >>> roots to find extents, but that should be a more clear 
>> >>> comment, without
>> >>> such level of ambiguity.
>> >>>
>> >>> Thanks.
>> >>>
>> >>>> +             *
>> >>>> +             * Here just do a quick skip, scrub_chunk() 
>> >>>> has a more
>> >>>> +             * comprehensive check in it.
>> >>>> +             */
>> >>>> +            if (cache->start != chunk_offset) {
>> >>>> +                    btrfs_put_block_group(cache);
>> >>>> +                    goto skip;
>> >>>> +            }
>> >>>> +
>> >>>>               if (sctx->is_dev_replace && 
>> >>>>               btrfs_is_zoned(fs_info)) {
>> >>>>                       spin_lock(&cache->lock);
>> >>>>                       if (!cache->to_copy) {
>> >>>> @@ -3822,7 +3842,6 @@ int scrub_enumerate_chunks(struct 
>> >>>> scrub_ctx *sctx,
>> >>>>               dev_replace->item_needs_writeback = 1;
>> >>>>               up_write(&dev_replace->rwsem);
>> >>>>
>> >>>> -            ASSERT(cache->start == chunk_offset);
>> >>>>               ret = scrub_chunk(sctx, cache, scrub_dev, 
>> >>>>               found_key.offset,
>> >>>>                                 dev_extent_len);
>> >>>>
>> >>>> --
>> >>>> 2.35.1
>> >>>>

  parent reply	other threads:[~2022-02-18  2:41 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-17  9:49 [PATCH v2] btrfs: remove an ASSERT() which can cause false alert Qu Wenruo
2022-02-17 11:14 ` Filipe Manana
2022-02-17 11:24   ` Qu Wenruo
2022-02-17 12:32     ` Filipe Manana
2022-02-17 13:00       ` Qu Wenruo
2022-02-17 13:02         ` Filipe Manana
2022-02-17 13:25           ` Qu Wenruo
2022-02-18  2:29           ` Su Yue [this message]
2022-04-19 20:34 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5ypcrj51.fsf@damenly.su \
    --to=l@damenly.su \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox