From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH] btrfs: fix assertion failure during scrub due to block group reallocation
Date: Wed, 20 Apr 2022 17:52:41 +0800 [thread overview]
Message-ID: <3782b02b-0121-86d2-abc0-28199183ec40@gmx.com> (raw)
In-Reply-To: <CAL3q7H5kXdiuBpDtish83+-XABHxc7DsuY2ceZQd3g59uH_aBA@mail.gmail.com>
On 2022/4/20 17:50, Filipe Manana wrote:
> On Wed, Apr 20, 2022 at 3:24 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/4/19 21:23, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> During a scrub, or device replace, we can race with block group removal
>>> and allocation and trigger the following assertion failure:
>>>
>>> [ 7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
>>> [ 7526.387351] ------------[ cut here ]------------
>>> [ 7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
>>> [ 7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
>>> [ 7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
>>> [ 7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>> [ 7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
>>> [ 7526.393520] Code: f3 48 c7 c7 20 (...)
>>> [ 7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
>>> [ 7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
>>> [ 7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
>>> [ 7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
>>> [ 7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
>>> [ 7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
>>> [ 7526.402874] FS: 00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
>>> [ 7526.404038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
>>> [ 7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 7526.408169] Call Trace:
>>> [ 7526.408529] <TASK>
>>> [ 7526.408839] scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
>>> [ 7526.409690] ? do_wait_intr_irq+0xb0/0xb0
>>> [ 7526.410276] btrfs_scrub_dev+0x226/0x620 [btrfs]
>>> [ 7526.410995] ? preempt_count_add+0x49/0xa0
>>> [ 7526.411592] btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
>>> [ 7526.412278] ? __fget_files+0xc9/0x1b0
>>> [ 7526.412825] ? kvm_sched_clock_read+0x14/0x40
>>> [ 7526.413459] ? lock_release+0x155/0x4a0
>>> [ 7526.414022] ? __x64_sys_ioctl+0x83/0xb0
>>> [ 7526.414601] __x64_sys_ioctl+0x83/0xb0
>>> [ 7526.415150] do_syscall_64+0x3b/0xc0
>>> [ 7526.415675] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [ 7526.416408] RIP: 0033:0x7f6c99d34397
>>> [ 7526.416931] Code: 3c 1c e8 1c ff (...)
>>> [ 7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [ 7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
>>> [ 7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
>>> [ 7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
>>> [ 7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
>>> [ 7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
>>> [ 7526.425950] </TASK>
>>>
>>> That assertion is relatively new, introduced with commit d04fbe19aefd2
>>> ("btrfs: scrub: cleanup the argument list of scrub_chunk()").
>>>
>>> The block group we get at scrub_enumerate_chunks() can actually have a
>>> start address that is smaller then the chunk offset we extracted from a
>>> device extent item we got from the commit root of the device tree.
>>> This is very rare, but it can happen due to a race with block group
>>> removal and allocation. For example, the following steps show how this
>>> can happen:
>>>
>>> 1) We are at transaction T, and we have the following blocks groups,
>>> sorted by their logical start address:
>>>
>>> [ bg A, start address A, length 1G (data) ]
>>> [ bg B, start address B, length 1G (data) ]
>>> (...)
>>> [ bg W, start address W, length 1G (data) ]
>>>
>>> --> logical address space hole of 256M,
>>> there used to be a 256M metadata block group here
>>>
>>> [ bg Y, start address Y, length 256M (metadata) ]
>>>
>>> --> Y matches W's end offset + 256M
>>>
>>> Block group Y is the block group with the highest logical address in
>>> the whole filesystem;
>>>
>>> 2) Block group Y is deleted and its extent mapping is removed by the call
>>> to remove_extent_mapping() made from btrfs_remove_block_group().
>>>
>>> So after this point, the last element of the mapping red black tree,
>>> its rightmost node, is the mapping for block group W;
>>>
>>> 3) While still at transaction T, a new data block group is allocated,
>>> with a length of 1G. When creating the block group we do a call to
>>> find_next_chunk(), which returns the logical start address for the
>>> new block group. This calls returns X, which corresponds to the
>>> end offset of the last block group, the rightmost node in the mapping
>>> red black tree (fs_info->mapping_tree), plus one.
>>>
>>> So we get a new block group that starts at logical address X and with
>>> a length of 1G. It spans over the whole logical range of the old block
>>> group Y, that was previously removed in the same transaction.
>>>
>>> However the device extent allocated to block group X is not the same
>>> device extent that was used by block group Y, and it also does not
>>> overlap that extent, which must be always the case because we allocate
>>> extents by searching through the commit root of the device tree
>>> (otherwise it could corrupt a filesystem after a power failure or
>>> an unclean shutdown in general), so the extent allocator is behaving
>>> as expected;
>>>
>>> 4) We have a task running scrub, currently at scrub_enumerate_chunks().
>>> There it searches for device extent items in the device tree, using
>>> its commit root. It finds a device extent item that was used by
>>> block group Y, and it extracts the value Y from that item into the
>>> local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();
>>
>> That perfectly explains the problem I hit.
>
> Wasn't it Su who hit it?
Right, it's Su...
As I can't even reproduce it...
Thanks,
Qu
>
>>
>> Thank you very much for further pin down the cause.
>>
>>>
>>> It then calls btrfs_lookup_block_group() to find block group for
>>> the logical address Y - since there's currently no block group that
>>> starts at that logical address, it returns block group X, because
>>> its range contains Y.
>>>
>>> This results in triggering the assertion:
>>>
>>> ASSERT(cache->start == chunk_offset);
>>>
>>> right before calling scrub_chunk(), as cache->start is X and
>>> chunk_offset is Y.
>>>
>>> This is more likely to happen of filesystems not larger than 50G, because
>>> for these filesystems we use a 256M size for metadata block groups and
>>> a 1G size for data block groups, while for filesystems larger than 50G,
>>> we use a 1G size for both data and metadata block groups (except for
>>> zoned filesystems). It could also happen on any filesystem size due to
>>> the fact that system block groups are always smaller (32M) than both
>>> data and metadata block groups, but these are not frequently deleted, so
>>> much less likely to trigger the race.
>>>
>>> So make scrub skip any block group with a start offset that is less than
>>> the value we expect, as that means it's a new block group that was created
>>> in the current transaction. It's pointless to continue and try to scrub
>>> its extents, because scrub searches for extents using the commit root, so
>>> it won't find any. For a device replace, skip it as well for the same
>>> reasons, and we don't need to worry about the possibility of extents of
>>> the new block group not being to the new device, because we have the write
>>> duplication setup done through btrfs_map_block().
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>
>> The offending commit is relatively new, do we need to Cc to stable just
>> for v5.17 kernel?
>
> David added that to the changelog in misc-next.
>
>>
>> Reviewed-by: Qu Wenruo <wqu@suse.com>
>>
>> Thanks,
>> Qu
>>
>>> ---
>>> fs/btrfs/dev-replace.c | 7 ++++++-
>>> fs/btrfs/scrub.c | 26 +++++++++++++++++++++++++-
>>> 2 files changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index 677b99e66c21..a7dd6ba25e99 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -707,7 +707,12 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
>>>
>>> btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
>>>
>>> - /* Commit dev_replace state and reserve 1 item for it. */
>>> + /*
>>> + * Commit dev_replace state and reserve 1 item for it.
>>> + * This is crucial to ensure we won't miss copying extents for new block
>>> + * groups that are allocated after we started the device replace, and
>>> + * must be done after setting up the device replace state.
>>> + */
>>> trans = btrfs_start_transaction(root, 1);
>>> if (IS_ERR(trans)) {
>>> ret = PTR_ERR(trans);
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index 13ba458c080c..b79a3221d7af 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -3674,6 +3674,31 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>> if (!cache)
>>> goto skip;
>>>
>>> + ASSERT(cache->start <= chunk_offset);
>>> + /*
>>> + * We are using the commit root to search for device extents, so
>>> + * that means we could have found a device extent item from a
>>> + * block group that was deleted in the current transaction. The
>>> + * logical start offset of the deleted block group, stored at
>>> + * @chunk_offset, might be part of the logical address range of
>>> + * a new block group (which uses different physical extents).
>>> + * In this case btrfs_lookup_block_group() has returned the new
>>> + * block group, and its start address is less than @chunk_offset.
>>> + *
>>> + * We skip such new block groups, because it's pointless to
>>> + * process them, as we won't find their extents because we search
>>> + * for them using the commit root of the extent tree. For a device
>>> + * replace it's also fine to skip it, we won't miss copying them
>>> + * to the target device because we have the write duplication
>>> + * setup through the regular write path (by btrfs_map_block()),
>>> + * and we have committed a transaction when we started the device
>>> + * replace, right after setting up the device replace state.
>>> + */
>>> + if (cache->start < chunk_offset) {
>>> + btrfs_put_block_group(cache);
>>> + goto skip;
>>> + }
>>> +
>>> if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) {
>>> spin_lock(&cache->lock);
>>> if (!cache->to_copy) {
>>> @@ -3797,7 +3822,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
>>> dev_replace->item_needs_writeback = 1;
>>> up_write(&dev_replace->rwsem);
>>>
>>> - ASSERT(cache->start == chunk_offset);
>>> ret = scrub_chunk(sctx, cache, scrub_dev, found_key.offset,
>>> dev_extent_len);
>>>
next prev parent reply other threads:[~2022-04-20 9:52 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-19 13:23 [PATCH] btrfs: fix assertion failure during scrub due to block group reallocation fdmanana
2022-04-19 18:02 ` David Sterba
2022-04-20 2:24 ` Qu Wenruo
2022-04-20 9:50 ` Filipe Manana
2022-04-20 9:52 ` Qu Wenruo [this message]
2022-04-20 14:36 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3782b02b-0121-86d2-abc0-28199183ec40@gmx.com \
--to=quwenruo.btrfs@gmx.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox