From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Daniel Vacek <neelx@suse.com>, Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
stable@vger.kernel.org, Boris Burkov <boris@bur.io>
Subject: Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
Date: Fri, 25 Apr 2025 06:27:33 +0930 [thread overview]
Message-ID: <d874c64c-34fc-4d3a-abf9-19625bddd213@gmx.com> (raw)
In-Reply-To: <CAPjX3FeSOJVo=4hyPaHp3svLorWXp2SGhMEB17+Qm3OLmireSw@mail.gmail.com>
在 2025/4/24 20:10, Daniel Vacek 写道:
> On Mon, 21 Apr 2025 at 15:38, Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> When running machines with 64k page size and a 16k nodesize we started
>> seeing tree log corruption in production. This turned out to be because
>> we were not writing out dirty blocks sometimes, so this in fact affects
>> all metadata writes.
>>
>> When writing out a subpage EB we scan the subpage bitmap for a dirty
>> range. If the range isn't dirty we do
>>
>> bit_start++;
>>
>> to move onto the next bit. The problem is the bitmap is based on the
>> number of sectors that an EB has. So in this case, we have a 64k
>> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
>> bits for every node. With a 64k page size we end up with 4 nodes per
>> page.
>>
>> To make this easier this is how everything looks
>>
>> [0 16k 32k 48k ] logical address
>> [0 4 8 12 ] radix tree offset
>> [ 64k page ] folio
>> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
>> [ | | | | | | | | | | | | | | | | ] bitmap
>>
>> Now we use all of our addressing based on fs_info->sectorsize_bits, so
>> as you can see the above our 16k eb->start turns into radix entry 4.
>
> Btw, unrelated to this patch - but this way we're using at best only
> 25% of the tree slots. Or in other words we're wasting 75% of the
> memory here. We should rather use eb->start / fs_info->nodesize for
> the tree index.
That requires all tree blocks to be nodesize aligned.
We're already working towards that direction, but we need to be cautious
to reject those non-nodesize aligned tree blocks, as end users won't be
happy that their fsese no longer mount after a kernel update.
So as usual, kernel warning message first, btrfs check reports as error,
then experimental rejection, and finally push to end users.
>
> And by the other way - why do we need a copy of nodesize in eb->len?
> We can always eb->fs_info->nodesize if needed.
Because there are pseudo "extent buffers" in the past, like accessing
super blocks using extent buffer helpers.
In that case we need a length other than node size but super block size.
But that is no longer the case, feel free to clean it up.
Thanks,
Qu
>
>> When we find a dirty range for our eb, we correctly do bit_start +=
>> sectors_per_node, because if we start at bit 0, the next bit for the
>> next eb is 4, to correspond to eb->start 16k.
>>
>> However if our range is clean, we will do bit_start++, which will now
>> put us offset from our radix tree entries.
>>
>> In our case, assume that the first time we check the bitmap the block is
>> not dirty, we increment bit_start so now it == 1, and then we loop
>> around and check again. This time it is dirty, and we go to find that
>> start using the following equation
>>
>> start = folio_start + bit_start * fs_info->sectorsize;
>>
>> so in the case above, eb->start 0 is now dirty, and we calculate start
>> as
>>
>> 0 + 1 * fs_info->sectorsize = 4096
>> 4096 >> 12 = 1
>>
>> Now we're looking up the radix tree for 1, and we won't find an eb.
>> What's worse is now we're using bit_start == 1, so we do bit_start +=
>> sectors_per_node, which is now 5. If that eb is dirty we will run into
>> the same thing, we will look at an offset that is not populated in the
>> radix tree, and now we're skipping the writeout of dirty extent buffers.
>>
>> The best fix for this is to not use sectorsize_bits to address nodes,
>> but that's a larger change. Since this is a fs corruption problem fix
>> it simply by always using sectors_per_node to increment the start bit.
>>
>> cc: stable@vger.kernel.org
>> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
>> Reviewed-by: Boris Burkov <boris@bur.io>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> - Further testing indicated that the page tagging theoretical race isn't getting
>> hit in practice, so we're going to limit the "hotfix" to this specific patch,
>> and then send subsequent patches to address the other issues we're hitting. My
>> simplify metadata writebback patches are the more wholistic fix.
>>
>> fs/btrfs/extent_io.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5f08615b334f..6cfd286b8bbc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
>> subpage->bitmaps)) {
>> spin_unlock_irqrestore(&subpage->lock, flags);
>> spin_unlock(&folio->mapping->i_private_lock);
>> - bit_start++;
>> + bit_start += sectors_per_node;
>> continue;
>> }
>>
>> --
>> 2.48.1
>>
>>
>
next prev parent reply other threads:[~2025-04-24 20:57 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-21 13:37 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
2025-04-24 20:57 ` Qu Wenruo [this message]
-- strict thread matches above, loose matches on Subject: below --
2025-04-14 19:08 Josef Bacik
2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37 ` Qu Wenruo
2025-04-15 16:16 ` Boris Burkov
2025-04-15 23:49 ` Qu Wenruo
2025-04-16 14:16 ` Josef Bacik
2025-04-16 22:07 ` Qu Wenruo
2025-04-15 17:25 ` Josef Bacik
2025-04-16 0:08 ` Qu Wenruo
2025-04-15 17:23 ` Josef Bacik
2025-04-15 23:53 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d874c64c-34fc-4d3a-abf9-19625bddd213@gmx.com \
--to=quwenruo.btrfs@gmx.com \
--cc=boris@bur.io \
--cc=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=neelx@suse.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox