Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Daniel Vacek <neelx@suse.com>, Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com,
	stable@vger.kernel.org, Boris Burkov <boris@bur.io>
Subject: Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
Date: Fri, 25 Apr 2025 06:27:33 +0930	[thread overview]
Message-ID: <d874c64c-34fc-4d3a-abf9-19625bddd213@gmx.com> (raw)
In-Reply-To: <CAPjX3FeSOJVo=4hyPaHp3svLorWXp2SGhMEB17+Qm3OLmireSw@mail.gmail.com>



在 2025/4/24 20:10, Daniel Vacek 写道:
> On Mon, 21 Apr 2025 at 15:38, Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> When running machines with 64k page size and a 16k nodesize we started
>> seeing tree log corruption in production.  This turned out to be because
>> we were not writing out dirty blocks sometimes, so this in fact affects
>> all metadata writes.
>>
>> When writing out a subpage EB we scan the subpage bitmap for a dirty
>> range.  If the range isn't dirty we do
>>
>> bit_start++;
>>
>> to move onto the next bit.  The problem is the bitmap is based on the
>> number of sectors that an EB has.  So in this case, we have a 64k
>> pagesize, 16k nodesize, but a 4k sectorsize.  This means our bitmap is 4
>> bits for every node.  With a 64k page size we end up with 4 nodes per
>> page.
>>
>> To make this easier this is how everything looks
>>
>> [0         16k       32k       48k     ] logical address
>> [0         4         8         12      ] radix tree offset
>> [               64k page               ] folio
>> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
>> [ | | | |  | | | |   | | | |   | | | | ] bitmap
>>
>> Now we use all of our addressing based on fs_info->sectorsize_bits, so
>> as you can see the above our 16k eb->start turns into radix entry 4.
> 
> Btw, unrelated to this patch - but this way we're using at best only
> 25% of the tree slots. Or in other words we're wasting 75% of the
> memory here. We should rather use eb->start / fs_info->nodesize for
> the tree index.

That requires all tree blocks to be nodesize aligned.

We're already working towards that direction, but we need to be cautious 
to reject those non-nodesize aligned tree blocks, as end users won't be 
happy that their fsese no longer mount after a kernel update.

So as usual, kernel warning message first, btrfs check reports as error, 
then experimental rejection, and finally push to end users.

> 
> And by the other way - why do we need a copy of nodesize in eb->len?
> We can always eb->fs_info->nodesize if needed.

Because there are pseudo "extent buffers" in the past, like accessing 
super blocks using extent buffer helpers.

In that case we need a length other than node size but super block size.

But that is no longer the case, feel free to clean it up.

Thanks,
Qu

> 
>> When we find a dirty range for our eb, we correctly do bit_start +=
>> sectors_per_node, because if we start at bit 0, the next bit for the
>> next eb is 4, to correspond to eb->start 16k.
>>
>> However if our range is clean, we will do bit_start++, which will now
>> put us offset from our radix tree entries.
>>
>> In our case, assume that the first time we check the bitmap the block is
>> not dirty, we increment bit_start so now it == 1, and then we loop
>> around and check again.  This time it is dirty, and we go to find that
>> start using the following equation
>>
>> start = folio_start + bit_start * fs_info->sectorsize;
>>
>> so in the case above, eb->start 0 is now dirty, and we calculate start
>> as
>>
>> 0 + 1 * fs_info->sectorsize = 4096
>> 4096 >> 12 = 1
>>
>> Now we're looking up the radix tree for 1, and we won't find an eb.
>> What's worse is now we're using bit_start == 1, so we do bit_start +=
>> sectors_per_node, which is now 5.  If that eb is dirty we will run into
>> the same thing, we will look at an offset that is not populated in the
>> radix tree, and now we're skipping the writeout of dirty extent buffers.
>>
>> The best fix for this is to not use sectorsize_bits to address nodes,
>> but that's a larger change.  Since this is a fs corruption problem fix
>> it simply by always using sectors_per_node to increment the start bit.
>>
>> cc: stable@vger.kernel.org
>> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
>> Reviewed-by: Boris Burkov <boris@bur.io>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> - Further testing indicated that the page tagging theoretical race isn't getting
>>    hit in practice, so we're going to limit the "hotfix" to this specific patch,
>>    and then send subsequent patches to address the other issues we're hitting. My
>>    simplify metadata writebback patches are the more wholistic fix.
>>
>>   fs/btrfs/extent_io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5f08615b334f..6cfd286b8bbc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
>>                                subpage->bitmaps)) {
>>                          spin_unlock_irqrestore(&subpage->lock, flags);
>>                          spin_unlock(&folio->mapping->i_private_lock);
>> -                       bit_start++;
>> +                       bit_start += sectors_per_node;
>>                          continue;
>>                  }
>>
>> --
>> 2.48.1
>>
>>
>

next prev parent reply	other threads:[~2025-04-24 20:57 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-21 13:37 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
2025-04-24 20:57   ` Qu Wenruo [this message]
  -- strict thread matches above, loose matches on Subject: below --
2025-04-14 19:08 Josef Bacik
2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37   ` Qu Wenruo
2025-04-15 16:16     ` Boris Burkov
2025-04-15 23:49       ` Qu Wenruo
2025-04-16 14:16         ` Josef Bacik
2025-04-16 22:07           ` Qu Wenruo
2025-04-15 17:25     ` Josef Bacik
2025-04-16  0:08       ` Qu Wenruo
2025-04-15 17:23   ` Josef Bacik
2025-04-15 23:53     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d874c64c-34fc-4d3a-abf9-19625bddd213@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=boris@bur.io \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=neelx@suse.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox