From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Josef Bacik <josef@toxicpanda.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Cc: stable@vger.kernel.org, Boris Burkov <boris@bur.io>
Subject: Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
Date: Tue, 22 Apr 2025 05:47:04 +0930 [thread overview]
Message-ID: <1ffd7ba1-8e81-4dbc-8ff0-a2c376b7ffed@gmx.com> (raw)
In-Reply-To: <0914bad6138d2cfafc9cfe762bd06c1883ceb9d2.1745242535.git.josef@toxicpanda.com>
On 2025/4/21 23:07, Josef Bacik wrote:
> When running machines with a 64k page size and a 16k nodesize we started
> seeing tree log corruption in production. This turned out to be because
> we were sometimes not writing out dirty blocks, so this in fact affects
> all metadata writes.
>
> When writing out a subpage EB we scan the subpage bitmap for a dirty
> range. If the range isn't dirty we do
>
> bit_start++;
>
> to move onto the next bit. The problem is that the bitmap is indexed by
> sector, not by node. In this case we have a 64k pagesize, a 16k
> nodesize, and a 4k sectorsize. This means our bitmap has 4 bits for
> every node, and with a 64k page size we end up with 4 nodes per page.
>
> To make this easier this is how everything looks
>
> [0 16k 32k 48k ] logical address
> [0 4 8 12 ] radix tree offset
> [ 64k page ] folio
> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> [ | | | | | | | | | | | | | | | | ] bitmap
>
> Now all of our addressing is based on fs_info->sectorsize_bits, so as
> you can see above, our 16k eb->start turns into radix entry 4.
>
> When we find a dirty range for our eb, we correctly do bit_start +=
> sectors_per_node, because if we start at bit 0, the next bit for the
> next eb is 4, to correspond to eb->start 16k.
>
> However if our range is clean, we will do bit_start++, which will now
> put us offset from our radix tree entries.
>
> In our case, assume the block is not dirty the first time we check the
> bitmap: we increment bit_start so that it is now 1, then loop around
> and check again. This time the range is dirty, and we go to find its
> start using the following equation
>
> start = folio_start + bit_start * fs_info->sectorsize;
>
> so in the case above, eb->start 0 is now dirty, and we calculate start
> as
>
> 0 + 1 * fs_info->sectorsize = 4096
> 4096 >> 12 = 1
>
> Now we're looking up the radix tree for 1, and we won't find an eb.
> What's worse, we're now at bit_start == 1, so we do bit_start +=
> sectors_per_node, which makes it 5. If that eb is dirty we run into the
> same thing: we look at an offset that is not populated in the radix
> tree, and so we skip the writeout of dirty extent buffers.
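The drift described above can be checked with a minimal sketch (Python, not the kernel code); the geometry (4k sectorsize, 16k nodesize, 64k pagesize) and the "first range is clean" assumption match the example in the commit message:

```python
# Simulate the subpage dirty-bitmap walk with the assumed geometry:
# 4k sectorsize, 16k nodesize, 64k pagesize.
SECTORSIZE = 4096
NODESIZE = 16 * 1024
PAGESIZE = 64 * 1024

sectors_per_node = NODESIZE // SECTORSIZE   # 4 bits per eb
bits_per_page = PAGESIZE // SECTORSIZE      # 16 bits per folio

# Valid radix tree offsets are the eb start bits: 0, 4, 8, 12.
valid_offsets = set(range(0, bits_per_page, sectors_per_node))

# Buggy walk: the first range is clean, so we advance by one bit;
# every later range is dirty, so we advance by sectors_per_node.
bit_start = 0
visited = []
while bit_start < bits_per_page:
    visited.append(bit_start)
    bit_start += 1 if bit_start == 0 else sectors_per_node

print(visited)                                 # [0, 1, 5, 9, 13]
print([b in valid_offsets for b in visited])   # only bit 0 is a real eb
```

After the single bit_start++, every subsequent offset the loop visits (1, 5, 9, 13) misses the radix tree entries at 4, 8, and 12, which is exactly the skipped writeout.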
>
> The best fix for this is to not use sectorsize_bits to address nodes,
> but that's a larger change. Since this is a fs corruption problem, fix
> it simply by always using sectors_per_node to increment the start bit.
>
> cc: stable@vger.kernel.org
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
I'm still a little concerned about unaligned tree blocks, but in
practice there should rarely be any filesystem converted in the v4.6 era
that never received a full balance and is still in use.
So let's get the fix done and backported first, then reject unaligned
tree blocks completely.
Thanks,
Qu
> ---
> - Further testing indicated that the theoretical page tagging race isn't
> getting hit in practice, so we're going to limit the "hotfix" to this
> specific patch, and then send subsequent patches to address the other
> issues we're hitting. My metadata writeback simplification patches are
> the more holistic fix.
>
> fs/btrfs/extent_io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5f08615b334f..6cfd286b8bbc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> subpage->bitmaps)) {
> spin_unlock_irqrestore(&subpage->lock, flags);
> spin_unlock(&folio->mapping->i_private_lock);
> - bit_start++;
> + bit_start += sectors_per_node;
> continue;
> }
>
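With the one-line change above, a clean range advances by the same stride as a dirty one, so every bit the loop visits is a real eb boundary. A small sketch under the same assumed geometry (4k sectorsize, 16k nodesize, 64k pagesize):

```python
# Fixed walk: advance by sectors_per_node whether the range is clean or
# dirty, so bit_start only ever lands on eb start bits.
sectors_per_node = 16 * 1024 // 4096   # 4
bits_per_page = 64 * 1024 // 4096      # 16

bit_start = 0
visited = []
while bit_start < bits_per_page:
    visited.append(bit_start)
    bit_start += sectors_per_node      # same step in both cases

print(visited)   # [0, 4, 8, 12], mapping back to eb->start 0, 16k, 32k, 48k
```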