* [PATCH] btrfs: adjust subpage bit start based on sectorsize
@ 2025-04-21 13:37 Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
0 siblings, 2 replies; 16+ messages in thread
From: Josef Bacik @ 2025-04-21 13:37 UTC
To: linux-btrfs, kernel-team; +Cc: stable, Boris Burkov
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier, this is how everything looks:
[0 16k 32k 48k ] logical address
[0 4 8 12 ] radix tree offset
[ 64k page ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty. We increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation:
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.
cc: stable@vger.kernel.org
Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
- Further testing indicated that the page tagging theoretical race isn't getting
hit in practice, so we're going to limit the "hotfix" to this specific patch,
and then send subsequent patches to address the other issues we're hitting. My
simplify metadata writeback patches are the more holistic fix.
fs/btrfs/extent_io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5f08615b334f..6cfd286b8bbc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
subpage->bitmaps)) {
spin_unlock_irqrestore(&subpage->lock, flags);
spin_unlock(&folio->mapping->i_private_lock);
- bit_start++;
+ bit_start += sectors_per_node;
continue;
}
--
2.48.1
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-21 13:37 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
@ 2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-21 20:17 UTC
To: Josef Bacik, linux-btrfs, kernel-team; +Cc: stable, Boris Burkov
On 2025/4/21 23:07, Josef Bacik wrote:
> When running machines with 64k page size and a 16k nodesize we started
> seeing tree log corruption in production. This turned out to be because
> we were not writing out dirty blocks sometimes, so this in fact affects
> all metadata writes.
>
> When writing out a subpage EB we scan the subpage bitmap for a dirty
> range. If the range isn't dirty we do
>
> bit_start++;
>
> to move onto the next bit. The problem is the bitmap is based on the
> number of sectors that an EB has. So in this case, we have a 64k
> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> bits for every node. With a 64k page size we end up with 4 nodes per
> page.
>
> To make this easier this is how everything looks
>
> [0 16k 32k 48k ] logical address
> [0 4 8 12 ] radix tree offset
> [ 64k page ] folio
> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> [ | | | | | | | | | | | | | | | | ] bitmap
>
> Now we use all of our addressing based on fs_info->sectorsize_bits, so
> as you can see above, our 16k eb->start turns into radix entry 4.
>
> When we find a dirty range for our eb, we correctly do bit_start +=
> sectors_per_node, because if we start at bit 0, the next bit for the
> next eb is 4, to correspond to eb->start 16k.
>
> However if our range is clean, we will do bit_start++, which will now
> put us offset from our radix tree entries.
>
> In our case, assume that the first time we check the bitmap the block is
> not dirty, we increment bit_start so now it == 1, and then we loop
> around and check again. This time it is dirty, and we go to find that
> start using the following equation
>
> start = folio_start + bit_start * fs_info->sectorsize;
>
> so in the case above, eb->start 0 is now dirty, and we calculate start
> as
>
> 0 + 1 * fs_info->sectorsize = 4096
> 4096 >> 12 = 1
>
> Now we're looking up the radix tree for 1, and we won't find an eb.
> What's worse is now we're using bit_start == 1, so we do bit_start +=
> sectors_per_node, which is now 5. If that eb is dirty we will run into
> the same thing, we will look at an offset that is not populated in the
> radix tree, and now we're skipping the writeout of dirty extent buffers.
>
> The best fix for this is to not use sectorsize_bits to address nodes,
> but that's a larger change. Since this is a fs corruption problem fix
> it simply by always using sectors_per_node to increment the start bit.
>
> cc: stable@vger.kernel.org
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
I'm still a little concerned about unaligned tree blocks, but in
practice there should rarely be any converted btrfs from the v4.6 era
that hasn't received a full balance and is still in use.
So let's get the fix done and backported first, then reject unaligned
tree blocks completely.
Thanks,
Qu
> ---
> - Further testing indicated that the page tagging theoretical race isn't getting
> hit in practice, so we're going to limit the "hotfix" to this specific patch,
> and then send subsequent patches to address the other issues we're hitting. My
> simplify metadata writeback patches are the more holistic fix.
>
> fs/btrfs/extent_io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5f08615b334f..6cfd286b8bbc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> subpage->bitmaps)) {
> spin_unlock_irqrestore(&subpage->lock, flags);
> spin_unlock(&folio->mapping->i_private_lock);
> - bit_start++;
> + bit_start += sectors_per_node;
> continue;
> }
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-21 13:37 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
@ 2025-04-24 10:40 ` Daniel Vacek
2025-04-24 20:57 ` Qu Wenruo
1 sibling, 1 reply; 16+ messages in thread
From: Daniel Vacek @ 2025-04-24 10:40 UTC
To: Josef Bacik; +Cc: linux-btrfs, kernel-team, stable, Boris Burkov
On Mon, 21 Apr 2025 at 15:38, Josef Bacik <josef@toxicpanda.com> wrote:
>
> When running machines with 64k page size and a 16k nodesize we started
> seeing tree log corruption in production. This turned out to be because
> we were not writing out dirty blocks sometimes, so this in fact affects
> all metadata writes.
>
> When writing out a subpage EB we scan the subpage bitmap for a dirty
> range. If the range isn't dirty we do
>
> bit_start++;
>
> to move onto the next bit. The problem is the bitmap is based on the
> number of sectors that an EB has. So in this case, we have a 64k
> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> bits for every node. With a 64k page size we end up with 4 nodes per
> page.
>
> To make this easier this is how everything looks
>
> [0 16k 32k 48k ] logical address
> [0 4 8 12 ] radix tree offset
> [ 64k page ] folio
> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> [ | | | | | | | | | | | | | | | | ] bitmap
>
> Now we use all of our addressing based on fs_info->sectorsize_bits, so
> as you can see above, our 16k eb->start turns into radix entry 4.
Btw, unrelated to this patch, but this way we're using at best only
25% of the tree slots. Or in other words, we're wasting 75% of the
memory here. We should rather use eb->start / fs_info->nodesize for
the tree index.
On a related note, why do we need a copy of nodesize in eb->len? We
can always use eb->fs_info->nodesize if needed.
> When we find a dirty range for our eb, we correctly do bit_start +=
> sectors_per_node, because if we start at bit 0, the next bit for the
> next eb is 4, to correspond to eb->start 16k.
>
> However if our range is clean, we will do bit_start++, which will now
> put us offset from our radix tree entries.
>
> In our case, assume that the first time we check the bitmap the block is
> not dirty, we increment bit_start so now it == 1, and then we loop
> around and check again. This time it is dirty, and we go to find that
> start using the following equation
>
> start = folio_start + bit_start * fs_info->sectorsize;
>
> so in the case above, eb->start 0 is now dirty, and we calculate start
> as
>
> 0 + 1 * fs_info->sectorsize = 4096
> 4096 >> 12 = 1
>
> Now we're looking up the radix tree for 1, and we won't find an eb.
> What's worse is now we're using bit_start == 1, so we do bit_start +=
> sectors_per_node, which is now 5. If that eb is dirty we will run into
> the same thing, we will look at an offset that is not populated in the
> radix tree, and now we're skipping the writeout of dirty extent buffers.
>
> The best fix for this is to not use sectorsize_bits to address nodes,
> but that's a larger change. Since this is a fs corruption problem fix
> it simply by always using sectors_per_node to increment the start bit.
>
> cc: stable@vger.kernel.org
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> - Further testing indicated that the page tagging theoretical race isn't getting
> hit in practice, so we're going to limit the "hotfix" to this specific patch,
> and then send subsequent patches to address the other issues we're hitting. My
> simplify metadata writeback patches are the more holistic fix.
>
> fs/btrfs/extent_io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5f08615b334f..6cfd286b8bbc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> subpage->bitmaps)) {
> spin_unlock_irqrestore(&subpage->lock, flags);
> spin_unlock(&folio->mapping->i_private_lock);
> - bit_start++;
> + bit_start += sectors_per_node;
> continue;
> }
>
> --
> 2.48.1
>
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-24 10:40 ` Daniel Vacek
@ 2025-04-24 20:57 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-24 20:57 UTC
To: Daniel Vacek, Josef Bacik; +Cc: linux-btrfs, kernel-team, stable, Boris Burkov
On 2025/4/24 20:10, Daniel Vacek wrote:
> On Mon, 21 Apr 2025 at 15:38, Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> When running machines with 64k page size and a 16k nodesize we started
>> seeing tree log corruption in production. This turned out to be because
>> we were not writing out dirty blocks sometimes, so this in fact affects
>> all metadata writes.
>>
>> When writing out a subpage EB we scan the subpage bitmap for a dirty
>> range. If the range isn't dirty we do
>>
>> bit_start++;
>>
>> to move onto the next bit. The problem is the bitmap is based on the
>> number of sectors that an EB has. So in this case, we have a 64k
>> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
>> bits for every node. With a 64k page size we end up with 4 nodes per
>> page.
>>
>> To make this easier this is how everything looks
>>
>> [0 16k 32k 48k ] logical address
>> [0 4 8 12 ] radix tree offset
>> [ 64k page ] folio
>> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
>> [ | | | | | | | | | | | | | | | | ] bitmap
>>
>> Now we use all of our addressing based on fs_info->sectorsize_bits, so
>> as you can see above, our 16k eb->start turns into radix entry 4.
>
> Btw, unrelated to this patch - but this way we're using at best only
> 25% of the tree slots. Or in other words we're wasting 75% of the
> memory here. We should rather use eb->start / fs_info->nodesize for
> the tree index.
That requires all tree blocks to be nodesize aligned.
We're already working in that direction, but we need to be cautious
about rejecting those non-nodesize-aligned tree blocks, as end users
won't be happy if their filesystems no longer mount after a kernel
update.
So as usual: a kernel warning message first, then btrfs check reporting
it as an error, then experimental rejection, and finally pushing it to
end users.
>
> And by the other way - why do we need a copy of nodesize in eb->len?
> We can always eb->fs_info->nodesize if needed.
Because there were pseudo "extent buffers" in the past, e.g. for
accessing super blocks using extent buffer helpers.
In that case we needed a length equal to the super block size rather
than the node size.
But that is no longer the case, so feel free to clean it up.
Thanks,
Qu
>
>> When we find a dirty range for our eb, we correctly do bit_start +=
>> sectors_per_node, because if we start at bit 0, the next bit for the
>> next eb is 4, to correspond to eb->start 16k.
>>
>> However if our range is clean, we will do bit_start++, which will now
>> put us offset from our radix tree entries.
>>
>> In our case, assume that the first time we check the bitmap the block is
>> not dirty, we increment bit_start so now it == 1, and then we loop
>> around and check again. This time it is dirty, and we go to find that
>> start using the following equation
>>
>> start = folio_start + bit_start * fs_info->sectorsize;
>>
>> so in the case above, eb->start 0 is now dirty, and we calculate start
>> as
>>
>> 0 + 1 * fs_info->sectorsize = 4096
>> 4096 >> 12 = 1
>>
>> Now we're looking up the radix tree for 1, and we won't find an eb.
>> What's worse is now we're using bit_start == 1, so we do bit_start +=
>> sectors_per_node, which is now 5. If that eb is dirty we will run into
>> the same thing, we will look at an offset that is not populated in the
>> radix tree, and now we're skipping the writeout of dirty extent buffers.
>>
>> The best fix for this is to not use sectorsize_bits to address nodes,
>> but that's a larger change. Since this is a fs corruption problem fix
>> it simply by always using sectors_per_node to increment the start bit.
>>
>> cc: stable@vger.kernel.org
>> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
>> Reviewed-by: Boris Burkov <boris@bur.io>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> - Further testing indicated that the page tagging theoretical race isn't getting
>> hit in practice, so we're going to limit the "hotfix" to this specific patch,
>> and then send subsequent patches to address the other issues we're hitting. My
>> simplify metadata writeback patches are the more holistic fix.
>>
>> fs/btrfs/extent_io.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5f08615b334f..6cfd286b8bbc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
>> subpage->bitmaps)) {
>> spin_unlock_irqrestore(&subpage->lock, flags);
>> spin_unlock(&folio->mapping->i_private_lock);
>> - bit_start++;
>> + bit_start += sectors_per_node;
>> continue;
>> }
>>
>> --
>> 2.48.1
>>
>>
>
* [PATCH] btrfs: adjust subpage bit start based on sectorsize
@ 2025-04-14 19:08 Josef Bacik
2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
0 siblings, 2 replies; 16+ messages in thread
From: Josef Bacik @ 2025-04-14 19:08 UTC
To: linux-btrfs, kernel-team
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier, this is how everything looks:
[0 16k 32k 48k ] logical address
[0 4 8 12 ] radix tree offset
[ 64k page ] folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
[ | | | | | | | | | | | | | | | | ] bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us offset from our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty. We increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation:
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.
Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent_io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5f08615b334f..6cfd286b8bbc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
subpage->bitmaps)) {
spin_unlock_irqrestore(&subpage->lock, flags);
spin_unlock(&folio->mapping->i_private_lock);
- bit_start++;
+ bit_start += sectors_per_node;
continue;
}
--
2.48.1
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 19:08 Josef Bacik
@ 2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
1 sibling, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-04-14 19:54 UTC
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Mon, Apr 14, 2025 at 03:08:16PM -0400, Josef Bacik wrote:
> When running machines with 64k page size and a 16k nodesize we started
> seeing tree log corruption in production. This turned out to be because
> we were not writing out dirty blocks sometimes, so this in fact affects
> all metadata writes.
>
> When writing out a subpage EB we scan the subpage bitmap for a dirty
> range. If the range isn't dirty we do
>
> bit_start++;
>
> to move onto the next bit. The problem is the bitmap is based on the
> number of sectors that an EB has. So in this case, we have a 64k
> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> bits for every node. With a 64k page size we end up with 4 nodes per
> page.
>
> To make this easier this is how everything looks
>
> [0 16k 32k 48k ] logical address
> [0 4 8 12 ] radix tree offset
> [ 64k page ] folio
> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> [ | | | | | | | | | | | | | | | | ] bitmap
>
> Now we use all of our addressing based on fs_info->sectorsize_bits, so
> as you can see above, our 16k eb->start turns into radix entry 4.
>
> When we find a dirty range for our eb, we correctly do bit_start +=
> sectors_per_node, because if we start at bit 0, the next bit for the
> next eb is 4, to correspond to eb->start 16k.
>
> However if our range is clean, we will do bit_start++, which will now
> put us offset from our radix tree entries.
>
> In our case, assume that the first time we check the bitmap the block is
> not dirty, we increment bit_start so now it == 1, and then we loop
> around and check again. This time it is dirty, and we go to find that
> start using the following equation
>
> start = folio_start + bit_start * fs_info->sectorsize;
>
> so in the case above, eb->start 0 is now dirty, and we calculate start
> as
>
> 0 + 1 * fs_info->sectorsize = 4096
> 4096 >> 12 = 1
>
> Now we're looking up the radix tree for 1, and we won't find an eb.
> What's worse is now we're using bit_start == 1, so we do bit_start +=
> sectors_per_node, which is now 5. If that eb is dirty we will run into
> the same thing, we will look at an offset that is not populated in the
> radix tree, and now we're skipping the writeout of dirty extent buffers.
>
> The best fix for this is to not use sectorsize_bits to address nodes,
> but that's a larger change. Since this is a fs corruption problem fix
> it simply by always using sectors_per_node to increment the start bit.
>
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
> fs/btrfs/extent_io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5f08615b334f..6cfd286b8bbc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> subpage->bitmaps)) {
> spin_unlock_irqrestore(&subpage->lock, flags);
> spin_unlock(&folio->mapping->i_private_lock);
> - bit_start++;
> + bit_start += sectors_per_node;
> continue;
> }
>
> --
> 2.48.1
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 19:08 Josef Bacik
2025-04-14 19:54 ` Boris Burkov
@ 2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37 ` Qu Wenruo
2025-04-15 17:23 ` Josef Bacik
1 sibling, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-14 22:08 UTC
To: Josef Bacik, linux-btrfs, kernel-team
On 2025/4/15 04:38, Josef Bacik wrote:
> When running machines with 64k page size and a 16k nodesize we started
> seeing tree log corruption in production. This turned out to be because
> we were not writing out dirty blocks sometimes, so this in fact affects
> all metadata writes.
>
> When writing out a subpage EB we scan the subpage bitmap for a dirty
> range. If the range isn't dirty we do
>
> bit_start++;
>
> to move onto the next bit. The problem is the bitmap is based on the
> number of sectors that an EB has. So in this case, we have a 64k
> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> bits for every node. With a 64k page size we end up with 4 nodes per
> page.
>
> To make this easier this is how everything looks
>
> [0 16k 32k 48k ] logical address
> [0 4 8 12 ] radix tree offset
> [ 64k page ] folio
> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> [ | | | | | | | | | | | | | | | | ] bitmap
>
> Now we use all of our addressing based on fs_info->sectorsize_bits, so
> as you can see above, our 16k eb->start turns into radix entry 4.
>
> When we find a dirty range for our eb, we correctly do bit_start +=
> sectors_per_node, because if we start at bit 0, the next bit for the
> next eb is 4, to correspond to eb->start 16k.
>
> However if our range is clean, we will do bit_start++, which will now
> put us offset from our radix tree entries.
>
> In our case, assume that the first time we check the bitmap the block is
> not dirty, we increment bit_start so now it == 1, and then we loop
> around and check again. This time it is dirty, and we go to find that
> start using the following equation
>
> start = folio_start + bit_start * fs_info->sectorsize;
>
> so in the case above, eb->start 0 is now dirty, and we calculate start
> as
>
> 0 + 1 * fs_info->sectorsize = 4096
> 4096 >> 12 = 1
>
> Now we're looking up the radix tree for 1, and we won't find an eb.
> What's worse is now we're using bit_start == 1, so we do bit_start +=
> sectors_per_node, which is now 5. If that eb is dirty we will run into
> the same thing, we will look at an offset that is not populated in the
> radix tree, and now we're skipping the writeout of dirty extent buffers.
>
> The best fix for this is to not use sectorsize_bits to address nodes,
> but that's a larger change. Since this is a fs corruption problem fix
> it simply by always using sectors_per_node to increment the start bit.
>
> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/extent_io.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5f08615b334f..6cfd286b8bbc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> subpage->bitmaps)) {
> spin_unlock_irqrestore(&subpage->lock, flags);
> spin_unlock(&folio->mapping->i_private_lock);
> - bit_start++;
> + bit_start += sectors_per_node;
The problem is, we cannot ensure all extent buffers are nodesize aligned.
If we have an eb whose bytenr is only block aligned but not node size
aligned, this will lead to the same problem.
We need an extra check to reject tree blocks which are not node size
aligned, which is another big change and not suitable for a quick fix.
Can we do a gang radix tree lookup for the involved ranges that can
cover the block, then increase bit_start to the end of the found eb instead?
Thanks,
Qu
> continue;
> }
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 22:08 ` Qu Wenruo
@ 2025-04-14 22:37 ` Qu Wenruo
2025-04-15 16:16 ` Boris Burkov
2025-04-15 17:25 ` Josef Bacik
2025-04-15 17:23 ` Josef Bacik
1 sibling, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-14 22:37 UTC
To: Josef Bacik, linux-btrfs, kernel-team
On 2025/4/15 07:38, Qu Wenruo wrote:
>
>
> On 2025/4/15 04:38, Josef Bacik wrote:
>> When running machines with 64k page size and a 16k nodesize we started
>> seeing tree log corruption in production. This turned out to be because
>> we were not writing out dirty blocks sometimes, so this in fact affects
>> all metadata writes.
>>
>> When writing out a subpage EB we scan the subpage bitmap for a dirty
>> range. If the range isn't dirty we do
>>
>> bit_start++;
>>
>> to move onto the next bit. The problem is the bitmap is based on the
>> number of sectors that an EB has. So in this case, we have a 64k
>> pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
>> bits for every node. With a 64k page size we end up with 4 nodes per
>> page.
>>
>> To make this easier this is how everything looks
>>
>> [0 16k 32k 48k ] logical address
>> [0 4 8 12 ] radix tree offset
>> [ 64k page ] folio
>> [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
>> [ | | | | | | | | | | | | | | | | ] bitmap
>>
>> Now we use all of our addressing based on fs_info->sectorsize_bits, so
>> as you can see above, our 16k eb->start turns into radix entry 4.
>>
>> When we find a dirty range for our eb, we correctly do bit_start +=
>> sectors_per_node, because if we start at bit 0, the next bit for the
>> next eb is 4, to correspond to eb->start 16k.
>>
>> However if our range is clean, we will do bit_start++, which will now
>> put us offset from our radix tree entries.
>>
>> In our case, assume that the first time we check the bitmap the block is
>> not dirty, we increment bit_start so now it == 1, and then we loop
>> around and check again. This time it is dirty, and we go to find that
>> start using the following equation
>>
>> start = folio_start + bit_start * fs_info->sectorsize;
>>
>> so in the case above, eb->start 0 is now dirty, and we calculate start
>> as
>>
>> 0 + 1 * fs_info->sectorsize = 4096
>> 4096 >> 12 = 1
>>
>> Now we're looking up the radix tree for 1, and we won't find an eb.
>> What's worse is now we're using bit_start == 1, so we do bit_start +=
>> sectors_per_node, which is now 5. If that eb is dirty we will run into
>> the same thing, we will look at an offset that is not populated in the
>> radix tree, and now we're skipping the writeout of dirty extent buffers.
>>
>> The best fix for this is to not use sectorsize_bits to address nodes,
>> but that's a larger change. Since this is a fs corruption problem fix
>> it simply by always using sectors_per_node to increment the start bit.
>>
>> Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a
>> subpage metadata page")
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>> fs/btrfs/extent_io.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5f08615b334f..6cfd286b8bbc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio
>> *folio, struct writeback_control *wbc)
>> subpage->bitmaps)) {
>> spin_unlock_irqrestore(&subpage->lock, flags);
>> spin_unlock(&folio->mapping->i_private_lock);
>> - bit_start++;
>> + bit_start += sectors_per_node;
>
> The problem is, we cannot ensure all extent buffers are nodesize aligned.
>
> If we have an eb whose bytenr is only block aligned but not node size
> aligned, this will lead to the same problem.
>
> We need an extra check to reject tree blocks which are not node size
> aligned, which is another big change and not suitable for a quick fix.
>
>
> Can we do a gang radix tree lookup for the involved ranges that can
> cover the block, then increase bit_start to the end of the found eb
> instead?
In fact, I think it's better to fix both this and the missing eb write
bugs together in one go.
With something like this:
static struct extent_buffer *find_subpage_dirty_subpage(struct folio *folio)
{
	struct extent_buffer *gang[BTRFS_MAX_EB_SIZE / MIN_BLOCKSIZE];
	struct extent_buffer *ret = NULL;
	int found;

	rcu_read_lock();
	found = radix_tree_gang_lookup(/* ... the folio's range ... */);
	for (int i = 0; i < found; i++) {
		struct extent_buffer *eb = gang[i];

		if (eb && atomic_inc_not_zero(&eb->refs)) {
			if (!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
				atomic_dec(&eb->refs);
				continue;
			}
			ret = eb;
			break;
		}
	}
	rcu_read_unlock();
	return ret;
}
And make submit_eb_subpage() no longer rely on the subpage dirty
bitmap, but use the above helper to grab any dirty ebs.
By this, we fix both bugs by:
- No more bitmap search
So no increment mismatch, and we can still handle unaligned ebs (as
long as they don't cross the page boundary).
- No more missing writeback
As the gang lookup is always for the whole folio, and we always test
eb dirty flags, we should always catch dirty ebs in the folio.
Thanks,
Qu
>
> Thanks,
> Qu
>
>> continue;
>> }
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 22:37 ` Qu Wenruo
@ 2025-04-15 16:16 ` Boris Burkov
2025-04-15 23:49 ` Qu Wenruo
2025-04-15 17:25 ` Josef Bacik
1 sibling, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2025-04-15 16:16 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Josef Bacik, linux-btrfs, kernel-team
On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/15 07:38, Qu Wenruo 写道:
> >
> >
> > 在 2025/4/15 04:38, Josef Bacik 写道:
> > > When running machines with 64k page size and a 16k nodesize we started
> > > seeing tree log corruption in production. This turned out to be because
> > > we were not writing out dirty blocks sometimes, so this in fact affects
> > > all metadata writes.
> > >
> > > When writing out a subpage EB we scan the subpage bitmap for a dirty
> > > range. If the range isn't dirty we do
> > >
> > > bit_start++;
> > >
> > > to move onto the next bit. The problem is the bitmap is based on the
> > > number of sectors that an EB has. So in this case, we have a 64k
> > > pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> > > bits for every node. With a 64k page size we end up with 4 nodes per
> > > page.
> > >
> > > To make this easier this is how everything looks
> > >
> > > [0 16k 32k 48k ] logical address
> > > [0 4 8 12 ] radix tree offset
> > > [ 64k page ] folio
> > > [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> > > [ | | | | | | | | | | | | | | | | ] bitmap
> > >
> > > Now we use all of our addressing based on fs_info->sectorsize_bits, so
> > > as you can see the above our 16k eb->start turns into radix entry 4.
> > >
> > > When we find a dirty range for our eb, we correctly do bit_start +=
> > > sectors_per_node, because if we start at bit 0, the next bit for the
> > > next eb is 4, to correspond to eb->start 16k.
> > >
> > > However if our range is clean, we will do bit_start++, which will now
> > > put us offset from our radix tree entries.
> > >
> > > In our case, assume that the first time we check the bitmap the block is
> > > not dirty, we increment bit_start so now it == 1, and then we loop
> > > around and check again. This time it is dirty, and we go to find that
> > > start using the following equation
> > >
> > > start = folio_start + bit_start * fs_info->sectorsize;
> > >
> > > so in the case above, eb->start 0 is now dirty, and we calculate start
> > > as
> > >
> > > 0 + 1 * fs_info->sectorsize = 4096
> > > 4096 >> 12 = 1
> > >
> > > Now we're looking up the radix tree for 1, and we won't find an eb.
> > > What's worse is now we're using bit_start == 1, so we do bit_start +=
> > > sectors_per_node, which is now 5. If that eb is dirty we will run into
> > > the same thing, we will look at an offset that is not populated in the
> > > radix tree, and now we're skipping the writeout of dirty extent buffers.
> > >
> > > The best fix for this is to not use sectorsize_bits to address nodes,
> > > but that's a larger change. Since this is a fs corruption problem fix
> > > it simply by always using sectors_per_node to increment the start bit.
> > >
> > > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a
> > > subpage metadata page")
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > > fs/btrfs/extent_io.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 5f08615b334f..6cfd286b8bbc 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio
> > > *folio, struct writeback_control *wbc)
> > > subpage->bitmaps)) {
> > > spin_unlock_irqrestore(&subpage->lock, flags);
> > > spin_unlock(&folio->mapping->i_private_lock);
> > > - bit_start++;
> > > + bit_start += sectors_per_node;
> >
> > The problem is, we can not ensure all extent buffers are nodesize aligned.
> >
In theory because the allocator can do whatever it wants, or in practice
because of mixed block groups? It seems to me that in practice without
mixed block groups they ought to always be aligned.
> > If we have an eb whose bytenr is only block aligned but not node size
> > aligned, this will lead to the same problem.
> >
But then the existing code for the non-error path is broken, right?
How was this intended to work? Is there any correct way to loop over the
ebs in a folio when nodesize < pagesize? Your proposed gang lookup?
I guess to put my question a different way, was it intentional to mix
the increment size in the two codepaths in this function?
> > We need an extra check to reject tree blocks which are not node size
> > aligned, which is another big change and not suitable for a quick fix.
> >
> >
> > Can we do a gang radix tree lookup for the involved ranges that can
> > cover the block, then increase bit_start to the end of the found eb
> > instead?
>
> In fact, I think it's better to fix both this and the missing eb write
> bugs together in one go.
>
> With something like this:
>
> static struct extent_buffer *find_subpage_dirty_subpage(struct folio *folio)
> {
> 	struct extent_buffer *gang[BTRFS_MAX_EB_SIZE / MIN_BLOCKSIZE];
> 	struct extent_buffer *ret = NULL;
> 	int found;
>
> 	rcu_read_lock();
> 	found = radix_tree_gang_lookup(/* ... the folio's range ... */);
> 	for (int i = 0; i < found; i++) {
> 		struct extent_buffer *eb = gang[i];
>
> 		if (eb && atomic_inc_not_zero(&eb->refs)) {
> 			if (!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> 				atomic_dec(&eb->refs);
> 				continue;
> 			}
> 			ret = eb;
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> 	return ret;
> }
>
> And make submit_eb_subpage() no longer rely on the subpage dirty
> bitmap, but use the above helper to grab any dirty ebs.
>
> By this, we fix both bugs by:
>
> - No more bitmap search
> So no increment mismatch, and we can still handle unaligned ebs (as
> long as they don't cross the page boundary).
>
> - No more missing writeback
> As the gang lookup is always for the whole folio, and we always test
> eb dirty flags, we should always catch dirty ebs in the folio.
I don't see why this is the case. The race Josef fixed is quite narrow
but is fundamentally based on the TOWRITE mark getting cleared mid
subpage iteration.
If all we do is change the subpage bitmap scan to this gang lookup, but
still clear the TOWRITE tag whenever the first eb in the folio calls
meta_folio_set_writeback(), then it is possible for other threads to
come in and dirty a different eb, write it back, tag it TOWRITE, and
then lose the tag before the tagged lookup for TOWRITE folios runs.
Thanks,
Boris
>
> Thanks,
> Qu
>
> >
> > Thanks,
> > Qu
> >
> > > continue;
> > > }
> >
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-15 16:16 ` Boris Burkov
@ 2025-04-15 23:49 ` Qu Wenruo
2025-04-16 14:16 ` Josef Bacik
0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2025-04-15 23:49 UTC (permalink / raw)
To: Boris Burkov; +Cc: Josef Bacik, linux-btrfs, kernel-team
在 2025/4/16 01:46, Boris Burkov 写道:
> On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
[...]
>>>
>>> The problem is, we can not ensure all extent buffers are nodesize aligned.
>>>
>
> In theory because the allocator can do whatever it wants, or in practice
> because of mixed block groups? It seems to me that in practice without
> mixed block groups they ought to always be aligned.
Because we may have some old converted btrfs whose allocator may not
ensure a nodesize-aligned bytenr.
For subpage we can still support non-aligned tree blocks as long as
they do not cross the page boundary.
I know this is over-complicated and would prefer to reject them all,
but such a change will need quite some time to reach end users.
>
>>> If we have an eb whose bytenr is only block aligned but not node size
>>> aligned, this will lead to the same problem.
>>>
>
> But then the existing code for the non-error path is broken, right?
> How was this intended to work? Is there any correct way to loop over the
> ebs in a folio when nodesize < pagesize? Your proposed gang lookup?
>
> I guess to put my question a different way, was it intentional to mix
> the increment size in the two codepaths in this function?
Yes, that's intended, consider the following case:
32K page size, 16K node size.
0 4K 8K 16K 20K 24K 28K 32K
| |///////////////////| |
In the above case, at offsets 0 and 4K we don't find any dirty block
and skip to the next block.
At 8K we find an eb, submit it, and jump to 24K, which is the expected
behavior.
But on the other hand, if we increased the offset by 16K at offset 0,
we would never be able to grab the eb at 8K.
I know this is creepy, but I really did not have any better solution
than two different increment sizes at that time.
But for now, I believe the gang lookup should be way more accurate and
safer.
>
>>> We need an extra check to reject tree blocks which are not node size
>>> aligned, which is another big change and not suitable for a quick fix.
>>>
>>>
>>> Can we do a gang radix tree lookup for the involved ranges that can
>>> cover the block, then increase bit_start to the end of the found eb
>>> instead?
>>
>> In fact, I think it's better to fix both this and the missing eb write
>> bugs together in one go.
>>
>> With something like this:
>>
>> static struct extent_buffer *find_subpage_dirty_subpage(struct folio *folio)
>> {
>> 	struct extent_buffer *gang[BTRFS_MAX_EB_SIZE / MIN_BLOCKSIZE];
>> 	struct extent_buffer *ret = NULL;
>> 	int found;
>>
>> 	rcu_read_lock();
>> 	found = radix_tree_gang_lookup(/* ... the folio's range ... */);
>> 	for (int i = 0; i < found; i++) {
>> 		struct extent_buffer *eb = gang[i];
>>
>> 		if (eb && atomic_inc_not_zero(&eb->refs)) {
>> 			if (!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
>> 				atomic_dec(&eb->refs);
>> 				continue;
>> 			}
>> 			ret = eb;
>> 			break;
>> 		}
>> 	}
>> 	rcu_read_unlock();
>> 	return ret;
>> }
>>
>> And make submit_eb_subpage() no longer rely on the subpage dirty
>> bitmap, but use the above helper to grab any dirty ebs.
>>
>> By this, we fix both bugs by:
>>
>> - No more bitmap search
>> So no increment mismatch, and we can still handle unaligned ebs (as
>> long as they don't cross the page boundary).
>>
>> - No more missing writeback
>> As the gang lookup is always for the whole folio, and we always test
>> eb dirty flags, we should always catch dirty ebs in the folio.
>
> I don't see why this is the case. The race Josef fixed is quite narrow
> but is fundamentally based on the TOWRITE mark getting cleared mid
> subpage iteration.
>
> If all we do is change the subpage bitmap scan to this gang lookup, but
> still clear the TOWRITE tag whenever the first eb in the folio calls
> meta_folio_set_writeback(), then it is possible for other threads to
> come in and dirty a different eb, write it back, tag it TOWRITE, and
> then lose the tag before the tagged lookup for TOWRITE folios runs.
The point here is, we ensure all dirty ebs inside the folio will be
submitted (maybe except for error paths).
E.g. if there is initially one dirty eb, we do gang lookup, submit that
one (which clears the TOWRITE tag of the folio).
Then we will still do another gang lookup.
If a new eb in the folio is dirtied before that, we will find it and
submit it.
The gang lookup solution ensures that we only exit submit_eb_subpage()
with no dirty ebs in the folio.
Thanks,
Qu
>
> Thanks,
> Boris
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>> continue;
>>>> }
>>>
>>
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-15 23:49 ` Qu Wenruo
@ 2025-04-16 14:16 ` Josef Bacik
2025-04-16 22:07 ` Qu Wenruo
0 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2025-04-16 14:16 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Boris Burkov, linux-btrfs, kernel-team
On Wed, Apr 16, 2025 at 09:19:37AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/16 01:46, Boris Burkov 写道:
> > On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
> [...]
> > > >
> > > > The problem is, we can not ensure all extent buffers are nodesize aligned.
> > > >
> >
> > In theory because the allocator can do whatever it wants, or in practice
> > because of mixed block groups? It seems to me that in practice without
> > mixed block groups they ought to always be aligned.
>
> Because we may have some old converted btrfs whose allocator may not
> ensure a nodesize-aligned bytenr.
>
> For subpage we can still support non-aligned tree blocks as long as
> they do not cross the page boundary.
>
> I know this is over-complicated and would prefer to reject them all,
> but such a change will need quite some time to reach end users.
>
> >
> > > > If we have an eb whose bytenr is only block aligned but not node size
> > > > aligned, this will lead to the same problem.
> > > >
> >
> > But then the existing code for the non-error path is broken, right?
> > How was this intended to work? Is there any correct way to loop over the
> > ebs in a folio when nodesize < pagesize? Your proposed gang lookup?
> >
> > I guess to put my question a different way, was it intentional to mix
> > the increment size in the two codepaths in this function?
>
> Yes, that's intended, consider the following case:
>
> 32K page size, 16K node size.
>
> 0 4K 8K 16K 20K 24K 28K 32K
> | |///////////////////| |
>
> In the above case, at offsets 0 and 4K we don't find any dirty block
> and skip to the next block.
>
> At 8K we find an eb, submit it, and jump to 24K, which is the expected
> behavior.
>
> But on the other hand, if we increased the offset by 16K at offset 0,
> we would never be able to grab the eb at 8K.
Right, but this can't actually happen, can it? I mean it could happen
for the first eb, but as soon as we allocated eb->start 24k we would
fail check_eb_alignment() because it straddles a folio, and then the fs
would flip RO. So I don't think this is a problem in general. Thanks,
Josef
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-16 14:16 ` Josef Bacik
@ 2025-04-16 22:07 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-16 22:07 UTC (permalink / raw)
To: Josef Bacik; +Cc: Boris Burkov, linux-btrfs, kernel-team
在 2025/4/16 23:46, Josef Bacik 写道:
> On Wed, Apr 16, 2025 at 09:19:37AM +0930, Qu Wenruo wrote:
>>
>>
>> 在 2025/4/16 01:46, Boris Burkov 写道:
>>> On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
>> [...]
>>>>>
>>>>> The problem is, we can not ensure all extent buffers are nodesize aligned.
>>>>>
>>>
>>> In theory because the allocator can do whatever it wants, or in practice
>>> because of mixed block groups? It seems to me that in practice without
>>> mixed block groups they ought to always be aligned.
>>
>> Because we may have some old converted btrfs whose allocator may not
>> ensure a nodesize-aligned bytenr.
>>
>> For subpage we can still support non-aligned tree blocks as long as
>> they do not cross the page boundary.
>>
>> I know this is over-complicated and would prefer to reject them all,
>> but such a change will need quite some time to reach end users.
>>
>>>
>>>>> If we have an eb whose bytenr is only block aligned but not node size
>>>>> aligned, this will lead to the same problem.
>>>>>
>>>
>>> But then the existing code for the non-error path is broken, right?
>>> How was this intended to work? Is there any correct way to loop over the
>>> ebs in a folio when nodesize < pagesize? Your proposed gang lookup?
>>>
>>> I guess to put my question a different way, was it intentional to mix
>>> the increment size in the two codepaths in this function?
>>
>> Yes, that's intended, consider the following case:
>>
>> 32K page size, 16K node size.
>>
>> 0 4K 8K 16K 20K 24K 28K 32K
>> | |///////////////////| |
>>
>> In the above case, at offsets 0 and 4K we don't find any dirty block
>> and skip to the next block.
>>
>> At 8K we find an eb, submit it, and jump to 24K, which is the expected
>> behavior.
>>
>> But on the other hand, if we increased the offset by 16K at offset 0,
>> we would never be able to grab the eb at 8K.
>
> Right, but this can't actually happen, can it? I mean it could happen
> for the first eb, but as soon as we allocated eb->start 24k we would
> fail check_eb_alignment() because it straddles a folio, and then the fs
> would flip RO. So I don't think this is a problem in general. Thanks,
Yep, that's totally true.
Thus I'm fine to reject such ebs now.
Thanks,
Qu
>
> Josef
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 22:37 ` Qu Wenruo
2025-04-15 16:16 ` Boris Burkov
@ 2025-04-15 17:25 ` Josef Bacik
2025-04-16 0:08 ` Qu Wenruo
1 sibling, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2025-04-15 17:25 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/15 07:38, Qu Wenruo 写道:
> >
> >
> > 在 2025/4/15 04:38, Josef Bacik 写道:
> > > When running machines with 64k page size and a 16k nodesize we started
> > > seeing tree log corruption in production. This turned out to be because
> > > we were not writing out dirty blocks sometimes, so this in fact affects
> > > all metadata writes.
> > >
> > > When writing out a subpage EB we scan the subpage bitmap for a dirty
> > > range. If the range isn't dirty we do
> > >
> > > bit_start++;
> > >
> > > to move onto the next bit. The problem is the bitmap is based on the
> > > number of sectors that an EB has. So in this case, we have a 64k
> > > pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> > > bits for every node. With a 64k page size we end up with 4 nodes per
> > > page.
> > >
> > > To make this easier this is how everything looks
> > >
> > > [0 16k 32k 48k ] logical address
> > > [0 4 8 12 ] radix tree offset
> > > [ 64k page ] folio
> > > [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> > > [ | | | | | | | | | | | | | | | | ] bitmap
> > >
> > > Now we use all of our addressing based on fs_info->sectorsize_bits, so
> > > as you can see the above our 16k eb->start turns into radix entry 4.
> > >
> > > When we find a dirty range for our eb, we correctly do bit_start +=
> > > sectors_per_node, because if we start at bit 0, the next bit for the
> > > next eb is 4, to correspond to eb->start 16k.
> > >
> > > However if our range is clean, we will do bit_start++, which will now
> > > put us offset from our radix tree entries.
> > >
> > > In our case, assume that the first time we check the bitmap the block is
> > > not dirty, we increment bit_start so now it == 1, and then we loop
> > > around and check again. This time it is dirty, and we go to find that
> > > start using the following equation
> > >
> > > start = folio_start + bit_start * fs_info->sectorsize;
> > >
> > > so in the case above, eb->start 0 is now dirty, and we calculate start
> > > as
> > >
> > > 0 + 1 * fs_info->sectorsize = 4096
> > > 4096 >> 12 = 1
> > >
> > > Now we're looking up the radix tree for 1, and we won't find an eb.
> > > What's worse is now we're using bit_start == 1, so we do bit_start +=
> > > sectors_per_node, which is now 5. If that eb is dirty we will run into
> > > the same thing, we will look at an offset that is not populated in the
> > > radix tree, and now we're skipping the writeout of dirty extent buffers.
> > >
> > > The best fix for this is to not use sectorsize_bits to address nodes,
> > > but that's a larger change. Since this is a fs corruption problem fix
> > > it simply by always using sectors_per_node to increment the start bit.
> > >
> > > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a
> > > subpage metadata page")
> > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > ---
> > > fs/btrfs/extent_io.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 5f08615b334f..6cfd286b8bbc 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio
> > > *folio, struct writeback_control *wbc)
> > > subpage->bitmaps)) {
> > > spin_unlock_irqrestore(&subpage->lock, flags);
> > > spin_unlock(&folio->mapping->i_private_lock);
> > > - bit_start++;
> > > + bit_start += sectors_per_node;
> >
> > The problem is, we can not ensure all extent buffers are nodesize aligned.
> >
> > If we have an eb whose bytenr is only block aligned but not node size
> > aligned, this will lead to the same problem.
> >
> > We need an extra check to reject tree blocks which are not node size
> > aligned, which is another big change and not suitable for a quick fix.
> >
> >
> > Can we do a gang radix tree lookup for the involved ranges that can
> > cover the block, then increase bit_start to the end of the found eb
> > instead?
>
> In fact, I think it's better to fix both this and the missing eb write
> bugs together in one go.
>
> With something like this:
>
> static struct extent_buffer *find_subpage_dirty_subpage(struct folio *folio)
> {
> 	struct extent_buffer *gang[BTRFS_MAX_EB_SIZE / MIN_BLOCKSIZE];
> 	struct extent_buffer *ret = NULL;
> 	int found;
>
> 	rcu_read_lock();
> 	found = radix_tree_gang_lookup(/* ... the folio's range ... */);
> 	for (int i = 0; i < found; i++) {
> 		struct extent_buffer *eb = gang[i];
>
> 		if (eb && atomic_inc_not_zero(&eb->refs)) {
> 			if (!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
> 				atomic_dec(&eb->refs);
> 				continue;
> 			}
> 			ret = eb;
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> 	return ret;
> }
Again, I'm following up with a better solution for all of this. The current
patch needs to be pulled back to a ton of kernels, so this is targeted at fixing
the problem, and then we can make it look better with a series that has a longer
bake time. Thanks,
Josef
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-15 17:25 ` Josef Bacik
@ 2025-04-16 0:08 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-16 0:08 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
在 2025/4/16 02:55, Josef Bacik 写道:
> On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
[...]
>> return ret;
>> }
>
> Again, I'm following up with a better solution for all of this. The current
> patch needs to be pulled back to a ton of kernels, so this is targeted at fixing
> the problem, and then we can make it look better with a series that has a longer
> bake time. Thanks,
I'm not against the fix, but just want to point out that, the fix can
hit problems if not all ebs are node size aligned.
But after more consideration, I believe we can just push for a backport
that rejects unaligned tree blocks completely.
IIRC it's in the v4.4~v4.6 era, so it should be completely fine even for
the oldest 5.4 kernel to reject those ebs.
So I'm fine if you just add a small new patch to reject unaligned tree
blocks as a dependency, and push all those for v5.15+ kernels.
Thanks,
Qu
>
> Josef
>
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37 ` Qu Wenruo
@ 2025-04-15 17:23 ` Josef Bacik
2025-04-15 23:53 ` Qu Wenruo
1 sibling, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2025-04-15 17:23 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, kernel-team
On Tue, Apr 15, 2025 at 07:38:29AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/15 04:38, Josef Bacik 写道:
> > When running machines with 64k page size and a 16k nodesize we started
> > seeing tree log corruption in production. This turned out to be because
> > we were not writing out dirty blocks sometimes, so this in fact affects
> > all metadata writes.
> >
> > When writing out a subpage EB we scan the subpage bitmap for a dirty
> > range. If the range isn't dirty we do
> >
> > bit_start++;
> >
> > to move onto the next bit. The problem is the bitmap is based on the
> > number of sectors that an EB has. So in this case, we have a 64k
> > pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
> > bits for every node. With a 64k page size we end up with 4 nodes per
> > page.
> >
> > To make this easier this is how everything looks
> >
> > [0 16k 32k 48k ] logical address
> > [0 4 8 12 ] radix tree offset
> > [ 64k page ] folio
> > [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers
> > [ | | | | | | | | | | | | | | | | ] bitmap
> >
> > Now we use all of our addressing based on fs_info->sectorsize_bits, so
> > as you can see the above our 16k eb->start turns into radix entry 4.
> >
> > When we find a dirty range for our eb, we correctly do bit_start +=
> > sectors_per_node, because if we start at bit 0, the next bit for the
> > next eb is 4, to correspond to eb->start 16k.
> >
> > However if our range is clean, we will do bit_start++, which will now
> > put us offset from our radix tree entries.
> >
> > In our case, assume that the first time we check the bitmap the block is
> > not dirty, we increment bit_start so now it == 1, and then we loop
> > around and check again. This time it is dirty, and we go to find that
> > start using the following equation
> >
> > start = folio_start + bit_start * fs_info->sectorsize;
> >
> > so in the case above, eb->start 0 is now dirty, and we calculate start
> > as
> >
> > 0 + 1 * fs_info->sectorsize = 4096
> > 4096 >> 12 = 1
> >
> > Now we're looking up the radix tree for 1, and we won't find an eb.
> > What's worse is now we're using bit_start == 1, so we do bit_start +=
> > sectors_per_node, which is now 5. If that eb is dirty we will run into
> > the same thing, we will look at an offset that is not populated in the
> > radix tree, and now we're skipping the writeout of dirty extent buffers.
> >
> > The best fix for this is to not use sectorsize_bits to address nodes,
> > but that's a larger change. Since this is a fs corruption problem fix
> > it simply by always using sectors_per_node to increment the start bit.
> >
> > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/extent_io.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 5f08615b334f..6cfd286b8bbc 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2034,7 +2034,7 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> > subpage->bitmaps)) {
> > spin_unlock_irqrestore(&subpage->lock, flags);
> > spin_unlock(&folio->mapping->i_private_lock);
> > - bit_start++;
> > + bit_start += sectors_per_node;
>
> The problem is, we can not ensure all extent buffers are nodesize aligned.
>
> If we have an eb whose bytenr is only block aligned but not node size
> aligned, this will lead to the same problem.
>
> We need an extra check to reject tree blocks which are not node size
> aligned, which is another big change and not suitable for a quick fix.
We already have this.
>
>
> Can we do a gang radix tree lookup for the involved ranges that can cover
> the block, then increase bit_start to the end of the found eb instead?
That's a followup patch that I'm testing now. I've started with the simplest
fix so that they can be pulled back to all the affected kernels, and then I'm
following up with much more invasive changes to make these problems go away in
general. Thanks,
Josef
* Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
2025-04-15 17:23 ` Josef Bacik
@ 2025-04-15 23:53 ` Qu Wenruo
0 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-04-15 23:53 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
在 2025/4/16 02:53, Josef Bacik 写道:
> On Tue, Apr 15, 2025 at 07:38:29AM +0930, Qu Wenruo wrote:
>>
>>
[...]
>> The problem is, we can not ensure all extent buffers are nodesize aligned.
>>
>> If we have an eb whose bytenr is only block aligned but not node size
>> aligned, this will lead to the same problem.
>>
>> We need an extra check to reject tree blocks which are not node size
>> aligned, which is another big change and not suitable for a quick fix.
>
> We already have this.
The checks inside check_eb_alignment() only ensure the subpage eb will
not cross the page boundary, not strict nodesize alignment:

	if (fs_info->nodesize < PAGE_SIZE &&
	    offset_in_page(start) + fs_info->nodesize > PAGE_SIZE) {
		btrfs_err(fs_info,
		"tree block crosses page boundary, start %llu nodesize %u",
			  start, fs_info->nodesize);
		return true;
	}

In fact, it even allows such unaligned tree blocks, but with a warning:

	if (!IS_ALIGNED(start, fs_info->nodesize) &&
	    !test_and_set_bit(BTRFS_FS_UNALIGNED_TREE_BLOCK, &fs_info->flags)) {
		btrfs_warn(fs_info,
"tree block not nodesize aligned, start %llu nodesize %u, can be resolved by a full metadata balance",
			   start, fs_info->nodesize);
	}
Thanks,
Qu
>
>>
>>
>> Can we do a gang radix tree lookup for the involved ranges that can cover
>> the block, then increase bit_start to the end of the found eb instead?
>
> That's a followup patch that I'm testing now. I've started with the simplest
> fix so that they can be pulled back to all the affected kernels, and then I'm
> following up with much more invasive changes to make these problems go away in
> general. Thanks,
>
> Josef
>
end of thread, other threads:[~2025-04-24 20:57 UTC | newest]
Thread overview: 16+ messages
2025-04-21 13:37 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
2025-04-24 20:57 ` Qu Wenruo
-- strict thread matches above, loose matches on Subject: below --
2025-04-14 19:08 Josef Bacik
2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37 ` Qu Wenruo
2025-04-15 16:16 ` Boris Burkov
2025-04-15 23:49 ` Qu Wenruo
2025-04-16 14:16 ` Josef Bacik
2025-04-16 22:07 ` Qu Wenruo
2025-04-15 17:25 ` Josef Bacik
2025-04-16 0:08 ` Qu Wenruo
2025-04-15 17:23 ` Josef Bacik
2025-04-15 23:53 ` Qu Wenruo