Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Boris Burkov <boris@bur.io>
Cc: Josef Bacik <josef@toxicpanda.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: adjust subpage bit start based on sectorsize
Date: Wed, 16 Apr 2025 09:19:37 +0930	[thread overview]
Message-ID: <7ad4df86-866e-40ce-89a1-48f3c49aeeea@gmx.com> (raw)
In-Reply-To: <20250415161647.GA2164022@zen.localdomain>



在 2025/4/16 01:46, Boris Burkov 写道:
> On Tue, Apr 15, 2025 at 08:07:08AM +0930, Qu Wenruo wrote:
[...]
>>>
>>> The problem is, we can not ensure all extent buffers are nodesize aligned.
>>>
> 
> In theory because the allocator can do whatever it wants, or in practice
> because of mixed block groups? It seems to me that in practice without
> mixed block groups they ought to always be aligned.

Because we may have some old converted btrfs, whose allocater may not 
ensure nodesize aligned bytenr.

For subpage we can still support non-aligned tree blocks as long as they 
do not cross boundary.

I know this is over-complicated and prefer to reject them all, but such 
a change will need quite some time to reach end users.

> 
>>> If we have an eb whose bytenr is only block aligned but not node size
>>> aligned, this will lead to the same problem.
>>>
> 
> But then the existing code for the non error path is broken, right?
> How was this intended to work? Is there any correct way to loop over the
> ebs in a folio when nodesize < pagesize? Your proposed gang lookup?
> 
> I guess to put my question a different way, was it intentional to mix
> the increment size in the two codepaths in this function?

Yes, that's intended, consider the following case:

32K page size, 16K node size.

0    4K     8K    16K    20K    24K     28K      32K
|           |///////////////////|                |

In above case, for offset 0 and 4K, we didn't find any dirty block, and 
skip to next block.

For 8K, we found an eb, submit it, and jump to 24K, and that's the 
expected behavior.

But on the other hand, if at offset 0 we increase the offset by 16K, we 
will never be able to grab the eb at 8K.

I know this is creepy, but I really do not have any better solution than 
two different increment sizes at that time.

But for now, I believe the gang lookup should be way more accurate and 
safer.

> 
>>> We need an extra check to reject tree blocks which are not node size
>>> aligned, which is another big change and not suitable for a quick fix.
>>>
>>>
>>> Can we do a gang radix tree lookup for the involved ranges that can
>>> cover the block, then increase bit_start to the end of the found eb
>>> instead?
>>
>> In fact, I think it's better to fix both this and the missing eb write
>> bugs together in one go.
>>
>> With something like this:
>>
>> static int find_subpage_dirty_subpage(struct folio *folio)
>> {
>> 	struct extent_buffer *gang[BTRFS_MAX_EB_SIZE/MIN_BLOCKSIZE];
>> 	struct extent_buffer *ret = NULL;
>>
>> 	rcu_read_lock()
>> 	ret = radix_tree_gang_lookup();
>> 	for (int i = 0; i < ret; i++) {
>> 		if (eb && atomic_inc_not_zero(&eb->refs)) {
>> 			if (!test_bit(EXTENT_BUFFER_DIRTY) {
>> 				atomic_dec(&eb->refs);
>> 				continue;
>> 			}
>> 			ret = eb;
>> 			break;
>> 		}
>> 	}
>> 	rcu_read_unlock()
>> 	return ret;
>> }
>>
>> And make submit_eb_subpage() no longer relies on subpage dirty bitmap,
>> but above helper to grab any dirty ebs.
>>
>> By this, we fix both bugs by:
>>
>> - No more bitmap search
>>    So no increment mismatch, and can still handle unaligned one (as long
>>    as they don't cross page boundary).
>>
>> - No more missing writeback
>>    As the gang lookup is always for the whole folio, and we always test
>>    eb dirty flags, we should always catch dirty ebs in the folio.
> 
> I don't see why this is the case. The race Josef fixed is quite narrow
> but is fundamentally based on the TOWRITE mark getting cleared mid
> subpage iteration.
> 
> If all we do is change subpage bitmap to this gang lookup, but still
> clear the TOWRITE tag whenever the folio has the first eb call
> meta_folio_set_writeback(), then it is possible for other threads to
> come in and dirty a different eb, write it back, tag it TOWRITE, then
> lose the tag before doing the tagged lookup for TOWRITE folios.

The point here is, we ensure all dirty ebs inside the folio will be 
submitted (maybe except for error paths).

E.g. if there is initially one dirty eb, we do gang lookup, submit that 
one (which clears the TOWRITE tag of the folio).
Then we will still do another gang lookup.

If a new eb in the folio is dirtied before that, we will found it and 
submit it.

The gang lookup solution is to ensure, we only exit submit_eb_subpage() 
with no dirty ebs in the folio.

Thanks,
Qu

> 
> Thanks,
> Boris
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>>                continue;
>>>>            }
>>>
>>
>

next prev parent reply	other threads:[~2025-04-15 23:49 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-14 19:08 [PATCH] btrfs: adjust subpage bit start based on sectorsize Josef Bacik
2025-04-14 19:54 ` Boris Burkov
2025-04-14 22:08 ` Qu Wenruo
2025-04-14 22:37   ` Qu Wenruo
2025-04-15 16:16     ` Boris Burkov
2025-04-15 23:49       ` Qu Wenruo [this message]
2025-04-16 14:16         ` Josef Bacik
2025-04-16 22:07           ` Qu Wenruo
2025-04-15 17:25     ` Josef Bacik
2025-04-16  0:08       ` Qu Wenruo
2025-04-15 17:23   ` Josef Bacik
2025-04-15 23:53     ` Qu Wenruo
  -- strict thread matches above, loose matches on Subject: below --
2025-04-21 13:37 Josef Bacik
2025-04-21 20:17 ` Qu Wenruo
2025-04-24 10:40 ` Daniel Vacek
2025-04-24 20:57   ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7ad4df86-866e-40ce-89a1-48f3c49aeeea@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=boris@bur.io \
    --cc=josef@toxicpanda.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox