From: Qu Wenruo <wqu@suse.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned
Date: Wed, 26 Feb 2025 07:46:07 +1030 [thread overview]
Message-ID: <cf967519-c1ee-439e-90fb-6a945296dbba@suse.com> (raw)
In-Reply-To: <CAL3q7H7dOsKqYAKvZdzOoe4DZNJ28P2neBDopme=dFYT-PNn6g@mail.gmail.com>
在 2025/2/26 04:25, Filipe Manana 写道:
> On Sun, Feb 23, 2025 at 11:47 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> Since the support of block size (sector size) < page size for btrfs,
>> test case generic/563 fails with 4K block size and 64K page size:
>>
>> --- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
>> +++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
>> @@ -3,7 +3,8 @@
>> read is in range
>> write is in range
>> write -> read/write
>> -read is in range
>> +read has value of 8388608
>> +read is NOT in range -33792 .. 33792
>> write is in range
>> ...
>>
>> [CAUSE]
>> The test case creates a 8MiB file, then buffered write into the 8MiB
>> using 4K block size, to overwrite the whole file.
>>
>> On 4K page sized systems, since the write range covers the full block and
>> page, btrfs will no bother reading the page, just like what XFS and EXT4
>> do.
>>
>> But 64K page sized systems, although the 4K sized write is still block
>> aligned, it's not page aligned any more, thus btrfs will read the full
>> page, causing more read than expected and fail the test case.
>
> This part "causing more read than expected" got me puzzled, but it's explained
> in the "Fix" section below. It's more like "causing previously dirty
> blocks within the page to be zeroed out".
It's not exactly the same.
Before this patch, we will never allow a folio to be dirtied if it's not
uptodate (unless we're dirtying the full folio).
So there is no "previously dirty blocks to be zeroed out".
I have no better idea to explain the situation here, it's really about
if we need to read the folio before buffered write.
I'm super happy if you have a better expression here so that no one
would be confused.
Thanks,
Qu
>
>>
>> [FIX]
>> To skip the full page read, we need to do the following modification:
>>
>> - Do not trigger full page read as long as the buffered write is block
>> aligned
>> This is pretty simple by modifying the check inside
>> prepare_uptodate_page().
>>
>> - Skip already uptodate blocks during full page read
>> Or we can lead to the following data corruption:
>>
>> 0 32K 64K
>> |///////| |
>>
>> Where the file range [0, 32K) is dirtied by buffered write, the
>> remaining range [32K, 64K) is not.
>>
>> When reading the full page, since [0,32K) is only dirtied but not
>> written back, there is no data extent map for it, but a hole covering
>> [0, 64k).
>>
>> If we continue reading the full page range [0, 64K), the dirtied range
>> will be filled with 0 (since there is only a hole covering the whole
>> range).
>> This causes the dirtied range to get lost.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Looks good, thanks.
>
>> ---
>> fs/btrfs/extent_io.c | 4 ++++
>> fs/btrfs/file.c | 5 +++--
>> 2 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index b3a4a94212c9..d7240e295bfc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -977,6 +977,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
>> end_folio_read(folio, true, cur, end - cur + 1);
>> break;
>> }
>> + if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
>> + end_folio_read(folio, true, cur, blocksize);
>> + continue;
>> + }
>> em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
>> if (IS_ERR(em)) {
>> end_folio_read(folio, false, cur, end + 1 - cur);
>> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
>> index 00c68b7b2206..e3d63192281d 100644
>> --- a/fs/btrfs/file.c
>> +++ b/fs/btrfs/file.c
>> @@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
>> {
>> u64 clamp_start = max_t(u64, pos, folio_pos(folio));
>> u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
>> + const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
>> int ret = 0;
>>
>> if (folio_test_uptodate(folio))
>> return 0;
>>
>> if (!force_uptodate &&
>> - IS_ALIGNED(clamp_start, PAGE_SIZE) &&
>> - IS_ALIGNED(clamp_end, PAGE_SIZE))
>> + IS_ALIGNED(clamp_start, sectorsize) &&
>> + IS_ALIGNED(clamp_end, sectorsize))
>> return 0;
>>
>> ret = btrfs_read_folio(NULL, folio);
>> --
>> 2.48.1
>>
>>
next prev parent reply other threads:[~2025-02-25 21:16 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-23 23:46 [PATCH 0/7] btrfs: make subpage handling be feature full Qu Wenruo
2025-02-23 23:46 ` [PATCH 1/7] btrfs: prevent inline data extents read from touching blocks beyond its range Qu Wenruo
2025-02-25 15:07 ` Filipe Manana
2025-02-25 16:10 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 2/7] btrfs: fix the qgroup data free range for inline data extents Qu Wenruo
2025-02-25 15:11 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 3/7] btrfs: introduce a read path dedicated extent lock helper Qu Wenruo
2025-02-25 17:05 ` Filipe Manana
2025-02-25 21:12 ` Qu Wenruo
2025-02-25 23:44 ` Qu Wenruo
2025-02-26 11:07 ` Filipe Manana
2025-02-26 11:07 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 4/7] btrfs: make btrfs_do_readpage() to do block-by-block read Qu Wenruo
2025-02-25 17:09 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 5/7] btrfs: allow buffered write to avoid full page read if it's block aligned Qu Wenruo
2025-02-25 17:55 ` Filipe Manana
2025-02-25 21:16 ` Qu Wenruo [this message]
2025-02-23 23:46 ` [PATCH 6/7] btrfs: allow inline data extents creation if block size < page size Qu Wenruo
2025-02-25 18:03 ` Filipe Manana
2025-02-23 23:46 ` [PATCH 7/7] btrfs: remove the subpage related warning message Qu Wenruo
2025-02-25 18:10 ` Filipe Manana
2025-02-25 21:19 ` Qu Wenruo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cf967519-c1ee-439e-90fb-6a945296dbba@suse.com \
--to=wqu@suse.com \
--cc=fdmanana@kernel.org \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox