public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>,
	clm@fb.com, josef@toxicpanda.com, linux-btrfs@vger.kernel.org
Subject: [PATCH AUTOSEL 6.14 098/642] btrfs: allow buffered write to avoid full page read if it's block aligned
Date: Mon,  5 May 2025 18:05:14 -0400	[thread overview]
Message-ID: <20250505221419.2672473-98-sashal@kernel.org> (raw)
In-Reply-To: <20250505221419.2672473-1-sashal@kernel.org>

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 0d31ca6584f21821c708752d379871b9fce2dc48 ]

[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:

  --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
  +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
  @@ -3,7 +3,8 @@
   read is in range
   write is in range
   write -> read/write
  -read is in range
  +read has value of 8388608
  +read is NOT in range -33792 .. 33792
   write is in range
  ...

[CAUSE]
The test case creates a 8MiB file, then does buffered write into the 8MiB
using 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full block and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.

But on 64K page sized systems, although the 4K sized write is still block
aligned, it's not page aligned anymore, thus btrfs will read the full
page, which will be accounted by cgroup and fail the test.

As the test case itself expects such 4K block aligned write should not
trigger any read.

Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement such optimization.

[FIX]
To skip the full page read, we need to do the following modification:

- Do not trigger full page read as long as the buffered write is block
  aligned
  This is pretty simple by modifying the check inside
  prepare_uptodate_page().

- Skip already uptodate blocks during full page read
  Or we can lead to the following data corruption:

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by buffered write, the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0,32K) is only dirtied but not
  written back, there is no data extent map for it, but a hole covering
  [0, 64k).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.

With this optimization, btrfs can pass generic/563 even if the page size
is larger than fs block size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 4 ++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 039a0c36164e9..c01926f954fe1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -974,6 +974,10 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, iosize);
 			break;
 		}
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
 		em = get_extent_map(BTRFS_I(inode), folio, cur, end - cur + 1, em_cached);
 		if (IS_ERR(em)) {
 			end_folio_read(folio, false, cur, end + 1 - cur);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a92997a583bd2..9facda5583ec3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -804,14 +804,15 @@ static int prepare_uptodate_folio(struct inode *inode, struct folio *folio, u64
 {
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
+	const u32 blocksize = inode_to_fs_info(inode)->sectorsize;
 	int ret = 0;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, blocksize) &&
+	    IS_ALIGNED(clamp_end, blocksize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);
-- 
2.39.5


  parent reply	other threads:[~2025-05-05 22:18 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250505221419.2672473-1-sashal@kernel.org>
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 093/642] btrfs: make btrfs_discard_workfn() block_group ref explicit Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 094/642] btrfs: avoid linker error in btrfs_find_create_tree_block() Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 095/642] btrfs: run btrfs_error_commit_super() early Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 096/642] btrfs: fix non-empty delayed iputs list on unmount due to async workers Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 097/642] btrfs: properly limit inline data extent according to block size Sasha Levin
2025-05-05 22:05 ` Sasha Levin [this message]
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 099/642] btrfs: prevent inline data extents read from touching blocks beyond its range Sasha Levin
2025-05-06 13:19   ` David Sterba
2025-05-20 14:15     ` Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 100/642] btrfs: get zone unusable bytes while holding lock at btrfs_reclaim_bgs_work() Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 101/642] btrfs: send: return -ENAMETOOLONG when attempting a path that is too long Sasha Levin
2025-05-05 22:05 ` [PATCH AUTOSEL 6.14 102/642] btrfs: zoned: exit btrfs_can_activate_zone if BTRFS_FS_NEED_ZONE_FINISH is set Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250505221419.2672473-98-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=clm@fb.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox