linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/2] btrfs: fix generic/563 failures with sectorsize < PAGE_SIZE
@ 2024-10-02  1:52 Qu Wenruo
  2024-10-02  1:52 ` [PATCH 1/2] btrfs: make btrfs_do_readpage() do block-by-block read Qu Wenruo
  2024-10-02  1:52 ` [PATCH 2/2] btrfs: allow buffered write to skip full page if it's sector aligned Qu Wenruo
  0 siblings, 2 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-10-02  1:52 UTC (permalink / raw)
  To: linux-btrfs

The test case generic/563 always fails on 64K page sized systems with
btrfs, but for two different causes:

- sector size == page size
  In that case, the test case is not using an aligned block size, causing
  btrfs (and ext4) to read out the full page before the buffered write.

  This is fixed by the following patch:
  https://lore.kernel.org/linux-btrfs/20240929235038.24497-1-wqu@suse.com/

- sector size < page size
  This is a problem inside btrfs, which cannot handle partially uptodate
  pages; even if the buffered write covers a full sector, btrfs
  will still read the full page, causing unnecessary IO.
  This is btrfs specific; ext4/xfs do not have this problem.

This patchset fixes the sector size < page size case for btrfs, by:

- Changing btrfs_do_readpage() to do block-by-block read
  This allows btrfs to do per-sector uptodate check, not just relying on
  the extent map checks.

- Changing both the buffered read and write paths to handle partially
  uptodate pages
  For the buffered write path, it's pretty straightforward: just skip
  the full page read if the range covers full sectors.

  It's the buffered read path that needs the extra work, skipping any
  read into the already uptodate sectors.

  Otherwise we can have a dirtied range, and a read (from holes) would
  re-fill the dirtied range with 0, causing data corruption.

With these two patches, 4K sector sized btrfs on 64K page sized systems
can pass generic/563 now.

Although, considering I have hit data corruption due to the above read
path not skipping uptodate sectors, I'm wondering whether I should put
the full page read skip code behind the EXPERIMENTAL flag.

At least this only affects subpage cases; the more common x86_64
cases will not be affected.

Qu Wenruo (2):
  btrfs: make btrfs_do_readpage() do block-by-block read
  btrfs: allow buffered write to skip full page if it's sector aligned

 fs/btrfs/extent_io.c | 43 ++++++++++++++++++-------------------------
 fs/btrfs/file.c      |  5 +++--
 2 files changed, 21 insertions(+), 27 deletions(-)

-- 
2.46.2



* [PATCH 1/2] btrfs: make btrfs_do_readpage() do block-by-block read
  2024-10-02  1:52 [PATCH 0/2] btrfs: fix generic/563 failures with sectorsize < PAGE_SIZE Qu Wenruo
@ 2024-10-02  1:52 ` Qu Wenruo
  2024-10-02  1:52 ` [PATCH 2/2] btrfs: allow buffered write to skip full page if it's sector aligned Qu Wenruo
  1 sibling, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-10-02  1:52 UTC (permalink / raw)
  To: linux-btrfs

Currently if btrfs has sector size < page size, btrfs_do_readpage()
will handle the range extent by extent. This is good for performance, as
it doesn't need to re-lookup the same extent map again and again
(although __get_extent_map() already does an extra cached em check).

This is totally fine and a valid optimization, but it assumes that
there is no partially uptodate range in the page.

Meanwhile there is an incoming feature requiring btrfs to skip the full
page read if a buffered write range covers a full sector but not a full
page.

In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such a case.

So here we change btrfs_do_readpage() to do a block-by-block read; this
simplifies the following things:

- Remove the need for @iosize variable
  Because we just use sectorsize as our increment.

- Remove @pg_offset, and calculate it inside the loop when needed
  It's just offset_in_folio().

- Use a for() loop instead of a while() loop

This will slightly reduce the read performance for
sector size < page size cases, but for the future where we can skip the
full page read in a lot of cases, it should still be worth it.

For sector size == page size, this brings no performance change.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 37 ++++++++++++-------------------------
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9302fde9c464..09eb8a204375 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -952,9 +952,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	u64 block_start;
 	struct extent_map *em;
 	int ret = 0;
-	size_t pg_offset = 0;
-	size_t iosize;
-	size_t blocksize = fs_info->sectorsize;
+	const size_t blocksize = fs_info->sectorsize;
 
 	ret = set_folio_extent_mapped(folio);
 	if (ret < 0) {
@@ -965,23 +963,22 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	if (folio->index == last_byte >> folio_shift(folio)) {
 		size_t zero_offset = offset_in_folio(folio, last_byte);
 
-		if (zero_offset) {
-			iosize = folio_size(folio) - zero_offset;
-			folio_zero_range(folio, zero_offset, iosize);
-		}
+		if (zero_offset)
+			folio_zero_range(folio, zero_offset,
+					 folio_size(folio) - zero_offset);
 	}
 	bio_ctrl->end_io_func = end_bbio_data_read;
 	begin_folio_read(fs_info, folio);
-	while (cur <= end) {
+	for (cur = start; cur <= end; cur += blocksize) {
 		enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
+		unsigned long pg_offset = offset_in_folio(folio, cur);
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
 
 		ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
 		if (cur >= last_byte) {
-			iosize = folio_size(folio) - pg_offset;
-			folio_zero_range(folio, pg_offset, iosize);
-			end_folio_read(folio, true, cur, iosize);
+			folio_zero_range(folio, pg_offset, end - cur + 1);
+			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
 		em = __get_extent_map(inode, folio, cur, end - cur + 1,
@@ -996,8 +993,6 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		compress_type = extent_map_compression(em);
 
-		iosize = min(extent_map_end(em) - cur, end - cur + 1);
-		iosize = ALIGN(iosize, blocksize);
 		if (compress_type != BTRFS_COMPRESS_NONE)
 			disk_bytenr = em->disk_bytenr;
 		else
@@ -1053,18 +1048,13 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		/* we've found a hole, just zero and go on */
 		if (block_start == EXTENT_MAP_HOLE) {
-			folio_zero_range(folio, pg_offset, iosize);
-
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			folio_zero_range(folio, pg_offset, blocksize);
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 		/* the get_extent function already copied into the folio */
 		if (block_start == EXTENT_MAP_INLINE) {
-			end_folio_read(folio, true, cur, iosize);
-			cur = cur + iosize;
-			pg_offset += iosize;
+			end_folio_read(folio, true, cur, blocksize);
 			continue;
 		}
 
@@ -1075,12 +1065,9 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 
 		if (force_bio_submit)
 			submit_one_bio(bio_ctrl);
-		submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
+		submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
 				    pg_offset);
-		cur = cur + iosize;
-		pg_offset += iosize;
 	}
-
 	return 0;
 }
 
-- 
2.46.2



* [PATCH 2/2] btrfs: allow buffered write to skip full page if it's sector aligned
  2024-10-02  1:52 [PATCH 0/2] btrfs: fix generic/563 failures with sectorsize < PAGE_SIZE Qu Wenruo
  2024-10-02  1:52 ` [PATCH 1/2] btrfs: make btrfs_do_readpage() do block-by-block read Qu Wenruo
@ 2024-10-02  1:52 ` Qu Wenruo
  2024-10-09  2:57   ` Qu Wenruo
  1 sibling, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2024-10-02  1:52 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Since the introduction of sector size < page size support for btrfs,
test case generic/563 fails with a 4K sector size and 64K page size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
    ...

[CAUSE]
The test case creates an 8MiB file, then buffered writes into the 8MiB
range using a 4K block size, overwriting the whole file.

On 4K page sized systems, since the write range covers a full sector and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.

But on 64K page sized systems, although the write is sector aligned, it's
not page aligned, thus btrfs still goes through the full page alignment
check and reads the full page out.

This causes extra data reads, and fails the test case.

[FIX]
To skip the full page read, we need the following modifications:

- Do not trigger the full page read as long as the buffered write is
  sector aligned
  This is pretty simple, done by modifying the check inside
  prepare_uptodate_page().

- Skip already uptodate sectors during the full page read
  Otherwise we can hit the following data corruption:

  0       32K        64K
  |///////|          |

  Where the file range [0, 32K) is dirtied by a buffered write, and the
  remaining range [32K, 64K) is not.

  When reading the full page, since [0, 32K) is only dirtied but not
  written back, there is no data extent map for it, only a hole covering
  [0, 64K).

  If we continue reading the full page range [0, 64K), the dirtied range
  will be filled with 0 (since there is only a hole covering the whole
  range).
  This causes the dirtied range to get lost.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 6 ++++++
 fs/btrfs/file.c      | 5 +++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 09eb8a204375..ea118c89e365 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -981,6 +981,12 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 			end_folio_read(folio, true, cur, end - cur + 1);
 			break;
 		}
+
+		if (btrfs_folio_test_uptodate(fs_info, folio, cur, blocksize)) {
+			end_folio_read(folio, true, cur, blocksize);
+			continue;
+		}
+
 		em = __get_extent_map(inode, folio, cur, end - cur + 1,
 				      em_cached);
 		if (IS_ERR(em)) {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fe4c3b31447a..64e28ebd2d0b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -860,6 +860,7 @@ static int prepare_uptodate_page(struct inode *inode,
 				 struct page *page, u64 pos,
 				 u64 len, bool force_uptodate)
 {
+	const u32 sectorsize = inode_to_fs_info(inode)->sectorsize;
 	struct folio *folio = page_folio(page);
 	u64 clamp_start = max_t(u64, pos, folio_pos(folio));
 	u64 clamp_end = min_t(u64, pos + len, folio_pos(folio) + folio_size(folio));
@@ -869,8 +870,8 @@ static int prepare_uptodate_page(struct inode *inode,
 		return 0;
 
 	if (!force_uptodate &&
-	    IS_ALIGNED(clamp_start, PAGE_SIZE) &&
-	    IS_ALIGNED(clamp_end, PAGE_SIZE))
+	    IS_ALIGNED(clamp_start, sectorsize) &&
+	    IS_ALIGNED(clamp_end, sectorsize))
 		return 0;
 
 	ret = btrfs_read_folio(NULL, folio);
-- 
2.46.2



* Re: [PATCH 2/2] btrfs: allow buffered write to skip full page if it's sector aligned
  2024-10-02  1:52 ` [PATCH 2/2] btrfs: allow buffered write to skip full page if it's sector aligned Qu Wenruo
@ 2024-10-09  2:57   ` Qu Wenruo
  0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2024-10-09  2:57 UTC (permalink / raw)
  To: linux-btrfs



On 2024/10/2 11:22, Qu Wenruo wrote:
> [BUG]
> Since the support of sector size < page size for btrfs, test case
> generic/563 fails with 4K sector size and 64K page size:
> 
>      --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
>      +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
>      @@ -3,7 +3,8 @@
>       read is in range
>       write is in range
>       write -> read/write
>      -read is in range
>      +read has value of 8388608
>      +read is NOT in range -33792 .. 33792
>       write is in range
>      ...
> 
> [CAUSE]
> The test case creates a 8MiB file, then buffered write into the 8MiB
> using 4K block size, to overwrite the whole file.
> 
> On 4K page sized systems, since the write range covers the full sector and
> page, btrfs will no bother reading the page, just like what XFS and EXT4
> do.
> 
> But 64K page sized systems, although the write is sector aligned, it's
> not page aligned, thus btrfs still goes the full page alignment check,
> and read the full page out.
> 
> This causes extra data read, and fail the test case.
> 
> [FIX]
> To skip the full page read, we need to do the following modification:
> 
> - Do not trigger full page read as long as the buffered write is sector
>    aligned
>    This is pretty simple by modifying the check inside
>    prepare_uptodate_page().
> 
> - Skip already uptodate sectors during full page read
>    Or we can lead to the following data corruption:
> 
>    0       32K        64K
>    |///////|          |
> 
>    Where the file range [0, 32K) is dirtied by buffered write, the
>    remaining range [32K, 64K) is not.
> 
>    When reading the full page, since [0,32K) is only dirtied but not
>    written back, there is no data extent map for it, but a hole covering
>    [0, 64k).
> 
>    If we continue reading the full page range [0, 64K), the dirtied range
>    will be filled with 0 (since there is only a hole covering the whole
>    range).
>    This causes the dirtied range to get lost.

After more testing, this patch seems to cause generic/095 to hang,
caused by some folios not being properly unlocked.

I'll dig deeper to find out why this is happening.

Thanks,
Qu

