[PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures

public inbox for linux-btrfs@vger.kernel.org

* [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures
@ 2025-04-25 22:36 Qu Wenruo
  2025-04-25 22:36 ` [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Qu Wenruo @ 2025-04-25 22:36 UTC
  To: linux-btrfs

[CHANGELOG]
v5:
- Shrink the parameter list for btrfs_truncate_block()
  Remove the @front and @len, instead passing a new pair of @start/@end,
  so that we can determine if @from is in the head or tail block,
  thus no need for @front.

  This will give callers more freedom (a little too much),
  e.g. for the following zero range/hole punch case:

    Page size is 64K, fs block size is 4K.
    Truncation range is [6K, 58K).

    0        8K                32K                  56K      64K
    |      |/|//////////////////////////////////////|/|      |
           6K                                         58K

    To truncate the first block to zero out range [6K, 8K),
    caller can pass @from = 6K, @start = 6K, @end = 58K - 1.
    In fact, any @from inside range [6K, 8K) will work.

    To truncate the last block to zero out range [56K, 58K),
    caller can pass @from = 58K - 1, @start = 6K, @end = 58K - 1.
    Any @from inside range [56K, 58K) will also work.

    Furthermore, if aligned @from is passed in, e.g. 8K,
    btrfs_truncate_block() will detect that there is nothing to do,
    and exit properly.

- Only do the extra zeroing if we're truncating beyond EOF
  Especially for the recent large folios support, we can do a lot of
  unnecessary zeroing for a very large folio.

- Remove the lock-wait-retry loop if we're doing aligned truncation
  beyond EOF
  Since it's already EOF, there is no need to wait for the OE anyway.
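
For illustration only, here is a small userspace sketch of the head/tail
detection described in the first v5 item. It is not the kernel code:
needs_truncate() and round_down_u64() are hypothetical names, the 4K block
size is hard-coded, and is_inside_block() only mirrors the helper added by
patch 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCKSIZE 4096ULL /* assumed 4K fs block size, as in the example */

/* Round @v down to a multiple of @align (@align must be a power of two). */
static uint64_t round_down_u64(uint64_t v, uint64_t align)
{
	return v & ~(align - 1);
}

/* True if @bytenr falls inside the block that starts at @blockstart. */
static bool is_inside_block(uint64_t bytenr, uint64_t blockstart,
			    uint64_t blocksize)
{
	assert((blockstart & (blocksize - 1)) == 0);
	return blockstart <= bytenr && bytenr <= blockstart + blocksize - 1;
}

/*
 * Hypothetical helper: would a call with (@from, @start, @end) have any
 * partial block to zero?  @end is inclusive.
 */
bool needs_truncate(uint64_t from, uint64_t start, uint64_t end)
{
	bool in_head = is_inside_block(from, round_down_u64(start, BLOCKSIZE),
				       BLOCKSIZE);
	bool in_tail = is_inside_block(from, round_down_u64(end, BLOCKSIZE),
				       BLOCKSIZE);
	bool head_unaligned = (start % BLOCKSIZE) != 0;
	bool tail_unaligned = ((end + 1) % BLOCKSIZE) != 0;

	return (in_head && head_unaligned) || (in_tail && tail_unaligned);
}
```

With the [6K, 58K) range above, any @from in [4K, 8K) or [56K, 60K)
selects a block with work to do, while an aligned @from such as 8K falls
in neither block and is skipped.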

v4:
- Rebased to the latest for-next branch
  btrfs_free_extent_map() renames cause a minor conflict in the first
  patch.

v3:
- Fix a typo where @block_size should be @blocksize.
  There is a global function, block_size(), thus this typo will cause
  type conflicts inside round_down()/round_up().

v2:
- Fix a conversion bug in the first patch that leads to generic/008
  failure on x86_64
  The range is passed incorrectly and caused btrfs_truncate_block() to
  incorrectly skip an unaligned range.

Test case generic/363 always fails on subpage (fs block size < page size)
btrfses. There are mostly two kinds of problems here:

All examples are based on 64K page size and 4K fs block size.

1) EOF is polluted and btrfs_truncate_block() only zeros the block that
   needs to be written back

   
   0                           32K                           64K
   |                           |              |GGGGGGGGGGGGGG|
                                              50K EOF
   The original file is 50K in size (not 4K aligned), and fsx polluted the
   range beyond EOF through a memory mapped write.
   Since memory mapped writes are page based, and our page size is
   larger than the block size, the page range [0, 64K) covers blocks beyond
   EOF.

   The polluted range will not be written back, but it will still affect
   our page cache.

   Then some operation happens to expand the inode to size 64K.

   In that case btrfs_truncate_block() is called to trim the block
   [48K, 52K), and that block will be marked dirty for writeback.

   But the range [52K, 64K) is not touched at all, leaving the garbage
   hanging there and triggering the `fsx -e 1` failure.

   Fix this case by forcing btrfs_truncate_block() to zero any involved
   blocks. (Meanwhile, still only one block, [48K, 52K), will be written
   back.)

2) EOF is polluted and the original size is block aligned so
   btrfs_truncate_block() does nothing

   0                           32K                           64K
   |                           |                |GGGGGGGGGGGG|
                                                52K EOF

   Mostly the same as case 1, but this time since the inode size is
   block aligned, btrfs_truncate_block() will do nothing.

   This leaves the garbage range [52K, 64K) untouched and fails `fsx -e 1`
   runs.

   Fix this case by forcing btrfs_truncate_block() to zero any involved
   blocks when the btrfs is subpage and the range is aligned.
   This will not cause any new dirty blocks, it only zeroes the range
   beyond EOF to pass `fsx -e 1` runs.
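
As a rough sketch of the two cases, the following userspace C computes
which bytes get zeroed in the page cache and whether a block is dirtied
for writeback. struct eof_plan and plan_eof_zeroing() are hypothetical
names used only for this illustration, and the sketch assumes a single
page-sized folio and an old size that is not page aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What to zero in the page cache and what to write back. */
struct eof_plan {
	uint64_t zero_start;	/* first byte to zero (the old EOF) */
	uint64_t zero_end;	/* last byte to zero, inclusive (page end) */
	bool dirty;		/* true if one block must be written back */
	uint64_t dirty_start;	/* start of that block, valid only if dirty */
};

/*
 * Hypothetical helper: plan the zeroing when a file grows from @oldsize.
 * @blocksize and @pagesize are assumed to be powers of two.
 */
struct eof_plan plan_eof_zeroing(uint64_t oldsize, uint64_t blocksize,
				 uint64_t pagesize)
{
	struct eof_plan plan;

	/* The sketch only covers an EOF that lands inside a page. */
	assert((oldsize & (pagesize - 1)) != 0);

	plan.zero_start = oldsize;
	plan.zero_end = (oldsize & ~(pagesize - 1)) + pagesize - 1;
	/* Only an unaligned EOF block carries data that must reach disk. */
	plan.dirty = (oldsize & (blocksize - 1)) != 0;
	plan.dirty_start = oldsize & ~(blocksize - 1);
	return plan;
}
```

For case 1 (old size 50K) this gives a zero range of [50K, 64K) with the
block at 48K dirtied; for case 2 (old size 52K) it gives [52K, 64K) with
no dirty block.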

Qu Wenruo (2):
  btrfs: handle unaligned EOF truncation correctly for subpage cases
  btrfs: handle aligned EOF truncation correctly for subpage cases

 fs/btrfs/btrfs_inode.h |   3 +-
 fs/btrfs/file.c        |  34 ++++-----
 fs/btrfs/inode.c       | 152 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 147 insertions(+), 42 deletions(-)

-- 
2.49.0



* [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases
  2025-04-25 22:36 [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures Qu Wenruo
@ 2025-04-25 22:36 ` Qu Wenruo
  2025-05-06 17:29   ` Boris Burkov
  2025-04-25 22:36 ` [PATCH v5 2/2] btrfs: handle aligned " Qu Wenruo
  2025-05-05 15:33 ` [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures David Sterba
  2 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2025-04-25 22:36 UTC
  To: linux-btrfs

[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:

 #fsx -d -e 1 -N 4 $mnt/junk -S 36386
 READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
 OFFSET      GOOD    BAD     RANGE
 0xe9ba      0x0000  0x03ac  0x0
 operation# (mod 256) for the bad data may be 3
 ...
 LOG DUMP (4 total operations):
 1(  1 mod 256): WRITE    0x6c62 thru 0x1147d	(0xa81c bytes) HOLE	***WWWW
 2(  2 mod 256): TRUNCATE DOWN	from 0x1147e to 0x5448	******WWWW
 3(  3 mod 256): ZERO     0x1c7aa thru 0x28fe2	(0xc839 bytes)
 4(  4 mod 256): MAPREAD  0xe9ba thru 0x1578e	(0x6dd5 bytes)	***RRRR***

[CAUSE]
Only 2 operations are really involved in this case:

 3 pollute_eof	0x5448 thru	0xffff	(0xabb8 bytes)
 3 zero	from 0x1c7aa to 0x28fe3, (0xc839 bytes)
 4 mapread	0xe9ba thru	0x1578e	(0x6dd5 bytes)

At operation 3, fsx pollutes the range beyond EOF. That is done by mmap()
and a write into that mmap() range beyond EOF.

Such a write will fill the range beyond EOF, but it will never reach disk
as ranges beyond EOF will not be marked dirty nor uptodate.

Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond
EOF (0x5448).

During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block.
But that function only really zeroes out the block at [0x5000, 0x5fff]. It
doesn't touch any range other than that block, since those ranges will
not be marked dirty nor written back.

So the range [0x6000, 0xffff] is still polluted, and later mapread()
will return the poisoned value.
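
The arithmetic behind those ranges can be checked with a short userspace
sketch. leftover_polluted() is a hypothetical helper written only for this
illustration. It assumes power-of-two sizes and that the polluted region
lies within the single page containing the old isize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: after zeroing only the block that contains @isize,
 * report the part of the same page that is still polluted as the
 * inclusive range [*lo, *hi].  Returns false if nothing is left over.
 */
bool leftover_polluted(uint64_t isize, uint64_t blocksize, uint64_t pagesize,
		       uint64_t *lo, uint64_t *hi)
{
	uint64_t block_end = (isize & ~(blocksize - 1)) + blocksize - 1;
	uint64_t page_end = (isize & ~(pagesize - 1)) + pagesize - 1;

	assert(blocksize <= pagesize);

	/* With block size == page size the block covers the whole page. */
	if (block_end >= page_end)
		return false;

	*lo = block_end + 1;
	*hi = page_end;
	return true;
}
```

For isize = 0x5448 with 4K blocks and 64K pages, this yields
[0x6000, 0xffff], which contains the first bad offset 0xe9ba reported by
fsx. With a 4K page size nothing is left over, which is consistent with
the failure only showing up on subpage setups.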

[FIX]
Enhance btrfs_truncate_block() by:

- Pass a @start/@end pair to indicate the full truncation range
  This is to handle the following truncation case:

    Page size is 64K, fs block size is 4K, truncate range is
    [6K, 60K]

    0                      32K                    64K
    |   |///////////////////////////////////|     |
        6K                                  60K

    The range is not aligned for its head block, so we need to call
    btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.

    But with that info we only know to zero the range [6K, 8K),
    if we zero out the range [6K, 64K), the last block will also be
    zeroed, causing data loss.

  So here we need the full range we're truncating, so that we can avoid
  over-truncation.

- Remove @front parameter
  With the full truncate range passed in, we can determine if the @from
  is at the head or tail block.

- Skip truncation if @from is not in the head nor tail blocks
  The call site in hole punch unconditionally calls
  btrfs_truncate_block() without even checking whether the range is
  aligned or not.
  If the @from is not in the head nor tail block, it means we can safely
  ignore it.

- Skip truncate if the range inside the target block is already aligned

- Make btrfs_truncate_block() zero all blocks beyond EOF
  Since we have the original range, we know exactly if we're doing
  truncation beyond EOF (the @end will be (u64)-1).

  If we're doing truncation beyond EOF, then enlarge the truncation
  range to the folio end, to address the possibly polluted ranges.

  Otherwise still keep the zero range inside the block, as we can have
  large data folios soon, and always truncating every block inside the
  same folio can be costly for large folios.
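
A minimal userspace sketch of the zero range selection described in the
last item follows. It only models the arithmetic: struct zero_range,
pick_zero_range() and the min/max helpers are hypothetical names, and the
folio position and size are passed in as plain integers rather than taken
from a real folio.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Inclusive byte range to zero inside the folio. */
struct zero_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Hypothetical helper: pick the range to zero for a call with
 * (@from, @start, @end), where @end is inclusive and UINT64_MAX stands
 * in for (u64)-1, i.e. truncation beyond EOF.
 */
struct zero_range pick_zero_range(uint64_t folio_pos, uint64_t folio_size,
				  uint64_t from, uint64_t blocksize,
				  uint64_t start, uint64_t end)
{
	struct zero_range zr;
	uint64_t block_start = from & ~(blocksize - 1);
	uint64_t block_end = block_start + blocksize - 1;

	assert(start <= from && from <= end);

	if (end == UINT64_MAX) {
		/* Beyond EOF: extend the zeroing to the end of the folio. */
		zr.start = max_u64(folio_pos, start);
		zr.end = min_u64(folio_pos + folio_size - 1, end);
	} else {
		/* Otherwise stay inside the block that contains @from. */
		zr.start = max_u64(block_start, start);
		zr.end = min_u64(block_end, end);
	}
	return zr;
}
```

For the [6K, 60K] example with @from = 6K, only [6K, 8K) is zeroed, so
the data after 60K is preserved. For the fsx case (@from = @start =
0x5448, @end = (u64)-1) the range becomes [0x5448, 0xffff], which covers
the polluted area.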

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h |  3 +-
 fs/btrfs/file.c        | 34 ++++++++-------
 fs/btrfs/inode.c       | 99 ++++++++++++++++++++++++++++++++----------
 3 files changed, 94 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 61fad5423b6a..8dc583e14bed 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -565,8 +565,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
 		   struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
 		   const struct fscrypt_str *name, int add_backref, u64 index);
 int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry);
-int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
-			 int front);
+int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end);
 
 int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e688587329de..5eaa389bfde5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2614,7 +2614,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	u64 lockend;
 	u64 tail_start;
 	u64 tail_len;
-	u64 orig_start = offset;
+	const u64 orig_start = offset;
+	const u64 orig_end = offset + len - 1;
 	int ret = 0;
 	bool same_block;
 	u64 ino_size;
@@ -2645,10 +2646,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	lockend = round_down(offset + len, fs_info->sectorsize) - 1;
 	same_block = (BTRFS_BYTES_TO_BLKS(fs_info, offset))
 		== (BTRFS_BYTES_TO_BLKS(fs_info, offset + len - 1));
-	/*
-	 * We needn't truncate any block which is beyond the end of the file
-	 * because we are sure there is no data there.
-	 */
 	/*
 	 * Only do this if we are in the same block and we aren't doing the
 	 * entire block.
@@ -2656,8 +2653,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	if (same_block && len < fs_info->sectorsize) {
 		if (offset < ino_size) {
 			truncated_block = true;
-			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
-						   0);
+			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
+						   orig_start, orig_end);
 		} else {
 			ret = 0;
 		}
@@ -2667,7 +2664,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 	/* zero back part of the first block */
 	if (offset < ino_size) {
 		truncated_block = true;
-		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
+		ret = btrfs_truncate_block(BTRFS_I(inode), offset, orig_start, orig_end);
 		if (ret) {
 			btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 			return ret;
@@ -2704,8 +2701,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 			if (tail_start + tail_len < ino_size) {
 				truncated_block = true;
 				ret = btrfs_truncate_block(BTRFS_I(inode),
-							tail_start + tail_len,
-							0, 1);
+							tail_start + tail_len - 1,
+							orig_start, orig_end);
 				if (ret)
 					goto out_only_mutex;
 			}
@@ -2873,6 +2870,8 @@ static int btrfs_zero_range(struct inode *inode,
 	int ret;
 	u64 alloc_hint = 0;
 	const u64 sectorsize = fs_info->sectorsize;
+	const u64 orig_start = offset;
+	const u64 orig_end = offset + len - 1;
 	u64 alloc_start = round_down(offset, sectorsize);
 	u64 alloc_end = round_up(offset + len, sectorsize);
 	u64 bytes_to_reserve = 0;
@@ -2935,8 +2934,9 @@ static int btrfs_zero_range(struct inode *inode,
 		}
 		if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
 			btrfs_free_extent_map(em);
-			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
-						   0);
+			ret = btrfs_truncate_block(BTRFS_I(inode),
+						   offset + len - 1,
+						   orig_start, orig_end);
 			if (!ret)
 				ret = btrfs_fallocate_update_isize(inode,
 								   offset + len,
@@ -2967,7 +2967,8 @@ static int btrfs_zero_range(struct inode *inode,
 			alloc_start = round_down(offset, sectorsize);
 			ret = 0;
 		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
-			ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
+			ret = btrfs_truncate_block(BTRFS_I(inode), offset,
+						   orig_start, orig_end);
 			if (ret)
 				goto out;
 		} else {
@@ -2984,8 +2985,8 @@ static int btrfs_zero_range(struct inode *inode,
 			alloc_end = round_up(offset + len, sectorsize);
 			ret = 0;
 		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
-			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len,
-						   0, 1);
+			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
+						   orig_start, orig_end);
 			if (ret)
 				goto out;
 		} else {
@@ -3105,7 +3106,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 		 * need to zero out the end of the block if i_size lands in the
 		 * middle of a block.
 		 */
-		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0);
+		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size,
+					   inode->i_size, (u64)-1);
 		if (ret)
 			goto out;
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 538a9ec86abc..08dda7b0883f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4760,20 +4760,32 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
 	return ret;
 }
 
+static bool is_inside_block(u64 bytenr, u64 blockstart, u32 blocksize)
+{
+	ASSERT(IS_ALIGNED(blockstart, blocksize), "blockstart=%llu blocksize=%u",
+		blockstart, blocksize);
+
+	if (blockstart <= bytenr && bytenr <= blockstart + blocksize - 1)
+		return true;
+	return false;
+}
+
 /*
- * Read, zero a chunk and write a block.
+ * Handle the truncation of a fs block.
+ *
+ * If the range is not block aligned, read out the folio covers @from, and
+ * zero any blocks that are inside the folio and covered by [@start, @end).
+ * If @start or @end + 1 lands inside a block, that block will be marked dirty
+ * for writeback.
+ *
+ * This is utilized by hole punch, zero range, file expansion.
  *
  * @inode - inode that we're zeroing
  * @from - the offset to start zeroing
- * @len - the length to zero, 0 to zero the entire range respective to the
- *	offset
- * @front - zero up to the offset instead of from the offset on
- *
- * This will find the block for the "from" offset and cow the block and zero the
- * part we want to zero.  This is used with truncate and hole punching.
+ * @start - the start file offset of the range we want to zero
+ * @end - the end (inclusive) file offset of the range we want to zero.
  */
-int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
-			 int front)
+int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct address_space *mapping = inode->vfs_inode.i_mapping;
@@ -4784,16 +4796,45 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 	bool only_release_metadata = false;
 	u32 blocksize = fs_info->sectorsize;
 	pgoff_t index = from >> PAGE_SHIFT;
-	unsigned offset = from & (blocksize - 1);
 	struct folio *folio;
 	gfp_t mask = btrfs_alloc_write_mask(mapping);
 	size_t write_bytes = blocksize;
 	int ret = 0;
+	const bool in_head_block = is_inside_block(from, round_down(start, blocksize),
+						   blocksize);
+	const bool in_tail_block = is_inside_block(from, round_down(end, blocksize),
+						   blocksize);
+	bool need_truncate_head = false;
+	bool need_truncate_tail = false;
+	u64 zero_start;
+	u64 zero_end;
 	u64 block_start;
 	u64 block_end;
 
-	if (IS_ALIGNED(offset, blocksize) &&
-	    (!len || IS_ALIGNED(len, blocksize)))
+	/* @from should be inside the range. */
+	ASSERT(start <= from && from <= end, "from=%llu start=%llu end=%llu",
+	       from, start, end);
+
+	/* The range is aligned at both ends. */
+	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize))
+		goto out;
+
+	/*
+	 * @from may not be inside the head nor tail block. In that case
+	 * we need to do nothing.
+	 */
+	if (!in_head_block && !in_tail_block)
+		goto out;
+
+	/*
+	 * Skip the truncatioin if the range in the target block is already aligned.
+	 * The seemingly complex check will also handle the same block case.
+	 */
+	if (in_head_block && !IS_ALIGNED(start, blocksize))
+		need_truncate_head = true;
+	if (in_tail_block && !IS_ALIGNED(end + 1, blocksize))
+		need_truncate_tail = true;
+	if (!need_truncate_head && !need_truncate_tail)
 		goto out;
 
 	block_start = round_down(from, blocksize);
@@ -4876,17 +4917,26 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
 		goto out_unlock;
 	}
 
-	if (offset != blocksize) {
-		if (!len)
-			len = blocksize - offset;
-		if (front)
-			folio_zero_range(folio, block_start - folio_pos(folio),
-					 offset);
-		else
-			folio_zero_range(folio,
-					 (block_start - folio_pos(folio)) + offset,
-					 len);
+	if (end == (u64)-1) {
+		/*
+		 * We're truncating beyond EOF, the remaining blocks normally
+		 * are already holes thus no need to zero again, but it's
+		 * possible for fs block size < page size cases to have memory
+		 * mapped writes to pollute ranges beyond EOF.
+		 *
+		 * In that case although such polluted blocks beyond EOF will
+		 * not reach disk, it still affects our page caches.
+		 */
+		zero_start = max_t(u64, folio_pos(folio), start);
+		zero_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
+				 end);
+	} else {
+		zero_start = max_t(u64, block_start, start);
+		zero_end = min_t(u64, block_end, end);
 	}
+	folio_zero_range(folio, zero_start - folio_pos(folio),
+			 zero_end - zero_start + 1);
+
 	btrfs_folio_clear_checked(fs_info, folio, block_start,
 				  block_end + 1 - block_start);
 	btrfs_folio_set_dirty(fs_info, folio, block_start,
@@ -4988,7 +5038,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 	 * rest of the block before we expand the i_size, otherwise we could
 	 * expose stale data.
 	 */
-	ret = btrfs_truncate_block(inode, oldsize, 0, 0);
+	ret = btrfs_truncate_block(inode, oldsize, oldsize, -1);
 	if (ret)
 		return ret;
 
@@ -7623,7 +7673,8 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 		btrfs_end_transaction(trans);
 		btrfs_btree_balance_dirty(fs_info);
 
-		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, 0, 0);
+		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size,
+					   inode->vfs_inode.i_size, (u64)-1);
 		if (ret)
 			goto out;
 		trans = btrfs_start_transaction(root, 1);
-- 
2.49.0



* Re: [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases
  2025-04-25 22:36 ` [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases Qu Wenruo
@ 2025-05-06 17:29   ` Boris Burkov
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2025-05-06 17:29 UTC
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Apr 26, 2025 at 08:06:49AM +0930, Qu Wenruo wrote:
> [BUG]
> The following fsx sequence will fail on btrfs with 64K page size and 4K
> fs block size:
> 
>  #fsx -d -e 1 -N 4 $mnt/junk -S 36386
>  READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
>  OFFSET      GOOD    BAD     RANGE
>  0xe9ba      0x0000  0x03ac  0x0
>  operation# (mod 256) for the bad data may be 3
>  ...
>  LOG DUMP (4 total operations):
>  1(  1 mod 256): WRITE    0x6c62 thru 0x1147d	(0xa81c bytes) HOLE	***WWWW
>  2(  2 mod 256): TRUNCATE DOWN	from 0x1147e to 0x5448	******WWWW
>  3(  3 mod 256): ZERO     0x1c7aa thru 0x28fe2	(0xc839 bytes)
>  4(  4 mod 256): MAPREAD  0xe9ba thru 0x1578e	(0x6dd5 bytes)	***RRRR***
> 
> [CAUSE]
> Only 2 operations are really involved in this case:
> 
>  3 pollute_eof	0x5448 thru	0xffff	(0xabb8 bytes)
>  3 zero	from 0x1c7aa to 0x28fe3, (0xc839 bytes)
>  4 mapread	0xe9ba thru	0x1578e	(0x6dd5 bytes)
> 
> At operation 3, fsx pollutes the range beyond EOF. That is done by mmap()
> and a write into that mmap() range beyond EOF.
> 
> Such a write will fill the range beyond EOF, but it will never reach disk
> as ranges beyond EOF will not be marked dirty nor uptodate.
> 
> Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
> our isize (which was 0x5448), we should zero out any range beyond
> EOF (0x5448).
> 
> During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
> unaligned head block.
> But that function only really zeroes out the block at [0x5000, 0x5fff]. It
> doesn't touch any range other than that block, since those ranges will
> not be marked dirty nor written back.
> 
> So the range [0x6000, 0xffff] is still polluted, and later mapread()
> will return the poisoned value.
> 
> [FIX]
> Enhance btrfs_truncate_block() by:
> 
> - Pass a @start/@end pair to indicate the full truncation range
>   This is to handle the following truncation case:
> 
>     Page size is 64K, fs block size is 4K, truncate range is
>     [6K, 60K]
> 
>     0                      32K                    64K
>     |   |///////////////////////////////////|     |
>         6K                                  60K
> 
>     The range is not aligned for its head block, so we need to call
>     btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.
> 
>     But with that info we only know to zero the range [6K, 8K),
>     if we zero out the range [6K, 64K), the last block will also be
>     zeroed, causing data loss.
> 
>   So here we need the full range we're truncating, so that we can avoid
>   over-truncation.
> 
> - Remove @front parameter
>   With the full truncate range passed in, we can determine if the @from
>   is at the head or tail block.
> 
> - Skip truncation if @from is not in the head nor tail blocks
>   The call site in hole punch unconditionally calls
>   btrfs_truncate_block() without even checking whether the range is
>   aligned or not.
>   If the @from is not in the head nor tail block, it means we can safely
>   ignore it.
> 
> - Skip truncate if the range inside the target block is already aligned
> 
> - Make btrfs_truncate_block() zero all blocks beyond EOF
>   Since we have the original range, we know exactly if we're doing
>   truncation beyond EOF (the @end will be (u64)-1).
> 
>   If we're doing truncation beyond EOF, then enlarge the truncation
>   range to the folio end, to address the possibly polluted ranges.
> 
>   Otherwise still keep the zero range inside the block, as we can have
>   large data folios soon, and always truncating every block inside the
>   same folio can be costly for large folios.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I like this version a lot more. I think it is quite a bit clearer on the
dual purpose of the block read/zero/dirty and the range zeroing.

A few more minor nits, but you can add
Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/btrfs_inode.h |  3 +-
>  fs/btrfs/file.c        | 34 ++++++++-------
>  fs/btrfs/inode.c       | 99 ++++++++++++++++++++++++++++++++----------
>  3 files changed, 94 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 61fad5423b6a..8dc583e14bed 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -565,8 +565,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
>  		   struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
>  		   const struct fscrypt_str *name, int add_backref, u64 index);
>  int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry);
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> -			 int front);
> +int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end);
>  
>  int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
>  int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e688587329de..5eaa389bfde5 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2614,7 +2614,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>  	u64 lockend;
>  	u64 tail_start;
>  	u64 tail_len;
> -	u64 orig_start = offset;
> +	const u64 orig_start = offset;
> +	const u64 orig_end = offset + len - 1;
>  	int ret = 0;
>  	bool same_block;
>  	u64 ino_size;
> @@ -2645,10 +2646,6 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>  	lockend = round_down(offset + len, fs_info->sectorsize) - 1;
>  	same_block = (BTRFS_BYTES_TO_BLKS(fs_info, offset))
>  		== (BTRFS_BYTES_TO_BLKS(fs_info, offset + len - 1));
> -	/*
> -	 * We needn't truncate any block which is beyond the end of the file
> -	 * because we are sure there is no data there.
> -	 */
>  	/*
>  	 * Only do this if we are in the same block and we aren't doing the
>  	 * entire block.
> @@ -2656,8 +2653,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>  	if (same_block && len < fs_info->sectorsize) {
>  		if (offset < ino_size) {
>  			truncated_block = true;
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
> -						   0);
> +			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
> +						   orig_start, orig_end);
>  		} else {
>  			ret = 0;
>  		}
> @@ -2667,7 +2664,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>  	/* zero back part of the first block */
>  	if (offset < ino_size) {
>  		truncated_block = true;
> -		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> +		ret = btrfs_truncate_block(BTRFS_I(inode), offset, orig_start, orig_end);
>  		if (ret) {
>  			btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>  			return ret;
> @@ -2704,8 +2701,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>  			if (tail_start + tail_len < ino_size) {
>  				truncated_block = true;
>  				ret = btrfs_truncate_block(BTRFS_I(inode),
> -							tail_start + tail_len,
> -							0, 1);
> +							tail_start + tail_len - 1,
> +							orig_start, orig_end);
>  				if (ret)
>  					goto out_only_mutex;
>  			}
> @@ -2873,6 +2870,8 @@ static int btrfs_zero_range(struct inode *inode,
>  	int ret;
>  	u64 alloc_hint = 0;
>  	const u64 sectorsize = fs_info->sectorsize;
> +	const u64 orig_start = offset;
> +	const u64 orig_end = offset + len - 1;
>  	u64 alloc_start = round_down(offset, sectorsize);
>  	u64 alloc_end = round_up(offset + len, sectorsize);
>  	u64 bytes_to_reserve = 0;
> @@ -2935,8 +2934,9 @@ static int btrfs_zero_range(struct inode *inode,
>  		}
>  		if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
>  			btrfs_free_extent_map(em);
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
> -						   0);
> +			ret = btrfs_truncate_block(BTRFS_I(inode),
> +						   offset + len - 1,
> +						   orig_start, orig_end);
>  			if (!ret)
>  				ret = btrfs_fallocate_update_isize(inode,
>  								   offset + len,
> @@ -2967,7 +2967,8 @@ static int btrfs_zero_range(struct inode *inode,
>  			alloc_start = round_down(offset, sectorsize);
>  			ret = 0;
>  		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> +			ret = btrfs_truncate_block(BTRFS_I(inode), offset,
> +						   orig_start, orig_end);
>  			if (ret)
>  				goto out;
>  		} else {
> @@ -2984,8 +2985,8 @@ static int btrfs_zero_range(struct inode *inode,
>  			alloc_end = round_up(offset + len, sectorsize);
>  			ret = 0;
>  		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len,
> -						   0, 1);
> +			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len - 1,
> +						   orig_start, orig_end);
>  			if (ret)
>  				goto out;
>  		} else {
> @@ -3105,7 +3106,8 @@ static long btrfs_fallocate(struct file *file, int mode,
>  		 * need to zero out the end of the block if i_size lands in the
>  		 * middle of a block.
>  		 */
> -		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0);
> +		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size,
> +					   inode->i_size, (u64)-1);
>  		if (ret)
>  			goto out;
>  	}
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 538a9ec86abc..08dda7b0883f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4760,20 +4760,32 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
>  	return ret;
>  }
>  
> +static bool is_inside_block(u64 bytenr, u64 blockstart, u32 blocksize)
> +{
> +	ASSERT(IS_ALIGNED(blockstart, blocksize), "blockstart=%llu blocksize=%u",
> +		blockstart, blocksize);
> +
> +	if (blockstart <= bytenr && bytenr <= blockstart + blocksize - 1)
> +		return true;
> +	return false;
> +}
> +
>  /*
> - * Read, zero a chunk and write a block.
> + * Handle the truncation of a fs block.
> + *
> + * If the range is not block aligned, read out the folio covers @from, and

"folio covers @from" doesn't parse that well for me. Did you mean "the
folio that covers @from"?

> + * zero any blocks that are inside the folio and covered by [@start, @end).
> + * If @start or @end + 1 lands inside a block, that block will be marked dirty
> + * for writeback.
> + *
> + * This is utilized by hole punch, zero range, file expansion.
>   *
>   * @inode - inode that we're zeroing
>   * @from - the offset to start zeroing

I think "the offset to start zeroing" is misleading, since we "zero" all
of [start, end]. "Offset of the block to truncate" or something?
Anything that indicates that it is for locating the block of interest I
think is good enough. In fact, "from" is not really the best name
anymore, so maybe block_offset or something. If that's too much of a
hassle, no problem.

> - * @len - the length to zero, 0 to zero the entire range respective to the
> - *	offset
> - * @front - zero up to the offset instead of from the offset on
> - *
> - * This will find the block for the "from" offset and cow the block and zero the
> - * part we want to zero.  This is used with truncate and hole punching.
> + * @start - the start file offset of the range we want to zero
> + * @end - the end (inclusive) file offset of the range we want to zero.
>   */
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> -			 int front)
> +int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end)
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>  	struct address_space *mapping = inode->vfs_inode.i_mapping;
> @@ -4784,16 +4796,45 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
>  	bool only_release_metadata = false;
>  	u32 blocksize = fs_info->sectorsize;
>  	pgoff_t index = from >> PAGE_SHIFT;
> -	unsigned offset = from & (blocksize - 1);
>  	struct folio *folio;
>  	gfp_t mask = btrfs_alloc_write_mask(mapping);
>  	size_t write_bytes = blocksize;
>  	int ret = 0;
> +	const bool in_head_block = is_inside_block(from, round_down(start, blocksize),
> +						   blocksize);
> +	const bool in_tail_block = is_inside_block(from, round_down(end, blocksize),
> +						   blocksize);
> +	bool need_truncate_head = false;
> +	bool need_truncate_tail = false;
> +	u64 zero_start;
> +	u64 zero_end;
>  	u64 block_start;
>  	u64 block_end;
>  
> -	if (IS_ALIGNED(offset, blocksize) &&
> -	    (!len || IS_ALIGNED(len, blocksize)))
> +	/* @from should be inside the range. */
> +	ASSERT(start <= from && from <= end, "from=%llu start=%llu end=%llu",
> +	       from, start, end);
> +
> +	/* The range is aligned at both ends. */
> +	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize))
> +		goto out;
> +
> +	/*
> +	 * @from may not be inside the head nor tail block. In that case
> +	 * we need to do nothing.
> +	 */
> +	if (!in_head_block && !in_tail_block)
> +		goto out;
> +
> +	/*
> +	 * Skip the truncatioin if the range in the target block is already aligned.

typo: truncation

> +	 * The seemingly complex check will also handle the same block case.
> +	 */
> +	if (in_head_block && !IS_ALIGNED(start, blocksize))
> +		need_truncate_head = true;
> +	if (in_tail_block && !IS_ALIGNED(end + 1, blocksize))
> +		need_truncate_tail = true;
> +	if (!need_truncate_head && !need_truncate_tail)
>  		goto out;
>  
>  	block_start = round_down(from, blocksize);
> @@ -4876,17 +4917,26 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
>  		goto out_unlock;
>  	}
>  
> -	if (offset != blocksize) {
> -		if (!len)
> -			len = blocksize - offset;
> -		if (front)
> -			folio_zero_range(folio, block_start - folio_pos(folio),
> -					 offset);
> -		else
> -			folio_zero_range(folio,
> -					 (block_start - folio_pos(folio)) + offset,
> -					 len);
> +	if (end == (u64)-1) {
> +		/*
> +		 * We're truncating beyond EOF, the remaining blocks normally
> +		 * are already holes thus no need to zero again, but it's
> +		 * possible for fs block size < page size cases to have memory
> +		 * mapped writes to pollute ranges beyond EOF.
> +		 *
> +		 * In that case although such polluted blocks beyond EOF will
> +		 * not reach disk, it still affects our page caches.
> +		 */
> +		zero_start = max_t(u64, folio_pos(folio), start);
> +		zero_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
> +				 end);
> +	} else {
> +		zero_start = max_t(u64, block_start, start);
> +		zero_end = min_t(u64, block_end, end);
>  	}
> +	folio_zero_range(folio, zero_start - folio_pos(folio),
> +			 zero_end - zero_start + 1);
> +
>  	btrfs_folio_clear_checked(fs_info, folio, block_start,
>  				  block_end + 1 - block_start);
>  	btrfs_folio_set_dirty(fs_info, folio, block_start,
> @@ -4988,7 +5038,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>  	 * rest of the block before we expand the i_size, otherwise we could
>  	 * expose stale data.
>  	 */
> -	ret = btrfs_truncate_block(inode, oldsize, 0, 0);
> +	ret = btrfs_truncate_block(inode, oldsize, oldsize, -1);
>  	if (ret)
>  		return ret;
>  
> @@ -7623,7 +7673,8 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>  		btrfs_end_transaction(trans);
>  		btrfs_btree_balance_dirty(fs_info);
>  
> -		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, 0, 0);
> +		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size,
> +					   inode->vfs_inode.i_size, (u64)-1);
>  		if (ret)
>  			goto out;
>  		trans = btrfs_start_transaction(root, 1);
> -- 
> 2.49.0
> 

* [PATCH v5 2/2] btrfs: handle aligned EOF truncation correctly for subpage cases
  2025-04-25 22:36 [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures Qu Wenruo
  2025-04-25 22:36 ` [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases Qu Wenruo
@ 2025-04-25 22:36 ` Qu Wenruo
  2025-05-06 17:25   ` Boris Burkov
  2025-05-05 15:33 ` [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures David Sterba
  2 siblings, 1 reply; 7+ messages in thread
From: Qu Wenruo @ 2025-04-25 22:36 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
For the following fsx -e 1 run, btrfs still fails on 64K page size
with 4K fs block size:

READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
OFFSET      GOOD    BAD     RANGE
0x26b3a     0x0000  0x15b4  0x0
operation# (mod 256) for the bad data may be 21
[...]
LOG DUMP (28 total operations):
1(  1 mod 256): SKIPPED (no operation)
2(  2 mod 256): SKIPPED (no operation)
3(  3 mod 256): SKIPPED (no operation)
4(  4 mod 256): SKIPPED (no operation)
5(  5 mod 256): WRITE    0x1ea90 thru 0x285e0	(0x9b51 bytes) HOLE
6(  6 mod 256): ZERO     0x1b1a8 thru 0x20bd4	(0x5a2d bytes)
7(  7 mod 256): FALLOC   0x22b1a thru 0x272fa	(0x47e0 bytes) INTERIOR
8(  8 mod 256): WRITE    0x741d thru 0x13522	(0xc106 bytes)
9(  9 mod 256): MAPWRITE 0x73ee thru 0xdeeb	(0x6afe bytes)
10( 10 mod 256): FALLOC   0xb719 thru 0xb994	(0x27b bytes) INTERIOR
11( 11 mod 256): COPY 0x15ed8 thru 0x18be1	(0x2d0a bytes) to 0x25f6e thru 0x28c77
12( 12 mod 256): ZERO     0x1615e thru 0x1770e	(0x15b1 bytes)
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff	(0x8000 bytes) to 0x1000 thru 0x8fff
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): CLONE 0xa000 thru 0xffff	(0x6000 bytes) to 0x36000 thru 0x3bfff
17( 17 mod 256): ZERO     0x14adc thru 0x1b78a	(0x6caf bytes)
18( 18 mod 256): TRUNCATE DOWN	from 0x3c000 to 0x1e2e3	******WWWW
19( 19 mod 256): CLONE 0x4000 thru 0x11fff	(0xe000 bytes) to 0x16000 thru 0x23fff
20( 20 mod 256): FALLOC   0x311e1 thru 0x3681b	(0x563a bytes) PAST_EOF
21( 21 mod 256): FALLOC   0x351c5 thru 0x40000	(0xae3b bytes) EXTENDING
22( 22 mod 256): WRITE    0x920 thru 0x7e51	(0x7532 bytes)
23( 23 mod 256): COPY 0x2b58 thru 0xc508	(0x99b1 bytes) to 0x117b1 thru 0x1b161
24( 24 mod 256): TRUNCATE DOWN	from 0x40000 to 0x3c9a5
25( 25 mod 256): SKIPPED (no operation)
26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06	(0x1ae7 bytes)
27( 27 mod 256): SKIPPED (no operation)
28( 28 mod 256): READ     0x26b3a thru 0x36633	(0xfafa bytes)	***RRRR***

[CAUSE]
The involved operations are:

 fallocating to largest ever: 0x40000
 21 pollute_eof	0x24000 thru	0x2ffff	(0xc000 bytes)
 21 falloc	from 0x351c5 to 0x40000 (0xae3b bytes)
 28 read	0x26b3a thru	0x36633	(0xfafa bytes)

At operation #21 a pollute_eof is done, by a memory mapped write into
range [0x24000, 0x2ffff].
At this stage, the inode size is 0x24000, which is block aligned.

Then fallocate happens, and since it's expanding the inode, it will call
btrfs_truncate_block() to truncate any unaligned range.

But since the inode size is already block aligned,
btrfs_truncate_block() does nothing and exits.

However, remember that the folio at 0x20000 already has some range
polluted. Although the polluted range will not be written back to disk,
it still affects the page cache, causing the later operation #28 to
read out the polluted value.

[FIX]
Instead of exiting early from btrfs_truncate_block() if the range is
already block aligned, do extra folio zeroing if the fs block size is
smaller than the page size and we're truncating beyond EOF.

This is to address exactly the above case where memory mapped write can
still leave some garbage beyond EOF.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 08dda7b0883f..e6bb604917a6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4770,6 +4770,52 @@ static bool is_inside_block(u64 bytenr, u64 blockstart, u32 blocksize)
 	return false;
 }
 
+static int truncate_block_zero_beyond_eof(struct btrfs_inode *inode, u64 start)
+{
+	const pgoff_t index = start >> PAGE_SHIFT;
+	struct address_space *mapping = inode->vfs_inode.i_mapping;
+	struct folio *folio;
+	u64 zero_start;
+	u64 zero_end;
+	int ret = 0;
+
+again:
+	folio = filemap_lock_folio(mapping, index);
+	/* No folio present. */
+	if (IS_ERR(folio))
+		return 0;
+
+	if (!folio_test_uptodate(folio)) {
+		ret = btrfs_read_folio(NULL, folio);
+		folio_lock(folio);
+		if (folio->mapping != mapping) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto again;
+		}
+		if (!folio_test_uptodate(folio)) {
+			ret = -EIO;
+			goto out_unlock;
+		}
+	}
+	folio_wait_writeback(folio);
+
+	/*
+	 * We do not need to lock extents nor wait for OE, as it's already
+	 * beyond EOF.
+	 */
+
+	zero_start = max_t(u64, folio_pos(folio), start);
+	zero_end = folio_pos(folio) + folio_size(folio) - 1;
+	folio_zero_range(folio, zero_start - folio_pos(folio),
+			 zero_end - zero_start + 1);
+
+out_unlock:
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+}
+
 /*
  * Handle the truncation of a fs block.
  *
@@ -4816,8 +4862,15 @@ int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end
 	       from, start, end);
 
 	/* The range is aligned at both ends. */
-	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize))
+	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize)) {
+		/*
+		 * For block size < page size case, we may have polluted blocks
+		 * beyond EOF. So we also need to zero them out.
+		 */
+		if (end == (u64)-1 && blocksize < PAGE_SIZE)
+			ret = truncate_block_zero_beyond_eof(inode, start);
 		goto out;
+	}
 
 	/*
 	 * @from may not be inside the head nor tail block. In that case
-- 
2.49.0


* Re: [PATCH v5 2/2] btrfs: handle aligned EOF truncation correctly for subpage cases
  2025-04-25 22:36 ` [PATCH v5 2/2] btrfs: handle aligned " Qu Wenruo
@ 2025-05-06 17:25   ` Boris Burkov
  0 siblings, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2025-05-06 17:25 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Apr 26, 2025 at 08:06:50AM +0930, Qu Wenruo wrote:
> [BUG]
> For the following fsx -e 1 run, btrfs still fails on 64K page size
> with 4K fs block size:
> 
> READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
> OFFSET      GOOD    BAD     RANGE
> 0x26b3a     0x0000  0x15b4  0x0
> operation# (mod 256) for the bad data may be 21
> [...]
> LOG DUMP (28 total operations):
> 1(  1 mod 256): SKIPPED (no operation)
> 2(  2 mod 256): SKIPPED (no operation)
> 3(  3 mod 256): SKIPPED (no operation)
> 4(  4 mod 256): SKIPPED (no operation)
> 5(  5 mod 256): WRITE    0x1ea90 thru 0x285e0	(0x9b51 bytes) HOLE
> 6(  6 mod 256): ZERO     0x1b1a8 thru 0x20bd4	(0x5a2d bytes)
> 7(  7 mod 256): FALLOC   0x22b1a thru 0x272fa	(0x47e0 bytes) INTERIOR
> 8(  8 mod 256): WRITE    0x741d thru 0x13522	(0xc106 bytes)
> 9(  9 mod 256): MAPWRITE 0x73ee thru 0xdeeb	(0x6afe bytes)
> 10( 10 mod 256): FALLOC   0xb719 thru 0xb994	(0x27b bytes) INTERIOR
> 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1	(0x2d0a bytes) to 0x25f6e thru 0x28c77
> 12( 12 mod 256): ZERO     0x1615e thru 0x1770e	(0x15b1 bytes)
> 13( 13 mod 256): SKIPPED (no operation)
> 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff	(0x8000 bytes) to 0x1000 thru 0x8fff
> 15( 15 mod 256): SKIPPED (no operation)
> 16( 16 mod 256): CLONE 0xa000 thru 0xffff	(0x6000 bytes) to 0x36000 thru 0x3bfff
> 17( 17 mod 256): ZERO     0x14adc thru 0x1b78a	(0x6caf bytes)
> 18( 18 mod 256): TRUNCATE DOWN	from 0x3c000 to 0x1e2e3	******WWWW
> 19( 19 mod 256): CLONE 0x4000 thru 0x11fff	(0xe000 bytes) to 0x16000 thru 0x23fff
> 20( 20 mod 256): FALLOC   0x311e1 thru 0x3681b	(0x563a bytes) PAST_EOF
> 21( 21 mod 256): FALLOC   0x351c5 thru 0x40000	(0xae3b bytes) EXTENDING
> 22( 22 mod 256): WRITE    0x920 thru 0x7e51	(0x7532 bytes)
> 23( 23 mod 256): COPY 0x2b58 thru 0xc508	(0x99b1 bytes) to 0x117b1 thru 0x1b161
> 24( 24 mod 256): TRUNCATE DOWN	from 0x40000 to 0x3c9a5
> 25( 25 mod 256): SKIPPED (no operation)
> 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06	(0x1ae7 bytes)
> 27( 27 mod 256): SKIPPED (no operation)
> 28( 28 mod 256): READ     0x26b3a thru 0x36633	(0xfafa bytes)	***RRRR***
> 
> [CAUSE]
> The involved operations are:
> 
>  fallocating to largest ever: 0x40000
>  21 pollute_eof	0x24000 thru	0x2ffff	(0xc000 bytes)
>  21 falloc	from 0x351c5 to 0x40000 (0xae3b bytes)
>  28 read	0x26b3a thru	0x36633	(0xfafa bytes)
> 
> At operation #21 a pollute_eof is done, by a memory mapped write into
> range [0x24000, 0x2ffff].
> At this stage, the inode size is 0x24000, which is block aligned.
> 
> Then fallocate happens, and since it's expanding the inode, it will call
> btrfs_truncate_block() to truncate any unaligned range.
> 
> But since the inode size is already block aligned,
> btrfs_truncate_block() does nothing and exits.
> 
> However, remember that the folio at 0x20000 already has some range
> polluted. Although the polluted range will not be written back to disk,
> it still affects the page cache, causing the later operation #28 to
> read out the polluted value.
> 
> [FIX]
> Instead of exiting early from btrfs_truncate_block() if the range is
> already block aligned, do extra folio zeroing if the fs block size is
> smaller than the page size and we're truncating beyond EOF.
> 
> This is to address exactly the above case where memory mapped write can
> still leave some garbage beyond EOF.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/inode.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 08dda7b0883f..e6bb604917a6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4770,6 +4770,52 @@ static bool is_inside_block(u64 bytenr, u64 blockstart, u32 blocksize)
>  	return false;
>  }
>  
> +static int truncate_block_zero_beyond_eof(struct btrfs_inode *inode, u64 start)
> +{
> +	const pgoff_t index = start >> PAGE_SHIFT;
> +	struct address_space *mapping = inode->vfs_inode.i_mapping;
> +	struct folio *folio;
> +	u64 zero_start;
> +	u64 zero_end;
> +	int ret = 0;
> +
> +again:
> +	folio = filemap_lock_folio(mapping, index);
> +	/* No folio present. */
> +	if (IS_ERR(folio))
> +		return 0;
> +
> +	if (!folio_test_uptodate(folio)) {
> +		ret = btrfs_read_folio(NULL, folio);
> +		folio_lock(folio);
> +		if (folio->mapping != mapping) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			goto again;
> +		}
> +		if (!folio_test_uptodate(folio)) {
> +			ret = -EIO;
> +			goto out_unlock;
> +		}
> +	}
> +	folio_wait_writeback(folio);
> +
> +	/*
> +	 * We do not need to lock extents nor wait for OE, as it's already
> +	 * beyond EOF.
> +	 */
> +
> +	zero_start = max_t(u64, folio_pos(folio), start);
> +	zero_end = folio_pos(folio) + folio_size(folio) - 1;
> +	folio_zero_range(folio, zero_start - folio_pos(folio),
> +			 zero_end - zero_start + 1);
> +
> +out_unlock:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +	return ret;
> +}
> +
>  /*
>   * Handle the truncation of a fs block.
>   *
> @@ -4816,8 +4862,15 @@ int btrfs_truncate_block(struct btrfs_inode *inode, u64 from, u64 start, u64 end
>  	       from, start, end);
>  
>  	/* The range is aligned at both ends. */
> -	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize))
> +	if (IS_ALIGNED(start, blocksize) && IS_ALIGNED(end + 1, blocksize)) {
> +		/*
> +		 * For block size < page size case, we may have polluted blocks
> +		 * beyond EOF. So we also need to zero them out.
> +		 */
> +		if (end == (u64)-1 && blocksize < PAGE_SIZE)
> +			ret = truncate_block_zero_beyond_eof(inode, start);
>  		goto out;
> +	}
>  
>  	/*
>  	 * @from may not be inside the head nor tail block. In that case
> -- 
> 2.49.0
> 

* Re: [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures
  2025-04-25 22:36 [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures Qu Wenruo
  2025-04-25 22:36 ` [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases Qu Wenruo
  2025-04-25 22:36 ` [PATCH v5 2/2] btrfs: handle aligned " Qu Wenruo
@ 2025-05-05 15:33 ` David Sterba
  2025-05-06  0:05   ` Qu Wenruo
  2 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2025-05-05 15:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Apr 26, 2025 at 08:06:48AM +0930, Qu Wenruo wrote:
> [CHANGELOG]
> v5:
> - Shrink the parameter list for btrfs_truncate_block()
>   Remove the @front and @len, instead passing a new pair of @start/@end,
>   so that we can determine if @from is in the head or tail block,
>   thus no need for @front.
> 
>   This will give callers more freedom (a little too much),
>   e.g. for the following zero range/hole punch case:
> 
>     Page size is 64K, fs block size is 4K.
>     Truncation range is [6K, 58K).
> 
>     0        8K                32K                  56K      64K
>     |      |/|//////////////////////////////////////|/|      |
>            6K                                         58K
> 
>     To truncate the first block to zero out range [6K, 8K),
>     caller can pass @from = 6K, @start = 6K, @end = 58K - 1.
>     In fact, any @from inside range [6K, 8K) will work.
> 
>     To truncate the last block to zero out range [56K, 58K),
>     caller can pass @from=58K - 1, @start = 6K, @end = 58K -1.
>     Any @from inside range [56K, 58K) will also work.
> 
>     Furthermore, if aligned @from is passed in, e.g. 8K,
>     btrfs_truncate_block() will detect that there is nothing to do,
>     and exit properly.
> 
> - Only do the extra zeroing if we're truncating beyond EOF
>   Especially for the recent large folios support, we can do a lot of
>   unnecessary zeroing for a very large folio.
> 
> - Remove the lock-wait-retry loop if we're doing aligned truncation
>   beyond EOF
>   Since it's already EOF, there is no need to wait for the OE anyway.

The patches have been in linux-next but I don't think they got coverage
on the 64k/4k setups. If you don't have further updates please add the
series to for-next. Thanks.

* Re: [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures
  2025-05-05 15:33 ` [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures David Sterba
@ 2025-05-06  0:05   ` Qu Wenruo
  0 siblings, 0 replies; 7+ messages in thread
From: Qu Wenruo @ 2025-05-06  0:05 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 2025/5/6 01:03, David Sterba wrote:
> On Sat, Apr 26, 2025 at 08:06:48AM +0930, Qu Wenruo wrote:
>> [CHANGELOG]
>> v5:
>> - Shrink the parameter list for btrfs_truncate_block()
>>    Remove the @front and @len, instead passing a new pair of @start/@end,
>>    so that we can determine if @from is in the head or tail block,
>>    thus no need for @front.
>>
>>    This will give callers more freedom (a little too much),
>>    e.g. for the following zero range/hole punch case:
>>
>>      Page size is 64K, fs block size is 4K.
>>      Truncation range is [6K, 58K).
>>
>>      0        8K                32K                  56K      64K
>>      |      |/|//////////////////////////////////////|/|      |
>>             6K                                         58K
>>
>>      To truncate the first block to zero out range [6K, 8K),
>>      caller can pass @from = 6K, @start = 6K, @end = 58K - 1.
>>      In fact, any @from inside range [6K, 8K) will work.
>>
>>      To truncate the last block to zero out range [56K, 58K),
>>      caller can pass @from=58K - 1, @start = 6K, @end = 58K -1.
>>      Any @from inside range [56K, 58K) will also work.
>>
>>      Furthermore, if aligned @from is passed in, e.g. 8K,
>>      btrfs_truncate_block() will detect that there is nothing to do,
>>      and exit properly.
>>
>> - Only do the extra zeroing if we're truncating beyond EOF
>>    Especially for the recent large folios support, we can do a lot of
>>    unnecessary zeroing for a very large folio.
>>
>> - Remove the lock-wait-retry loop if we're doing aligned truncation
>>    beyond EOF
>>    Since it's already EOF, there is no need to wait for the OE anyway.
> 
> The patches have been in linux-next but I don't think they got coverage
> on the 64k/4k setups. If you don't have further updates please add the
> series to for-next. Thanks.

I'd prefer Boris to give it a final glance.

This v5 changes the parameter list, thus it is different from previous 
versions.

Although test-wise it's pretty uneventful on x86_64 and aarch64.

Thanks,
Qu

Thread overview: 7+ messages
2025-04-25 22:36 [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures Qu Wenruo
2025-04-25 22:36 ` [PATCH v5 1/2] btrfs: handle unaligned EOF truncation correctly for subpage cases Qu Wenruo
2025-05-06 17:29   ` Boris Burkov
2025-04-25 22:36 ` [PATCH v5 2/2] btrfs: handle aligned " Qu Wenruo
2025-05-06 17:25   ` Boris Burkov
2025-05-05 15:33 ` [PATCH v5 0/2] btrfs: fix beyond EOF truncation for subpage generic/363 failures David Sterba
2025-05-06  0:05   ` Qu Wenruo