public inbox for linux-btrfs@vger.kernel.org
From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases
Date: Wed, 23 Apr 2025 17:27:14 -0700	[thread overview]
Message-ID: <aAmFYrptBXjufdg5@devvm12410.ftw0.facebook.com> (raw)
In-Reply-To: <50e41f048f26b5b58f55b25c045e5dfe94a8dcfd.1745443508.git.wqu@suse.com>

On Thu, Apr 24, 2025 at 06:56:32AM +0930, Qu Wenruo wrote:
> [BUG]
> For the following fsx -e 1 run, btrfs still fails on 64K page size
> with 4K fs block size:
> 
> READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
> OFFSET      GOOD    BAD     RANGE
> 0x26b3a     0x0000  0x15b4  0x0
> operation# (mod 256) for the bad data may be 21
> [...]
> LOG DUMP (28 total operations):
> 1(  1 mod 256): SKIPPED (no operation)
> 2(  2 mod 256): SKIPPED (no operation)
> 3(  3 mod 256): SKIPPED (no operation)
> 4(  4 mod 256): SKIPPED (no operation)
> 5(  5 mod 256): WRITE    0x1ea90 thru 0x285e0	(0x9b51 bytes) HOLE
> 6(  6 mod 256): ZERO     0x1b1a8 thru 0x20bd4	(0x5a2d bytes)
> 7(  7 mod 256): FALLOC   0x22b1a thru 0x272fa	(0x47e0 bytes) INTERIOR
> 8(  8 mod 256): WRITE    0x741d thru 0x13522	(0xc106 bytes)
> 9(  9 mod 256): MAPWRITE 0x73ee thru 0xdeeb	(0x6afe bytes)
> 10( 10 mod 256): FALLOC   0xb719 thru 0xb994	(0x27b bytes) INTERIOR
> 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1	(0x2d0a bytes) to 0x25f6e thru 0x28c77
> 12( 12 mod 256): ZERO     0x1615e thru 0x1770e	(0x15b1 bytes)
> 13( 13 mod 256): SKIPPED (no operation)
> 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff	(0x8000 bytes) to 0x1000 thru 0x8fff
> 15( 15 mod 256): SKIPPED (no operation)
> 16( 16 mod 256): CLONE 0xa000 thru 0xffff	(0x6000 bytes) to 0x36000 thru 0x3bfff
> 17( 17 mod 256): ZERO     0x14adc thru 0x1b78a	(0x6caf bytes)
> 18( 18 mod 256): TRUNCATE DOWN	from 0x3c000 to 0x1e2e3	******WWWW
> 19( 19 mod 256): CLONE 0x4000 thru 0x11fff	(0xe000 bytes) to 0x16000 thru 0x23fff
> 20( 20 mod 256): FALLOC   0x311e1 thru 0x3681b	(0x563a bytes) PAST_EOF
> 21( 21 mod 256): FALLOC   0x351c5 thru 0x40000	(0xae3b bytes) EXTENDING
> 22( 22 mod 256): WRITE    0x920 thru 0x7e51	(0x7532 bytes)
> 23( 23 mod 256): COPY 0x2b58 thru 0xc508	(0x99b1 bytes) to 0x117b1 thru 0x1b161
> 24( 24 mod 256): TRUNCATE DOWN	from 0x40000 to 0x3c9a5
> 25( 25 mod 256): SKIPPED (no operation)
> 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06	(0x1ae7 bytes)
> 27( 27 mod 256): SKIPPED (no operation)
> 28( 28 mod 256): READ     0x26b3a thru 0x36633	(0xfafa bytes)	***RRRR***
> 
> [CAUSE]
> The involved operations are:
> 
>  fallocating to largest ever: 0x40000
>  21 pollute_eof	0x24000 thru	0x2ffff	(0xc000 bytes)
>  21 falloc	from 0x351c5 to 0x40000 (0xae3b bytes)
>  28 read	0x26b3a thru	0x36633	(0xfafa bytes)
> 
> At operation #21 a pollute_eof is done via a memory mapped write into
> the range [0x24000, 0x2ffff].
> At this stage the inode size is 0x24000, which is block aligned.
> 
> Then fallocate happens, and since it's expanding the inode, it will call
> btrfs_truncate_block() to truncate any unaligned range.
> 
> But since the inode size is already block aligned,
> btrfs_truncate_block() exits early without doing anything.
> 
> However, remember that the folio at 0x20000 already has part of its
> range polluted.  Although that data will never be written back to
> disk, it still stays in the page cache, causing the later operation
> #28 to read out the polluted values.
> 
> [FIX]
> Instead of exiting early from btrfs_truncate_block() if the range is
> already block aligned, do the extra folio zeroing if the fs block size
> is smaller than the page size.
> 
> This is to address exactly the above case where memory mapped write can
> still leave some garbage beyond EOF.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 88 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6f910c056819..23cd03716e57 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4760,6 +4760,80 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
>  	return ret;
>  }
>  
> +/*
> + * A helper to zero out all blocks inside range [@orig_start, @orig_end) of
> + * the target folio.
> + * The target folio is the one containing the head or tail block of the range
> + * [@from, @end].
> + *
> + * This is a special case for fs block size < page size, where even if the
> + * range [from, end] is already block aligned, blocks beyond EOF can still
> + * be polluted by a memory mapped write.
> + */

This function name is basically the same as folio_zero_range(), which
behaves very differently. I think it needs a name that captures the fact
that it is a special case for btrfs_truncate_block().

> +static int zero_range_folio(struct btrfs_inode *inode, u64 from, u64 end,
> +			    u64 orig_start, u64 orig_end,
> +			    enum btrfs_truncate_where where)
> +{
> +	const u32 blocksize = inode->root->fs_info->sectorsize;
> +	struct address_space *mapping = inode->vfs_inode.i_mapping;
> +	struct extent_io_tree *io_tree = &inode->io_tree;
> +	struct extent_state *cached_state = NULL;
> +	struct btrfs_ordered_extent *ordered;
> +	pgoff_t index = (where == BTRFS_TRUNCATE_HEAD_BLOCK) ?
> +			(from >> PAGE_SHIFT) : (end >> PAGE_SHIFT);
> +	struct folio *folio;
> +	u64 block_start;
> +	u64 block_end;
> +	u64 clamp_start;
> +	u64 clamp_end;
> +	int ret = 0;
> +
> +again:
> +	folio = filemap_lock_folio(mapping, index);
> +	/* No folio present. */
> +	if (IS_ERR(folio))
> +		return 0;
> +
> +	if (!folio_test_uptodate(folio)) {
> +		ret = btrfs_read_folio(NULL, folio);
> +		folio_lock(folio);
> +		if (folio->mapping != mapping) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			goto again;
> +		}
> +		if (!folio_test_uptodate(folio)) {
> +			ret = -EIO;
> +			goto out_unlock;
> +		}
> +	}
> +	folio_wait_writeback(folio);

This lock/read/wait pattern is the same as in btrfs_truncate_block() and
I think it could benefit from being lifted into a helper function.
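
Something like this, perhaps (completely untested sketch; the name
btrfs_lock_folio_uptodate is just a placeholder, and it reuses exactly
the calls from your patch):

```c
/*
 * Untested sketch: lock the folio at @index, make sure it is uptodate
 * and not under writeback, retrying if it got truncated away under us.
 * Returns the locked folio, NULL if no folio is present, or an ERR_PTR.
 */
static struct folio *btrfs_lock_folio_uptodate(struct address_space *mapping,
					       pgoff_t index)
{
	struct folio *folio;

again:
	folio = filemap_lock_folio(mapping, index);
	if (IS_ERR(folio))
		return NULL;

	if (!folio_test_uptodate(folio)) {
		btrfs_read_folio(NULL, folio);
		folio_lock(folio);
		if (folio->mapping != mapping) {
			folio_unlock(folio);
			folio_put(folio);
			goto again;
		}
		if (!folio_test_uptodate(folio)) {
			folio_unlock(folio);
			folio_put(folio);
			return ERR_PTR(-EIO);
		}
	}
	folio_wait_writeback(folio);
	return folio;
}
```

Then both callers would only need the NULL / IS_ERR checks before going
straight to the extent locking.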

> +
> +	clamp_start = max_t(u64, folio_pos(folio), orig_start);
> +	clamp_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
> +			  orig_end);
> +	block_start = round_down(clamp_start, blocksize);
> +	block_end = round_up(clamp_end + 1, blocksize) - 1;
> +	lock_extent(io_tree, block_start, block_end, &cached_state);
> +	ordered = btrfs_lookup_ordered_range(inode, block_start, block_end + 1 - block_start);
> +	if (ordered) {
> +		unlock_extent(io_tree, block_start, block_end, &cached_state);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +		btrfs_start_ordered_extent(ordered);
> +		btrfs_put_ordered_extent(ordered);
> +		goto again;
> +	}
> +	folio_zero_range(folio, clamp_start - folio_pos(folio),
> +			 clamp_end - clamp_start + 1);
> +	unlock_extent(io_tree, block_start, block_end, &cached_state);
> +
> +out_unlock:
> +	folio_unlock(folio);
> +	folio_put(folio);
> +	return ret;
> +}
> +
>  /*
>   * Read, zero a chunk and write a block.
>   *
> @@ -4801,8 +4875,20 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
>  	if (end == (loff_t)-1)
>  		ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK);
>  
> -	if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize))
> -		goto out;
> +	if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize)) {

Does this need to depend on `where` now or is it still necessary to
check both?

> +		/*
> +		 * The target head/tail range is already block aligned.
> +		 * If block size >= PAGE_SIZE, it is impossible for a mmap'ed
> +		 * page to contain anything other than the target block, so
> +		 * we can safely exit.
> +		 *
> +		 * Otherwise we still need to zero out the range inside the
> +		 * folio, to avoid a memory mapped write polluting beyond EOF.
> +		 */
> +		if (blocksize >= PAGE_SIZE)
> +			return 0;
> +		return zero_range_folio(inode, from, end, orig_start, orig_end, where);
> +	}
>  
>  	if (where == BTRFS_TRUNCATE_HEAD_BLOCK)
>  		block_start = round_down(from, blocksize);
> -- 
> 2.49.0

  reply	other threads:[~2025-04-24  0:27 UTC|newest]

Thread overview: 7+ messages
2025-04-23 21:26 [PATCH v4 0/2] btrfs: fix corner cases for subpage generic/363 failures Qu Wenruo
2025-04-23 21:26 ` [PATCH v4 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio Qu Wenruo
2025-04-24  0:14   ` Boris Burkov
2025-04-24  0:54     ` Qu Wenruo
2025-04-23 21:26 ` [PATCH v4 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases Qu Wenruo
2025-04-24  0:27   ` Boris Burkov [this message]
2025-04-24  0:56     ` Qu Wenruo
