public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio
Date: Fri, 11 Apr 2025 16:32:56 +0930	[thread overview]
Message-ID: <c118af0d-aec0-4bac-85b1-b280cfc971b7@suse.com> (raw)
In-Reply-To: <158513ddd60a56c01e9404919a40b8739fc80d83.1744344865.git.wqu@suse.com>



在 2025/4/11 14:44, Qu Wenruo 写道:
> [BUG]
> The following fsx sequence will fail on btrfs with 64K page size and 4K
> fs block size:
> 
>   #fsx -d -e 1 -N 4 $mnt/junk -S 36386
>   READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
>   OFFSET      GOOD    BAD     RANGE
>   0xe9ba      0x0000  0x03ac  0x0
>   operation# (mod 256) for the bad data may be 3
>   ...
>   LOG DUMP (4 total operations):
>   1(  1 mod 256): WRITE    0x6c62 thru 0x1147d	(0xa81c bytes) HOLE	***WWWW
>   2(  2 mod 256): TRUNCATE DOWN	from 0x1147e to 0x5448	******WWWW
>   3(  3 mod 256): ZERO     0x1c7aa thru 0x28fe2	(0xc839 bytes)
>   4(  4 mod 256): MAPREAD  0xe9ba thru 0x1578e	(0x6dd5 bytes)	***RRRR***
> 
> [CAUSE]
> Only 2 operations are really involved in this case:
> 
>   3 pollute_eof	0x5448 thru	0xffff	(0xabb8 bytes)
>   3 zero	from 0x1c7aa to 0x28fe3, (0xc839 bytes)
>   4 mapread	0xe9ba thru	0x1578e	(0x6dd5 bytes)
> 
> At operation 3, fsx pollutes beyond EOF, that is done by mmap()
> and write into that mmap() range beyondd EOF.
> 
> Such write will fill the range beyond EOF, but it will never reach disk
> as ranges beyond EOF will not be marked dirty nor uptodate.
> 
> Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
> our isize (which was 0x5448), we should zero out any range beyond
> EOF (0x5448).
> 
> During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
> unaligned head block.
> But that function only really zero out the block at [0x5000, 0x5fff], it
> doesn't bother any range other that that block, since those range will
> not be marked dirty nor written back.
> 
> So the range [0x6000, 0xffff] is still polluted, and later mapread()
> will return the poisoned value.
> 
> Such behavior is only exposed when page size is larger than fs block
> btrfs, as for block size == page size case the block is exactly one
> page, and fsx only checks exactly one page at EOF.
> 
> [FIX]
> Enhance btrfs_truncate_block() by:
> 
> - Force callers to pass a @start/@end combination
>    So that there will be no 0 length passed in.
> 
> - Rename the @front parameter to an enum
>    And make it matches the @start/@end parameter better by using
>    TRUNCATE_HEAD_BLOCK and TRUNCATE_TAIL_BLOCK instead.
> 
> - Pass the original unmodified range into btrfs_truncate_block()
>    There are several call sites inside btrfs_zero_range() and
>    btrfs_punch_hole() where we pass part of the original range for
>    truncating.
> 
>    This hides the original range which can lead to under or over
>    truncating.
>    Thus we have to pass the original zero/punch range.
> 
> - Make btrfs_truncate_block() to zero any involved blocks inside the folio
>    Since we have the original range, we know exactly which range inside
>    the folio that should be zeroed.
> 
>    It may cover other blocks other than the one with data space reserved,
>    but that's fine, the zeroed range will not be written back anyway.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/btrfs_inode.h | 10 +++++--
>   fs/btrfs/file.c        | 33 ++++++++++++++-------
>   fs/btrfs/inode.c       | 65 +++++++++++++++++++++++++++---------------
>   3 files changed, 73 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4e2952cf5766..21b005ddf42c 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -547,8 +547,14 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
>   		   struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
>   		   const struct fscrypt_str *name, int add_backref, u64 index);
>   int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry);
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> -			 int front);
> +
> +enum btrfs_truncate_where {
> +	BTRFS_TRUNCATE_HEAD_BLOCK,
> +	BTRFS_TRUNCATE_TAIL_BLOCK,
> +};
> +int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
> +			 u64 orig_start, u64 orig_end,
> +			 enum btrfs_truncate_where where);
>   
>   int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
>   int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e3fea1db4304..55fa91799fb6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2616,7 +2616,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   	u64 lockend;
>   	u64 tail_start;
>   	u64 tail_len;
> -	u64 orig_start = offset;
> +	const u64 orig_start = offset;
> +	const u64 orig_end = offset + len - 1;
>   	int ret = 0;
>   	bool same_block;
>   	u64 ino_size;
> @@ -2659,7 +2660,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   		if (offset < ino_size) {
>   			truncated_block = true;
>   			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,

This is a bug, since the 3rd parameter is the inclusive end now, it 
should be "offset + len - 1".

This leads to generic/008 failure on x86_64.> -						   0);
> +						   orig_start, orig_end,
> +						   BTRFS_TRUNCATE_HEAD_BLOCK);
>   		} else {
>   			ret = 0;
>   		}
> @@ -2669,7 +2671,9 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   	/* zero back part of the first block */
>   	if (offset < ino_size) {
>   		truncated_block = true;
> -		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> +		ret = btrfs_truncate_block(BTRFS_I(inode), offset, -1,
> +					   orig_start, orig_end,
> +					   BTRFS_TRUNCATE_HEAD_BLOCK);
>   		if (ret) {
>   			btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
>   			return ret;
> @@ -2706,8 +2710,9 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
>   			if (tail_start + tail_len < ino_size) {
>   				truncated_block = true;
>   				ret = btrfs_truncate_block(BTRFS_I(inode),
> -							tail_start + tail_len,
> -							0, 1);
> +						tail_start, tail_start + tail_len - 1,
> +						orig_start, orig_end,
> +						BTRFS_TRUNCATE_TAIL_BLOCK);
>   				if (ret)
>   					goto out_only_mutex;
>   			}
> @@ -2875,6 +2880,8 @@ static int btrfs_zero_range(struct inode *inode,
>   	int ret;
>   	u64 alloc_hint = 0;
>   	const u64 sectorsize = fs_info->sectorsize;
> +	const u64 orig_start = offset;
> +	const u64 orig_end = offset + len - 1;
>   	u64 alloc_start = round_down(offset, sectorsize);
>   	u64 alloc_end = round_up(offset + len, sectorsize);
>   	u64 bytes_to_reserve = 0;
> @@ -2938,7 +2945,8 @@ static int btrfs_zero_range(struct inode *inode,
>   		if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
>   			free_extent_map(em);
>   			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,

The same here.

The idea of changing @len to @end seems cursed, but there seems to be no 
better solution than this.

Will get them fixed in the next update.

Thanks,
Qu

> -						   0);
> +						   orig_start, orig_end,
> +						   BTRFS_TRUNCATE_HEAD_BLOCK);
>   			if (!ret)
>   				ret = btrfs_fallocate_update_isize(inode,
>   								   offset + len,
> @@ -2969,7 +2977,9 @@ static int btrfs_zero_range(struct inode *inode,
>   			alloc_start = round_down(offset, sectorsize);
>   			ret = 0;
>   		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> +			ret = btrfs_truncate_block(BTRFS_I(inode), offset, -1,
> +						   orig_start, orig_end,
> +						   BTRFS_TRUNCATE_HEAD_BLOCK);
>   			if (ret)
>   				goto out;
>   		} else {
> @@ -2986,8 +2996,9 @@ static int btrfs_zero_range(struct inode *inode,
>   			alloc_end = round_up(offset + len, sectorsize);
>   			ret = 0;
>   		} else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> -			ret = btrfs_truncate_block(BTRFS_I(inode), offset + len,
> -						   0, 1);
> +			ret = btrfs_truncate_block(BTRFS_I(inode), offset, offset + len - 1,
> +						   orig_start, orig_end,
> +						   BTRFS_TRUNCATE_TAIL_BLOCK);
>   			if (ret)
>   				goto out;
>   		} else {
> @@ -3107,7 +3118,9 @@ static long btrfs_fallocate(struct file *file, int mode,
>   		 * need to zero out the end of the block if i_size lands in the
>   		 * middle of a block.
>   		 */
> -		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0);
> +		ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, -1,
> +					   inode->i_size, -1,
> +					   BTRFS_TRUNCATE_HEAD_BLOCK);
>   		if (ret)
>   			goto out;
>   	}
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e283627c087d..0700a161b80e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4782,15 +4782,16 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
>    *
>    * @inode - inode that we're zeroing
>    * @from - the offset to start zeroing
> - * @len - the length to zero, 0 to zero the entire range respective to the
> - *	offset
> - * @front - zero up to the offset instead of from the offset on
> + * @end - the inclusive end to finish zeroing, can be -1 meaning truncating
> + *	  everything beyond @from.
> + * @where - Head or tail block to truncate.
>    *
>    * This will find the block for the "from" offset and cow the block and zero the
>    * part we want to zero.  This is used with truncate and hole punching.
>    */
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> -			 int front)
> +int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
> +			 u64 orig_start, u64 orig_end,
> +			 enum btrfs_truncate_where where)
>   {
>   	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>   	struct address_space *mapping = inode->vfs_inode.i_mapping;
> @@ -4800,20 +4801,30 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
>   	struct extent_changeset *data_reserved = NULL;
>   	bool only_release_metadata = false;
>   	u32 blocksize = fs_info->sectorsize;
> -	pgoff_t index = from >> PAGE_SHIFT;
> -	unsigned offset = from & (blocksize - 1);
> +	pgoff_t index = (where == BTRFS_TRUNCATE_HEAD_BLOCK) ?
> +			(from >> PAGE_SHIFT) : (end >> PAGE_SHIFT);
>   	struct folio *folio;
>   	gfp_t mask = btrfs_alloc_write_mask(mapping);
>   	size_t write_bytes = blocksize;
>   	int ret = 0;
>   	u64 block_start;
>   	u64 block_end;
> +	u64 clamp_start;
> +	u64 clamp_end;
>   
> -	if (IS_ALIGNED(offset, blocksize) &&
> -	    (!len || IS_ALIGNED(len, blocksize)))
> +	ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK ||
> +	       where == BTRFS_TRUNCATE_TAIL_BLOCK);
> +
> +	if (end == (loff_t)-1)
> +		ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK);
> +
> +	if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize))
>   		goto out;
>   
> -	block_start = round_down(from, blocksize);
> +	if (where == BTRFS_TRUNCATE_HEAD_BLOCK)
> +		block_start = round_down(from, blocksize);
> +	else
> +		block_start = round_down(end, blocksize);
>   	block_end = block_start + blocksize - 1;
>   
>   	ret = btrfs_check_data_free_space(inode, &data_reserved, block_start,
> @@ -4893,17 +4904,22 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
>   		goto out_unlock;
>   	}
>   
> -	if (offset != blocksize) {
> -		if (!len)
> -			len = blocksize - offset;
> -		if (front)
> -			folio_zero_range(folio, block_start - folio_pos(folio),
> -					 offset);
> -		else
> -			folio_zero_range(folio,
> -					 (block_start - folio_pos(folio)) + offset,
> -					 len);
> -	}
> +	/*
> +	 * Although we have only reserved space for the one block, we still should
> +	 * zero out all blocks in the original range.
> +	 * The remaining blocks normally are already holes thus no need to zero again,
> +	 * but it's possible for fs block size < page size cases to have memory mapped
> +	 * writes to pollute ranges beyond EOF.
> +	 *
> +	 * In that case although the polluted blocks beyond EOF will not reach disk,
> +	 * it still affects our page cache.
> +	 */
> +	clamp_start = max_t(u64, folio_pos(folio), orig_start);
> +	clamp_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
> +			  orig_end);
> +	folio_zero_range(folio, clamp_start - folio_pos(folio),
> +			 clamp_end - clamp_start + 1);
> +
>   	btrfs_folio_clear_checked(fs_info, folio, block_start,
>   				  block_end + 1 - block_start);
>   	btrfs_folio_set_dirty(fs_info, folio, block_start,
> @@ -5005,7 +5021,8 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>   	 * rest of the block before we expand the i_size, otherwise we could
>   	 * expose stale data.
>   	 */
> -	ret = btrfs_truncate_block(inode, oldsize, 0, 0);
> +	ret = btrfs_truncate_block(inode, oldsize, -1, oldsize, -1,
> +				   BTRFS_TRUNCATE_HEAD_BLOCK);
>   	if (ret)
>   		return ret;
>   
> @@ -7649,7 +7666,9 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
>   		btrfs_end_transaction(trans);
>   		btrfs_btree_balance_dirty(fs_info);
>   
> -		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, 0, 0);
> +		ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, -1,
> +					   inode->vfs_inode.i_size, -1,
> +					   BTRFS_TRUNCATE_HEAD_BLOCK);
>   		if (ret)
>   			goto out;
>   		trans = btrfs_start_transaction(root, 1);


  reply	other threads:[~2025-04-11  7:03 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-11  5:13 [PATCH 0/2] btrfs: fix corner cases for subpage generic/363 failures Qu Wenruo
2025-04-11  5:14 ` [PATCH 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio Qu Wenruo
2025-04-11  7:02   ` Qu Wenruo [this message]
2025-04-11  5:14 ` [PATCH 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases Qu Wenruo
2025-04-12  5:12   ` kernel test robot
2025-04-12  5:54   ` kernel test robot
2025-04-12 18:35   ` Andy Shevchenko
2025-04-14  1:20     ` Qu Wenruo
2025-04-14  3:37       ` Qu Wenruo
2025-04-14 10:40       ` Andy Shevchenko
2025-04-15 18:18         ` David Sterba
2025-04-15 18:21           ` Andy Shevchenko
2025-04-15 23:57             ` Qu Wenruo
2025-04-16  5:57               ` Andy Shevchenko
2025-04-16  8:28                 ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c118af0d-aec0-4bac-85b1-b280cfc971b7@suse.com \
    --to=wqu@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox