From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio
Date: Fri, 11 Apr 2025 16:32:56 +0930 [thread overview]
Message-ID: <c118af0d-aec0-4bac-85b1-b280cfc971b7@suse.com> (raw)
In-Reply-To: <158513ddd60a56c01e9404919a40b8739fc80d83.1744344865.git.wqu@suse.com>
在 2025/4/11 14:44, Qu Wenruo 写道:
> [BUG]
> The following fsx sequence will fail on btrfs with 64K page size and 4K
> fs block size:
>
> #fsx -d -e 1 -N 4 $mnt/junk -S 36386
> READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
> OFFSET GOOD BAD RANGE
> 0xe9ba 0x0000 0x03ac 0x0
> operation# (mod 256) for the bad data may be 3
> ...
> LOG DUMP (4 total operations):
> 1( 1 mod 256): WRITE 0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
> 2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
> 3( 3 mod 256): ZERO 0x1c7aa thru 0x28fe2 (0xc839 bytes)
> 4( 4 mod 256): MAPREAD 0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***
>
> [CAUSE]
> Only 2 operations are really involved in this case:
>
> 3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
> 3 zero from 0x1c7aa to 0x28fe3, (0xc839 bytes)
> 4 mapread 0xe9ba thru 0x1578e (0x6dd5 bytes)
>
> At operation 3, fsx pollutes beyond EOF, that is done by mmap()
> and write into that mmap() range beyondd EOF.
>
> Such write will fill the range beyond EOF, but it will never reach disk
> as ranges beyond EOF will not be marked dirty nor uptodate.
>
> Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
> our isize (which was 0x5448), we should zero out any range beyond
> EOF (0x5448).
>
> During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
> unaligned head block.
> But that function only really zero out the block at [0x5000, 0x5fff], it
> doesn't bother any range other that that block, since those range will
> not be marked dirty nor written back.
>
> So the range [0x6000, 0xffff] is still polluted, and later mapread()
> will return the poisoned value.
>
> Such behavior is only exposed when page size is larger than fs block
> btrfs, as for block size == page size case the block is exactly one
> page, and fsx only checks exactly one page at EOF.
>
> [FIX]
> Enhance btrfs_truncate_block() by:
>
> - Force callers to pass a @start/@end combination
> So that there will be no 0 length passed in.
>
> - Rename the @front parameter to an enum
> And make it matches the @start/@end parameter better by using
> TRUNCATE_HEAD_BLOCK and TRUNCATE_TAIL_BLOCK instead.
>
> - Pass the original unmodified range into btrfs_truncate_block()
> There are several call sites inside btrfs_zero_range() and
> btrfs_punch_hole() where we pass part of the original range for
> truncating.
>
> This hides the original range which can lead to under or over
> truncating.
> Thus we have to pass the original zero/punch range.
>
> - Make btrfs_truncate_block() to zero any involved blocks inside the folio
> Since we have the original range, we know exactly which range inside
> the folio that should be zeroed.
>
> It may cover other blocks other than the one with data space reserved,
> but that's fine, the zeroed range will not be written back anyway.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/btrfs_inode.h | 10 +++++--
> fs/btrfs/file.c | 33 ++++++++++++++-------
> fs/btrfs/inode.c | 65 +++++++++++++++++++++++++++---------------
> 3 files changed, 73 insertions(+), 35 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 4e2952cf5766..21b005ddf42c 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -547,8 +547,14 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
> struct btrfs_inode *parent_inode, struct btrfs_inode *inode,
> const struct fscrypt_str *name, int add_backref, u64 index);
> int btrfs_delete_subvolume(struct btrfs_inode *dir, struct dentry *dentry);
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> - int front);
> +
> +enum btrfs_truncate_where {
> + BTRFS_TRUNCATE_HEAD_BLOCK,
> + BTRFS_TRUNCATE_TAIL_BLOCK,
> +};
> +int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
> + u64 orig_start, u64 orig_end,
> + enum btrfs_truncate_where where);
>
> int btrfs_start_delalloc_snapshot(struct btrfs_root *root, bool in_reclaim_context);
> int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index e3fea1db4304..55fa91799fb6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2616,7 +2616,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
> u64 lockend;
> u64 tail_start;
> u64 tail_len;
> - u64 orig_start = offset;
> + const u64 orig_start = offset;
> + const u64 orig_end = offset + len - 1;
> int ret = 0;
> bool same_block;
> u64 ino_size;
> @@ -2659,7 +2660,8 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
> if (offset < ino_size) {
> truncated_block = true;
> ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
This is a bug, since the 3rd parameter is the inclusive end now, it
should be "offset + len - 1".
This leads to generic/008 failure on x86_64.> - 0);
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> } else {
> ret = 0;
> }
> @@ -2669,7 +2671,9 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
> /* zero back part of the first block */
> if (offset < ino_size) {
> truncated_block = true;
> - ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> + ret = btrfs_truncate_block(BTRFS_I(inode), offset, -1,
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (ret) {
> btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
> return ret;
> @@ -2706,8 +2710,9 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
> if (tail_start + tail_len < ino_size) {
> truncated_block = true;
> ret = btrfs_truncate_block(BTRFS_I(inode),
> - tail_start + tail_len,
> - 0, 1);
> + tail_start, tail_start + tail_len - 1,
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_TAIL_BLOCK);
> if (ret)
> goto out_only_mutex;
> }
> @@ -2875,6 +2880,8 @@ static int btrfs_zero_range(struct inode *inode,
> int ret;
> u64 alloc_hint = 0;
> const u64 sectorsize = fs_info->sectorsize;
> + const u64 orig_start = offset;
> + const u64 orig_end = offset + len - 1;
> u64 alloc_start = round_down(offset, sectorsize);
> u64 alloc_end = round_up(offset + len, sectorsize);
> u64 bytes_to_reserve = 0;
> @@ -2938,7 +2945,8 @@ static int btrfs_zero_range(struct inode *inode,
> if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
> free_extent_map(em);
> ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
The same here.
The idea of changing @len to @end seems cursed, but there seems to be no
better solution than this.
Will get them fixed in the next update.
Thanks,
Qu
> - 0);
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (!ret)
> ret = btrfs_fallocate_update_isize(inode,
> offset + len,
> @@ -2969,7 +2977,9 @@ static int btrfs_zero_range(struct inode *inode,
> alloc_start = round_down(offset, sectorsize);
> ret = 0;
> } else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> - ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
> + ret = btrfs_truncate_block(BTRFS_I(inode), offset, -1,
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (ret)
> goto out;
> } else {
> @@ -2986,8 +2996,9 @@ static int btrfs_zero_range(struct inode *inode,
> alloc_end = round_up(offset + len, sectorsize);
> ret = 0;
> } else if (ret == RANGE_BOUNDARY_WRITTEN_EXTENT) {
> - ret = btrfs_truncate_block(BTRFS_I(inode), offset + len,
> - 0, 1);
> + ret = btrfs_truncate_block(BTRFS_I(inode), offset, offset + len - 1,
> + orig_start, orig_end,
> + BTRFS_TRUNCATE_TAIL_BLOCK);
> if (ret)
> goto out;
> } else {
> @@ -3107,7 +3118,9 @@ static long btrfs_fallocate(struct file *file, int mode,
> * need to zero out the end of the block if i_size lands in the
> * middle of a block.
> */
> - ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, 0, 0);
> + ret = btrfs_truncate_block(BTRFS_I(inode), inode->i_size, -1,
> + inode->i_size, -1,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (ret)
> goto out;
> }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e283627c087d..0700a161b80e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4782,15 +4782,16 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
> *
> * @inode - inode that we're zeroing
> * @from - the offset to start zeroing
> - * @len - the length to zero, 0 to zero the entire range respective to the
> - * offset
> - * @front - zero up to the offset instead of from the offset on
> + * @end - the inclusive end to finish zeroing, can be -1 meaning truncating
> + * everything beyond @from.
> + * @where - Head or tail block to truncate.
> *
> * This will find the block for the "from" offset and cow the block and zero the
> * part we want to zero. This is used with truncate and hole punching.
> */
> -int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> - int front)
> +int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
> + u64 orig_start, u64 orig_end,
> + enum btrfs_truncate_where where)
> {
> struct btrfs_fs_info *fs_info = inode->root->fs_info;
> struct address_space *mapping = inode->vfs_inode.i_mapping;
> @@ -4800,20 +4801,30 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> struct extent_changeset *data_reserved = NULL;
> bool only_release_metadata = false;
> u32 blocksize = fs_info->sectorsize;
> - pgoff_t index = from >> PAGE_SHIFT;
> - unsigned offset = from & (blocksize - 1);
> + pgoff_t index = (where == BTRFS_TRUNCATE_HEAD_BLOCK) ?
> + (from >> PAGE_SHIFT) : (end >> PAGE_SHIFT);
> struct folio *folio;
> gfp_t mask = btrfs_alloc_write_mask(mapping);
> size_t write_bytes = blocksize;
> int ret = 0;
> u64 block_start;
> u64 block_end;
> + u64 clamp_start;
> + u64 clamp_end;
>
> - if (IS_ALIGNED(offset, blocksize) &&
> - (!len || IS_ALIGNED(len, blocksize)))
> + ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK ||
> + where == BTRFS_TRUNCATE_TAIL_BLOCK);
> +
> + if (end == (loff_t)-1)
> + ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK);
> +
> + if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize))
> goto out;
>
> - block_start = round_down(from, blocksize);
> + if (where == BTRFS_TRUNCATE_HEAD_BLOCK)
> + block_start = round_down(from, blocksize);
> + else
> + block_start = round_down(end, blocksize);
> block_end = block_start + blocksize - 1;
>
> ret = btrfs_check_data_free_space(inode, &data_reserved, block_start,
> @@ -4893,17 +4904,22 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len,
> goto out_unlock;
> }
>
> - if (offset != blocksize) {
> - if (!len)
> - len = blocksize - offset;
> - if (front)
> - folio_zero_range(folio, block_start - folio_pos(folio),
> - offset);
> - else
> - folio_zero_range(folio,
> - (block_start - folio_pos(folio)) + offset,
> - len);
> - }
> + /*
> + * Although we have only reserved space for the one block, we still should
> + * zero out all blocks in the original range.
> + * The remaining blocks normally are already holes thus no need to zero again,
> + * but it's possible for fs block size < page size cases to have memory mapped
> + * writes to pollute ranges beyond EOF.
> + *
> + * In that case although the polluted blocks beyond EOF will not reach disk,
> + * it still affects our page cache.
> + */
> + clamp_start = max_t(u64, folio_pos(folio), orig_start);
> + clamp_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
> + orig_end);
> + folio_zero_range(folio, clamp_start - folio_pos(folio),
> + clamp_end - clamp_start + 1);
> +
> btrfs_folio_clear_checked(fs_info, folio, block_start,
> block_end + 1 - block_start);
> btrfs_folio_set_dirty(fs_info, folio, block_start,
> @@ -5005,7 +5021,8 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
> * rest of the block before we expand the i_size, otherwise we could
> * expose stale data.
> */
> - ret = btrfs_truncate_block(inode, oldsize, 0, 0);
> + ret = btrfs_truncate_block(inode, oldsize, -1, oldsize, -1,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (ret)
> return ret;
>
> @@ -7649,7 +7666,9 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
> btrfs_end_transaction(trans);
> btrfs_btree_balance_dirty(fs_info);
>
> - ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, 0, 0);
> + ret = btrfs_truncate_block(inode, inode->vfs_inode.i_size, -1,
> + inode->vfs_inode.i_size, -1,
> + BTRFS_TRUNCATE_HEAD_BLOCK);
> if (ret)
> goto out;
> trans = btrfs_start_transaction(root, 1);
next prev parent reply other threads:[~2025-04-11 7:03 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-11 5:13 [PATCH 0/2] btrfs: fix corner cases for subpage generic/363 failures Qu Wenruo
2025-04-11 5:14 ` [PATCH 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio Qu Wenruo
2025-04-11 7:02 ` Qu Wenruo [this message]
2025-04-11 5:14 ` [PATCH 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases Qu Wenruo
2025-04-12 5:12 ` kernel test robot
2025-04-12 5:54 ` kernel test robot
2025-04-12 18:35 ` Andy Shevchenko
2025-04-14 1:20 ` Qu Wenruo
2025-04-14 3:37 ` Qu Wenruo
2025-04-14 10:40 ` Andy Shevchenko
2025-04-15 18:18 ` David Sterba
2025-04-15 18:21 ` Andy Shevchenko
2025-04-15 23:57 ` Qu Wenruo
2025-04-16 5:57 ` Andy Shevchenko
2025-04-16 8:28 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c118af0d-aec0-4bac-85b1-b280cfc971b7@suse.com \
--to=wqu@suse.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox