From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases
Date: Wed, 23 Apr 2025 17:27:14 -0700 [thread overview]
Message-ID: <aAmFYrptBXjufdg5@devvm12410.ftw0.facebook.com> (raw)
In-Reply-To: <50e41f048f26b5b58f55b25c045e5dfe94a8dcfd.1745443508.git.wqu@suse.com>
On Thu, Apr 24, 2025 at 06:56:32AM +0930, Qu Wenruo wrote:
> [BUG]
> For the following fsx -e 1 run, the btrfs still fails the run on 64K
> page size with 4K fs block size:
>
> READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
> OFFSET GOOD BAD RANGE
> 0x26b3a 0x0000 0x15b4 0x0
> operation# (mod 256) for the bad data may be 21
> [...]
> LOG DUMP (28 total operations):
> 1( 1 mod 256): SKIPPED (no operation)
> 2( 2 mod 256): SKIPPED (no operation)
> 3( 3 mod 256): SKIPPED (no operation)
> 4( 4 mod 256): SKIPPED (no operation)
> 5( 5 mod 256): WRITE 0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE
> 6( 6 mod 256): ZERO 0x1b1a8 thru 0x20bd4 (0x5a2d bytes)
> 7( 7 mod 256): FALLOC 0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR
> 8( 8 mod 256): WRITE 0x741d thru 0x13522 (0xc106 bytes)
> 9( 9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes)
> 10( 10 mod 256): FALLOC 0xb719 thru 0xb994 (0x27b bytes) INTERIOR
> 11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77
> 12( 12 mod 256): ZERO 0x1615e thru 0x1770e (0x15b1 bytes)
> 13( 13 mod 256): SKIPPED (no operation)
> 14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff
> 15( 15 mod 256): SKIPPED (no operation)
> 16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff
> 17( 17 mod 256): ZERO 0x14adc thru 0x1b78a (0x6caf bytes)
> 18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW
> 19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff
> 20( 20 mod 256): FALLOC 0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF
> 21( 21 mod 256): FALLOC 0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING
> 22( 22 mod 256): WRITE 0x920 thru 0x7e51 (0x7532 bytes)
> 23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161
> 24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5
> 25( 25 mod 256): SKIPPED (no operation)
> 26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes)
> 27( 27 mod 256): SKIPPED (no operation)
> 28( 28 mod 256): READ 0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR***
>
> [CAUSE]
> The involved operations are:
>
> fallocating to largest ever: 0x40000
> 21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes)
> 21 falloc from 0x351c5 to 0x40000 (0xae3b bytes)
> 28 read 0x26b3a thru 0x36633 (0xfafa bytes)
>
> At operation #21 a pollute_eof is done, by a memory mapped write into
> the range [0x24000, 0x2ffff].
> At this stage, the inode size is 0x24000, which is block aligned.
>
> Then fallocate happens, and since it's expanding the inode, it will call
> btrfs_truncate_block() to truncate any unaligned range.
>
> But since the inode size is already block aligned,
> btrfs_truncate_block() does nothing and exits.
>
> However, remember that the folio at 0x20000 already has part of its
> range polluted. Although those bytes will never be written back to
> disk, they still affect the page cache, causing the later operation #28
> to read out the polluted values.
>
> [FIX]
> Instead of exiting early from btrfs_truncate_block() when the range is
> already block aligned, do extra folio zeroing if the fs block size is
> smaller than the page size.
>
> This is to address exactly the above case where memory mapped write can
> still leave some garbage beyond EOF.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> fs/btrfs/inode.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6f910c056819..23cd03716e57 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -4760,6 +4760,80 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
> return ret;
> }
>
> +/*
> + * A helper to zero out all blocks inside the range [@orig_start, @orig_end] of
> + * the target folio.
> + * The target folio is the one containing the head or tail block of the range
> + * [@from, @end].
> + *
> + * This is a special case for fs block size < page size, where even if the
> + * range [from, end] is already block aligned, blocks beyond EOF can still be
> + * polluted by a memory mapped write.
> + */
This function name is basically the same as folio_zero_range(), which
behaves very differently. I think it needs a name that captures the fact
that it is a special case for btrfs_truncate_block().
> +static int zero_range_folio(struct btrfs_inode *inode, u64 from, u64 end,
> + u64 orig_start, u64 orig_end,
> + enum btrfs_truncate_where where)
> +{
> + const u32 blocksize = inode->root->fs_info->sectorsize;
> + struct address_space *mapping = inode->vfs_inode.i_mapping;
> + struct extent_io_tree *io_tree = &inode->io_tree;
> + struct extent_state *cached_state = NULL;
> + struct btrfs_ordered_extent *ordered;
> + pgoff_t index = (where == BTRFS_TRUNCATE_HEAD_BLOCK) ?
> + (from >> PAGE_SHIFT) : (end >> PAGE_SHIFT);
> + struct folio *folio;
> + u64 block_start;
> + u64 block_end;
> + u64 clamp_start;
> + u64 clamp_end;
> + int ret = 0;
> +
> +again:
> + folio = filemap_lock_folio(mapping, index);
> + /* No folio present. */
> + if (IS_ERR(folio))
> + return 0;
> +
> + if (!folio_test_uptodate(folio)) {
> + ret = btrfs_read_folio(NULL, folio);
> + folio_lock(folio);
> + if (folio->mapping != mapping) {
> + folio_unlock(folio);
> + folio_put(folio);
> + goto again;
> + }
> + if (!folio_test_uptodate(folio)) {
> + ret = -EIO;
> + goto out_unlock;
> + }
> + }
> + folio_wait_writeback(folio);
This lock/read/wait pattern is the same as in btrfs_truncate_block and I
think could benefit from being lifted into a function.
> +
> + clamp_start = max_t(u64, folio_pos(folio), orig_start);
> + clamp_end = min_t(u64, folio_pos(folio) + folio_size(folio) - 1,
> + orig_end);
> + block_start = round_down(clamp_start, blocksize);
> + block_end = round_up(clamp_end + 1, blocksize) - 1;
> + lock_extent(io_tree, block_start, block_end, &cached_state);
> + ordered = btrfs_lookup_ordered_range(inode, block_start, block_end + 1 - block_start);
> + if (ordered) {
> + unlock_extent(io_tree, block_start, block_end, &cached_state);
> + folio_unlock(folio);
> + folio_put(folio);
> + btrfs_start_ordered_extent(ordered);
> + btrfs_put_ordered_extent(ordered);
> + goto again;
> + }
> + folio_zero_range(folio, clamp_start - folio_pos(folio),
> + clamp_end - clamp_start + 1);
> + unlock_extent(io_tree, block_start, block_end, &cached_state);
> +
> +out_unlock:
> + folio_unlock(folio);
> + folio_put(folio);
> + return ret;
> +}
> +
> /*
> * Read, zero a chunk and write a block.
> *
> @@ -4801,8 +4875,20 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t end,
> if (end == (loff_t)-1)
> ASSERT(where == BTRFS_TRUNCATE_HEAD_BLOCK);
>
> - if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize))
> - goto out;
> + if (IS_ALIGNED(from, blocksize) && IS_ALIGNED(end + 1, blocksize)) {
Does this need to depend on `where` now or is it still necessary to
check both?
> + /*
> + * The target head/tail range is already block aligned.
> + * If block size >= PAGE_SIZE, it's impossible to mmap a page
> + * containing anything other than the target block, so we can
> + * safely exit.
> + *
> + * Otherwise we still need to zero out the range inside the folio
> + * to avoid memory mapped writes polluting blocks beyond EOF.
> + */
> + if (blocksize >= PAGE_SIZE)
> + return 0;
> + return zero_range_folio(inode, from, end, orig_start, orig_end, where);
> + }
>
> if (where == BTRFS_TRUNCATE_HEAD_BLOCK)
> block_start = round_down(from, blocksize);
> --
> 2.49.0
Thread overview: 7+ messages
2025-04-23 21:26 [PATCH v4 0/2] btrfs: fix corner cases for subpage generic/363 failures Qu Wenruo
2025-04-23 21:26 ` [PATCH v4 1/2] btrfs: make btrfs_truncate_block() to zero involved blocks in a folio Qu Wenruo
2025-04-24 0:14 ` Boris Burkov
2025-04-24 0:54 ` Qu Wenruo
2025-04-23 21:26 ` [PATCH v4 2/2] btrfs: make btrfs_truncate_block() zero folio range for certain subpage corner cases Qu Wenruo
2025-04-24 0:27 ` Boris Burkov [this message]
2025-04-24 0:56 ` Qu Wenruo