From: "Darrick J. Wong" <djwong@kernel.org>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare
Date: Tue, 17 Sep 2024 11:31:42 -0700 [thread overview]
Message-ID: <20240917183142.GI182194@frogsfrogsfrogs> (raw)
In-Reply-To: <20240906114051.120743-1-bfoster@redhat.com>
On Fri, Sep 06, 2024 at 07:40:51AM -0400, Brian Foster wrote:
> fallocate unshare mode explicitly breaks extent sharing. When a
> command completes, it checks the data fork for any remaining shared
> extents to determine whether the reflink inode flag and COW fork
> preallocation can be removed. This logic doesn't consider in-core
> pagecache and I/O state, however, which means we can unsafely remove
> COW fork blocks that are still needed under certain conditions.
>
> For example, consider the following command sequence:
>
> xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
> -c "pwrite 0 32k" -c "funshare 0 1k" <file>
>
> This allocates a data block at offset 0, shares it, and then
> overwrites it with a larger buffered write. The overwrite triggers
> COW fork preallocation, 32 blocks by default, which maps the entire
> 32k write to delalloc in the COW fork. All but the shared block at
> offset 0 remains hole mapped in the data fork. The unshare command
> redirties and flushes the folio at offset 0, removing the only
> shared extent from the inode. Since the inode no longer maps shared
> extents, unshare purges the COW fork before the remaining 28k may
> have written back.
>
> This leaves dirty pagecache backed by holes, which writeback quietly
> skips, thus leaving clean, non-zeroed pagecache over holes in the
> file. To verify, fiemap shows holes in the first 32k of the file and
> reads return different data across a remount:
>
> $ xfs_io -c "fiemap -v" <file>
> <file>:
> EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
> ...
> 1: [8..511]: hole 504
> ...
> $ xfs_io -c "pread -v 4k 8" <file>
> 00001000: cd cd cd cd cd cd cd cd ........
> $ umount <mnt>; mount <dev> <mnt>
> $ xfs_io -c "pread -v 4k 8" <file>
> 00001000: 00 00 00 00 00 00 00 00 ........
>
> To avoid this problem, make unshare follow the same rules used for
> background cowblock scanning and never purge the COW fork for inodes
> with dirty pagecache or in-flight I/O.
>
> Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
> Signed-off-by: Brian Foster <bfoster@redhat.com>
Question: Does xfs_repair report orphaned cow staging blocks after this?
There's a longstanding bug that I've seen in the long soak xfs/286 VM
where we slowly leak cow fork blocks (~80 per ~1 billion fsxops over 7
days).
Anyhow this looks correct on its own so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> ---
>
> Here's another COW issue I came across via some unshare testing. A quick
> hack to enable unshare in fsx uncovered it. I'll follow up with a proper
> patch for that.
>
> I'm sending this as a 2/1 here just to reflect patch order in my local
> tree. Also note that I haven't explicitly tested the fixes commit, but a
> quick test to switch back to the old full flush behavior on latest
> master also makes the problem go away, so I suspect that's where the
> regression was introduced.
>
> Brian
>
> fs/xfs/xfs_icache.c | 8 +-------
> fs/xfs/xfs_reflink.c | 3 +++
> fs/xfs/xfs_reflink.h | 19 +++++++++++++++++++
> 3 files changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 900a6277d931..a1b34e6ccfe2 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1278,13 +1278,7 @@ xfs_prep_free_cowblocks(
> */
> if (!sync && inode_is_open_for_write(VFS_I(ip)))
> return false;
> - if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> - mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> - mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> - atomic_read(&VFS_I(ip)->i_dio_count))
> - return false;
> -
> - return true;
> + return xfs_can_free_cowblocks(ip);
> }
>
> /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 6fde6ec8092f..5bf6682e701b 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1595,6 +1595,9 @@ xfs_reflink_clear_inode_flag(
>
> ASSERT(xfs_is_reflink_inode(ip));
>
> + if (!xfs_can_free_cowblocks(ip))
> + return 0;
> +
> error = xfs_reflink_inode_has_shared_extents(*tpp, ip, &needs_flag);
> if (error || needs_flag)
> return error;
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index fb55e4ce49fa..4a58e4533671 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -6,6 +6,25 @@
> #ifndef __XFS_REFLINK_H
> #define __XFS_REFLINK_H 1
>
> +/*
> + * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
> + * to do so when an inode has dirty cache or I/O in-flight, even if no shared
> + * extents exist in the data fork, because outstanding I/O may target blocks
> + * that were speculatively allocated to the COW fork.
> + */
> +static inline bool
> +xfs_can_free_cowblocks(struct xfs_inode *ip)
> +{
> + struct inode *inode = VFS_I(ip);
> +
> + if ((inode->i_state & I_DIRTY_PAGES) ||
> + mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY) ||
> + mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> + atomic_read(&inode->i_dio_count))
> + return false;
> + return true;
> +}
> +
> extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> struct xfs_bmbt_irec *irec, bool *shared);
> int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
> --
> 2.45.0
>
>
next prev parent reply other threads:[~2024-09-17 18:31 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
2024-09-17 18:31 ` Darrick J. Wong [this message]
2024-09-18 12:22 ` Brian Foster
2024-09-17 18:24 ` [PATCH v2] xfs: skip background cowblock trims on inodes open for write Darrick J. Wong
2024-09-30 17:09 ` Darrick J. Wong
2024-10-11 7:42 ` Carlos Maiolino
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240917183142.GI182194@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox