public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] xfs: skip background cowblock trims on inodes open for write
@ 2024-09-03 12:47 Brian Foster
  2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Brian Foster @ 2024-09-03 12:47 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick Wong

The background blockgc scanner runs on a 5m interval by default and
trims preallocation (post-eof and cow fork) from inodes that are
otherwise idle. Idle effectively means that iolock can be acquired
without blocking and that the inode has no dirty pagecache or I/O in
flight.

This simple mechanism and heuristic has worked fairly well for
post-eof speculative preallocations. Support for reflink and COW
fork preallocations came sometime later and plugged into the same
mechanism, with similar heuristics. Some recent testing has shown
that COW fork preallocation may be notably more sensitive to blockgc
processing than post-eof preallocation, however.

For example, consider an 8GB reflinked file with a COW extent size
hint of 1MB. A worst case fully randomized overwrite of this file
results in ~8k extents of an average size of ~1MB. If the same
workload is interrupted a couple times for blockgc processing
(assuming the file goes idle), the resulting extent count explodes
to over 100k extents with an average size <100kB. This is
significantly worse than ideal and essentially defeats the COW
extent size hint mechanism.

While this particular test is instrumented, it reflects a fairly
reasonable pattern in practice where random I/Os might spread out
over a large period of time with varying periods of (in)activity.
For example, consider a cloned disk image file for a VM or container
with long uptime and variable and bursty usage. A background blockgc
scan that races and processes the image file when it happens to be
clean and idle can have a significant effect on the future
fragmentation level of the file, even when still in use.

To help combat this, update the heuristic to skip cowblocks inodes
that are currently opened for write access during non-sync blockgc
scans. This allows COW fork preallocations to persist for as long as
possible unless otherwise needed for functional purposes (i.e. a
sync scan), the file is idle and closed, or the inode is being
evicted from cache. While here, update the comments to help
distinguish performance oriented heuristics from the logic that
exists to maintain functional correctness.

Suggested-by: Darrick Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

v2:
- Reorder logic and update comments in xfs_prep_free_cowblocks().
v1: https://lore.kernel.org/linux-xfs/20240214165231.84925-1-bfoster@redhat.com/

 fs/xfs/xfs_icache.c | 31 +++++++++++++++++++++++--------
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index cf629302d48e..900a6277d931 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1241,14 +1241,17 @@ xfs_inode_clear_eofblocks_tag(
 }
 
 /*
- * Set ourselves up to free CoW blocks from this file.  If it's already clean
- * then we can bail out quickly, but otherwise we must back off if the file
- * is undergoing some kind of write.
+ * Prepare to free COW fork blocks from an inode.
  */
 static bool
 xfs_prep_free_cowblocks(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	struct xfs_icwalk	*icw)
 {
+	bool			sync;
+
+	sync = icw && (icw->icw_flags & XFS_ICWALK_FLAG_SYNC);
+
 	/*
 	 * Just clear the tag if we have an empty cow fork or none at all. It's
 	 * possible the inode was fully unshared since it was originally tagged.
@@ -1260,9 +1263,21 @@ xfs_prep_free_cowblocks(
 	}
 
 	/*
-	 * If the mapping is dirty or under writeback we cannot touch the
-	 * CoW fork.  Leave it alone if we're in the midst of a directio.
+	 * A cowblocks trim of an inode can have a significant effect on
+	 * fragmentation even when a reasonable COW extent size hint is set.
+	 * Therefore, we prefer to not process cowblocks unless they are clean
+	 * and idle. We can never process a cowblocks inode that is dirty or has
+	 * in-flight I/O under any circumstances, because outstanding writeback
+	 * or dio expects targeted COW fork blocks exist through write
+	 * completion where they can be remapped into the data fork.
+	 *
+	 * Therefore, the heuristic used here is to never process inodes
+	 * currently opened for write from background (i.e. non-sync) scans. For
+	 * sync scans, use the pagecache/dio state of the inode to ensure we
+	 * never free COW fork blocks out from under pending I/O.
 	 */
+	if (!sync && inode_is_open_for_write(VFS_I(ip)))
+		return false;
 	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
 	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
 	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
@@ -1298,7 +1313,7 @@ xfs_inode_free_cowblocks(
 	if (!xfs_iflags_test(ip, XFS_ICOWBLOCKS))
 		return 0;
 
-	if (!xfs_prep_free_cowblocks(ip))
+	if (!xfs_prep_free_cowblocks(ip, icw))
 		return 0;
 
 	if (!xfs_icwalk_match(ip, icw))
@@ -1327,7 +1342,7 @@ xfs_inode_free_cowblocks(
 	 * Check again, nobody else should be able to dirty blocks or change
 	 * the reflink iflag now that we have the first two locks held.
 	 */
-	if (xfs_prep_free_cowblocks(ip))
+	if (xfs_prep_free_cowblocks(ip, icw))
 		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
 	return ret;
 }
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare
  2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
@ 2024-09-06 11:40 ` Brian Foster
  2024-09-17 18:31   ` Darrick J. Wong
  2024-09-17 18:24 ` [PATCH v2] xfs: skip background cowblock trims on inodes open for write Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2024-09-06 11:40 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick Wong

fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.

For example, consider the following command sequence:

xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
	-c "pwrite 0 32k" -c "funshare 0 1k" <file>

This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have written back.

This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:

$ xfs_io -c "fiemap -v" <file>
<file>:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   ...
   1: [8..511]:        hole               504
   ...
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  cd cd cd cd cd cd cd cd  ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  00 00 00 00 00 00 00 00  ........

To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.

Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Here's another COW issue I came across via some unshare testing. A quick
hack to enable unshare in fsx uncovered it. I'll follow up with a proper
patch for that.

I'm sending this as a 2/1 here just to reflect patch order in my local
tree. Also note that I haven't explicitly tested the fixes commit, but a
quick test to switch back to the old full flush behavior on latest
master also makes the problem go away, so I suspect that's where the
regression was introduced.

Brian

 fs/xfs/xfs_icache.c  |  8 +-------
 fs/xfs/xfs_reflink.c |  3 +++
 fs/xfs/xfs_reflink.h | 19 +++++++++++++++++++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 900a6277d931..a1b34e6ccfe2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1278,13 +1278,7 @@ xfs_prep_free_cowblocks(
 	 */
 	if (!sync && inode_is_open_for_write(VFS_I(ip)))
 		return false;
-	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
-	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
-	    atomic_read(&VFS_I(ip)->i_dio_count))
-		return false;
-
-	return true;
+	return xfs_can_free_cowblocks(ip);
 }
 
 /*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fde6ec8092f..5bf6682e701b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1595,6 +1595,9 @@ xfs_reflink_clear_inode_flag(
 
 	ASSERT(xfs_is_reflink_inode(ip));
 
+	if (!xfs_can_free_cowblocks(ip))
+		return 0;
+
 	error = xfs_reflink_inode_has_shared_extents(*tpp, ip, &needs_flag);
 	if (error || needs_flag)
 		return error;
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index fb55e4ce49fa..4a58e4533671 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -6,6 +6,25 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+/*
+ * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
+ * to do so when an inode has dirty cache or I/O in-flight, even if no shared
+ * extents exist in the data fork, because outstanding I/O may target blocks
+ * that were speculatively allocated to the COW fork.
+ */
+static inline bool
+xfs_can_free_cowblocks(struct xfs_inode *ip)
+{
+	struct inode *inode = VFS_I(ip);
+
+	if ((inode->i_state & I_DIRTY_PAGES) ||
+	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY) ||
+	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
+	    atomic_read(&inode->i_dio_count))
+		return false;
+	return true;
+}
+
 extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
 		struct xfs_bmbt_irec *irec, bool *shared);
 int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] xfs: skip background cowblock trims on inodes open for write
  2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
  2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
@ 2024-09-17 18:24 ` Darrick J. Wong
  2024-09-30 17:09 ` Darrick J. Wong
  2024-10-11  7:42 ` Carlos Maiolino
  3 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2024-09-17 18:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Sep 03, 2024 at 08:47:13AM -0400, Brian Foster wrote:
> The background blockgc scanner runs on a 5m interval by default and
> trims preallocation (post-eof and cow fork) from inodes that are
> otherwise idle. Idle effectively means that iolock can be acquired
> without blocking and that the inode has no dirty pagecache or I/O in
> flight.
> 
> This simple mechanism and heuristic has worked fairly well for
> post-eof speculative preallocations. Support for reflink and COW
> fork preallocations came sometime later and plugged into the same
> mechanism, with similar heuristics. Some recent testing has shown
> that COW fork preallocation may be notably more sensitive to blockgc
> processing than post-eof preallocation, however.
> 
> For example, consider an 8GB reflinked file with a COW extent size
> hint of 1MB. A worst case fully randomized overwrite of this file
> results in ~8k extents of an average size of ~1MB. If the same
> workload is interrupted a couple times for blockgc processing
> (assuming the file goes idle), the resulting extent count explodes
> to over 100k extents with an average size <100kB. This is
> significantly worse than ideal and essentially defeats the COW
> extent size hint mechanism.
> 
> While this particular test is instrumented, it reflects a fairly
> reasonable pattern in practice where random I/Os might spread out
> over a large period of time with varying periods of (in)activity.
> For example, consider a cloned disk image file for a VM or container
> with long uptime and variable and bursty usage. A background blockgc
> scan that races and processes the image file when it happens to be
> clean and idle can have a significant effect on the future
> fragmentation level of the file, even when still in use.
> 
> To help combat this, update the heuristic to skip cowblocks inodes
> that are currently opened for write access during non-sync blockgc
> scans. This allows COW fork preallocations to persist for as long as
> possible unless otherwise needed for functional purposes (i.e. a
> sync scan), the file is idle and closed, or the inode is being
> evicted from cache. While here, update the comments to help
> distinguish performance oriented heuristics from the logic that
> exists to maintain functional correctness.
> 
> Suggested-by: Darrick Wong <djwong@kernel.org>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> v2:
> - Reorder logic and update comments in xfs_prep_free_cowblocks().
> v1: https://lore.kernel.org/linux-xfs/20240214165231.84925-1-bfoster@redhat.com/
> 
>  fs/xfs/xfs_icache.c | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index cf629302d48e..900a6277d931 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1241,14 +1241,17 @@ xfs_inode_clear_eofblocks_tag(
>  }
>  
>  /*
> - * Set ourselves up to free CoW blocks from this file.  If it's already clean
> - * then we can bail out quickly, but otherwise we must back off if the file
> - * is undergoing some kind of write.
> + * Prepare to free COW fork blocks from an inode.
>   */
>  static bool
>  xfs_prep_free_cowblocks(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	struct xfs_icwalk	*icw)
>  {
> +	bool			sync;
> +
> +	sync = icw && (icw->icw_flags & XFS_ICWALK_FLAG_SYNC);
> +
>  	/*
>  	 * Just clear the tag if we have an empty cow fork or none at all. It's
>  	 * possible the inode was fully unshared since it was originally tagged.
> @@ -1260,9 +1263,21 @@ xfs_prep_free_cowblocks(
>  	}
>  
>  	/*
> -	 * If the mapping is dirty or under writeback we cannot touch the
> -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> +	 * A cowblocks trim of an inode can have a significant effect on
> +	 * fragmentation even when a reasonable COW extent size hint is set.
> +	 * Therefore, we prefer to not process cowblocks unless they are clean
> +	 * and idle. We can never process a cowblocks inode that is dirty or has
> +	 * in-flight I/O under any circumstances, because outstanding writeback
> +	 * or dio expects targeted COW fork blocks exist through write
> +	 * completion where they can be remapped into the data fork.
> +	 *
> +	 * Therefore, the heuristic used here is to never process inodes
> +	 * currently opened for write from background (i.e. non-sync) scans. For
> +	 * sync scans, use the pagecache/dio state of the inode to ensure we
> +	 * never free COW fork blocks out from under pending I/O.

Sounds good to me!
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	 */
> +	if (!sync && inode_is_open_for_write(VFS_I(ip)))
> +		return false;
>  	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
>  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
>  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> @@ -1298,7 +1313,7 @@ xfs_inode_free_cowblocks(
>  	if (!xfs_iflags_test(ip, XFS_ICOWBLOCKS))
>  		return 0;
>  
> -	if (!xfs_prep_free_cowblocks(ip))
> +	if (!xfs_prep_free_cowblocks(ip, icw))
>  		return 0;
>  
>  	if (!xfs_icwalk_match(ip, icw))
> @@ -1327,7 +1342,7 @@ xfs_inode_free_cowblocks(
>  	 * Check again, nobody else should be able to dirty blocks or change
>  	 * the reflink iflag now that we have the first two locks held.
>  	 */
> -	if (xfs_prep_free_cowblocks(ip))
> +	if (xfs_prep_free_cowblocks(ip, icw))
>  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
>  	return ret;
>  }
> -- 
> 2.45.0
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare
  2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
@ 2024-09-17 18:31   ` Darrick J. Wong
  2024-09-18 12:22     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2024-09-17 18:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Sep 06, 2024 at 07:40:51AM -0400, Brian Foster wrote:
> fallocate unshare mode explicitly breaks extent sharing. When a
> command completes, it checks the data fork for any remaining shared
> extents to determine whether the reflink inode flag and COW fork
> preallocation can be removed. This logic doesn't consider in-core
> pagecache and I/O state, however, which means we can unsafely remove
> COW fork blocks that are still needed under certain conditions.
> 
> For example, consider the following command sequence:
> 
> xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
> 	-c "pwrite 0 32k" -c "funshare 0 1k" <file>
> 
> This allocates a data block at offset 0, shares it, and then
> overwrites it with a larger buffered write. The overwrite triggers
> COW fork preallocation, 32 blocks by default, which maps the entire
> 32k write to delalloc in the COW fork. All but the shared block at
> offset 0 remains hole mapped in the data fork. The unshare command
> redirties and flushes the folio at offset 0, removing the only
> shared extent from the inode. Since the inode no longer maps shared
> extents, unshare purges the COW fork before the remaining 28k may
> have written back.
> 
> This leaves dirty pagecache backed by holes, which writeback quietly
> skips, thus leaving clean, non-zeroed pagecache over holes in the
> file. To verify, fiemap shows holes in the first 32k of the file and
> reads return different data across a remount:
> 
> $ xfs_io -c "fiemap -v" <file>
> <file>:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    ...
>    1: [8..511]:        hole               504
>    ...
> $ xfs_io -c "pread -v 4k 8" <file>
> 00001000:  cd cd cd cd cd cd cd cd  ........
> $ umount <mnt>; mount <dev> <mnt>
> $ xfs_io -c "pread -v 4k 8" <file>
> 00001000:  00 00 00 00 00 00 00 00  ........
> 
> To avoid this problem, make unshare follow the same rules used for
> background cowblock scanning and never purge the COW fork for inodes
> with dirty pagecache or in-flight I/O.
> 
> Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Question: Does xfs_repair report orphaned cow staging blocks after this?
There's a longstanding bug that I've seen in the long soak xfs/286 VM
where we slowly leak cow fork blocks (~80 per ~1 billion fsxops over 7
days).

Anyhow this looks correct on its own so
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
> 
> Here's another COW issue I came across via some unshare testing. A quick
> hack to enable unshare in fsx uncovered it. I'll follow up with a proper
> patch for that.
> 
> I'm sending this as a 2/1 here just to reflect patch order in my local
> tree. Also note that I haven't explicitly tested the fixes commit, but a
> quick test to switch back to the old full flush behavior on latest
> master also makes the problem go away, so I suspect that's where the
> regression was introduced.
> 
> Brian
> 
>  fs/xfs/xfs_icache.c  |  8 +-------
>  fs/xfs/xfs_reflink.c |  3 +++
>  fs/xfs/xfs_reflink.h | 19 +++++++++++++++++++
>  3 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 900a6277d931..a1b34e6ccfe2 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1278,13 +1278,7 @@ xfs_prep_free_cowblocks(
>  	 */
>  	if (!sync && inode_is_open_for_write(VFS_I(ip)))
>  		return false;
> -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> -	    atomic_read(&VFS_I(ip)->i_dio_count))
> -		return false;
> -
> -	return true;
> +	return xfs_can_free_cowblocks(ip);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 6fde6ec8092f..5bf6682e701b 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1595,6 +1595,9 @@ xfs_reflink_clear_inode_flag(
>  
>  	ASSERT(xfs_is_reflink_inode(ip));
>  
> +	if (!xfs_can_free_cowblocks(ip))
> +		return 0;
> +
>  	error = xfs_reflink_inode_has_shared_extents(*tpp, ip, &needs_flag);
>  	if (error || needs_flag)
>  		return error;
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index fb55e4ce49fa..4a58e4533671 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -6,6 +6,25 @@
>  #ifndef __XFS_REFLINK_H
>  #define __XFS_REFLINK_H 1
>  
> +/*
> + * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
> + * to do so when an inode has dirty cache or I/O in-flight, even if no shared
> + * extents exist in the data fork, because outstanding I/O may target blocks
> + * that were speculatively allocated to the COW fork.
> + */
> +static inline bool
> +xfs_can_free_cowblocks(struct xfs_inode *ip)
> +{
> +	struct inode *inode = VFS_I(ip);
> +
> +	if ((inode->i_state & I_DIRTY_PAGES) ||
> +	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY) ||
> +	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> +	    atomic_read(&inode->i_dio_count))
> +		return false;
> +	return true;
> +}
> +
>  extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
>  		struct xfs_bmbt_irec *irec, bool *shared);
>  int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
> -- 
> 2.45.0
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare
  2024-09-17 18:31   ` Darrick J. Wong
@ 2024-09-18 12:22     ` Brian Foster
  0 siblings, 0 replies; 7+ messages in thread
From: Brian Foster @ 2024-09-18 12:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Sep 17, 2024 at 11:31:42AM -0700, Darrick J. Wong wrote:
> On Fri, Sep 06, 2024 at 07:40:51AM -0400, Brian Foster wrote:
> > fallocate unshare mode explicitly breaks extent sharing. When a
> > command completes, it checks the data fork for any remaining shared
> > extents to determine whether the reflink inode flag and COW fork
> > preallocation can be removed. This logic doesn't consider in-core
> > pagecache and I/O state, however, which means we can unsafely remove
> > COW fork blocks that are still needed under certain conditions.
> > 
> > For example, consider the following command sequence:
> > 
> > xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
> > 	-c "pwrite 0 32k" -c "funshare 0 1k" <file>
> > 
> > This allocates a data block at offset 0, shares it, and then
> > overwrites it with a larger buffered write. The overwrite triggers
> > COW fork preallocation, 32 blocks by default, which maps the entire
> > 32k write to delalloc in the COW fork. All but the shared block at
> > offset 0 remains hole mapped in the data fork. The unshare command
> > redirties and flushes the folio at offset 0, removing the only
> > shared extent from the inode. Since the inode no longer maps shared
> > extents, unshare purges the COW fork before the remaining 28k may
> > have written back.
> > 
> > This leaves dirty pagecache backed by holes, which writeback quietly
> > skips, thus leaving clean, non-zeroed pagecache over holes in the
> > file. To verify, fiemap shows holes in the first 32k of the file and
> > reads return different data across a remount:
> > 
> > $ xfs_io -c "fiemap -v" <file>
> > <file>:
> >  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> >    ...
> >    1: [8..511]:        hole               504
> >    ...
> > $ xfs_io -c "pread -v 4k 8" <file>
> > 00001000:  cd cd cd cd cd cd cd cd  ........
> > $ umount <mnt>; mount <dev> <mnt>
> > $ xfs_io -c "pread -v 4k 8" <file>
> > 00001000:  00 00 00 00 00 00 00 00  ........
> > 
> > To avoid this problem, make unshare follow the same rules used for
> > background cowblock scanning and never purge the COW fork for inodes
> > with dirty pagecache or in-flight I/O.
> > 
> > Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Question: Does xfs_repair report orphaned cow staging blocks after this?
> There's a longstanding bug that I've seen in the long soak xfs/286 VM
> where we slowly leak cow fork blocks (~80 per ~1 billion fsxops over 7
> days).
> 

I've not seen that, at least in the test case I have. I think what's
happening here is more that we clean up the COW fork correctly from an
accounting standpoint, but we do so prematurely because the pagecache is
dirty in ranges that are still only backed by COW fork blocks.

> Anyhow this looks correct on its own so
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 

Thanks!

Brian

> --D
> 
> > ---
> > 
> > Here's another COW issue I came across via some unshare testing. A quick
> > hack to enable unshare in fsx uncovered it. I'll follow up with a proper
> > patch for that.
> > 
> > I'm sending this as a 2/1 here just to reflect patch order in my local
> > tree. Also note that I haven't explicitly tested the fixes commit, but a
> > quick test to switch back to the old full flush behavior on latest
> > master also makes the problem go away, so I suspect that's where the
> > regression was introduced.
> > 
> > Brian
> > 
> >  fs/xfs/xfs_icache.c  |  8 +-------
> >  fs/xfs/xfs_reflink.c |  3 +++
> >  fs/xfs/xfs_reflink.h | 19 +++++++++++++++++++
> >  3 files changed, 23 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 900a6277d931..a1b34e6ccfe2 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1278,13 +1278,7 @@ xfs_prep_free_cowblocks(
> >  	 */
> >  	if (!sync && inode_is_open_for_write(VFS_I(ip)))
> >  		return false;
> > -	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
> > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > -	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > -	    atomic_read(&VFS_I(ip)->i_dio_count))
> > -		return false;
> > -
> > -	return true;
> > +	return xfs_can_free_cowblocks(ip);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 6fde6ec8092f..5bf6682e701b 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -1595,6 +1595,9 @@ xfs_reflink_clear_inode_flag(
> >  
> >  	ASSERT(xfs_is_reflink_inode(ip));
> >  
> > +	if (!xfs_can_free_cowblocks(ip))
> > +		return 0;
> > +
> >  	error = xfs_reflink_inode_has_shared_extents(*tpp, ip, &needs_flag);
> >  	if (error || needs_flag)
> >  		return error;
> > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > index fb55e4ce49fa..4a58e4533671 100644
> > --- a/fs/xfs/xfs_reflink.h
> > +++ b/fs/xfs/xfs_reflink.h
> > @@ -6,6 +6,25 @@
> >  #ifndef __XFS_REFLINK_H
> >  #define __XFS_REFLINK_H 1
> >  
> > +/*
> > + * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
> > + * to do so when an inode has dirty cache or I/O in-flight, even if no shared
> > + * extents exist in the data fork, because outstanding I/O may target blocks
> > + * that were speculatively allocated to the COW fork.
> > + */
> > +static inline bool
> > +xfs_can_free_cowblocks(struct xfs_inode *ip)
> > +{
> > +	struct inode *inode = VFS_I(ip);
> > +
> > +	if ((inode->i_state & I_DIRTY_PAGES) ||
> > +	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY) ||
> > +	    mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> > +	    atomic_read(&inode->i_dio_count))
> > +		return false;
> > +	return true;
> > +}
> > +
> >  extern int xfs_reflink_trim_around_shared(struct xfs_inode *ip,
> >  		struct xfs_bmbt_irec *irec, bool *shared);
> >  int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
> > -- 
> > 2.45.0
> > 
> > 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] xfs: skip background cowblock trims on inodes open for write
  2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
  2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
  2024-09-17 18:24 ` [PATCH v2] xfs: skip background cowblock trims on inodes open for write Darrick J. Wong
@ 2024-09-30 17:09 ` Darrick J. Wong
  2024-10-11  7:42 ` Carlos Maiolino
  3 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2024-09-30 17:09 UTC (permalink / raw)
  To: Brian Foster, Carlos Maiolino; +Cc: linux-xfs

On Tue, Sep 03, 2024 at 08:47:13AM -0400, Brian Foster wrote:
> The background blockgc scanner runs on a 5m interval by default and
> trims preallocation (post-eof and cow fork) from inodes that are
> otherwise idle. Idle effectively means that iolock can be acquired
> without blocking and that the inode has no dirty pagecache or I/O in
> flight.

Hey, can we get this data corruption fix staged for 6.12-rc2, please?
Yesterday's fstests push contained the exerciser for this bug.

--D

> This simple mechanism and heuristic has worked fairly well for
> post-eof speculative preallocations. Support for reflink and COW
> fork preallocations came sometime later and plugged into the same
> mechanism, with similar heuristics. Some recent testing has shown
> that COW fork preallocation may be notably more sensitive to blockgc
> processing than post-eof preallocation, however.
> 
> For example, consider an 8GB reflinked file with a COW extent size
> hint of 1MB. A worst case fully randomized overwrite of this file
> results in ~8k extents of an average size of ~1MB. If the same
> workload is interrupted a couple times for blockgc processing
> (assuming the file goes idle), the resulting extent count explodes
> to over 100k extents with an average size <100kB. This is
> significantly worse than ideal and essentially defeats the COW
> extent size hint mechanism.
> 
> While this particular test is instrumented, it reflects a fairly
> reasonable pattern in practice where random I/Os might spread out
> over a large period of time with varying periods of (in)activity.
> For example, consider a cloned disk image file for a VM or container
> with long uptime and variable and bursty usage. A background blockgc
> scan that races and processes the image file when it happens to be
> clean and idle can have a significant effect on the future
> fragmentation level of the file, even when still in use.
> 
> To help combat this, update the heuristic to skip cowblocks inodes
> that are currently opened for write access during non-sync blockgc
> scans. This allows COW fork preallocations to persist for as long as
> possible unless otherwise needed for functional purposes (i.e. a
> sync scan), the file is idle and closed, or the inode is being
> evicted from cache. While here, update the comments to help
> distinguish performance oriented heuristics from the logic that
> exists to maintain functional correctness.
> 
> Suggested-by: Darrick Wong <djwong@kernel.org>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> v2:
> - Reorder logic and update comments in xfs_prep_free_cowblocks().
> v1: https://lore.kernel.org/linux-xfs/20240214165231.84925-1-bfoster@redhat.com/
> 
>  fs/xfs/xfs_icache.c | 31 +++++++++++++++++++++++--------
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index cf629302d48e..900a6277d931 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1241,14 +1241,17 @@ xfs_inode_clear_eofblocks_tag(
>  }
>  
>  /*
> - * Set ourselves up to free CoW blocks from this file.  If it's already clean
> - * then we can bail out quickly, but otherwise we must back off if the file
> - * is undergoing some kind of write.
> + * Prepare to free COW fork blocks from an inode.
>   */
>  static bool
>  xfs_prep_free_cowblocks(
> -	struct xfs_inode	*ip)
> +	struct xfs_inode	*ip,
> +	struct xfs_icwalk	*icw)
>  {
> +	bool			sync;
> +
> +	sync = icw && (icw->icw_flags & XFS_ICWALK_FLAG_SYNC);
> +
>  	/*
>  	 * Just clear the tag if we have an empty cow fork or none at all. It's
>  	 * possible the inode was fully unshared since it was originally tagged.
> @@ -1260,9 +1263,21 @@ xfs_prep_free_cowblocks(
>  	}
>  
>  	/*
> -	 * If the mapping is dirty or under writeback we cannot touch the
> -	 * CoW fork.  Leave it alone if we're in the midst of a directio.
> +	 * A cowblocks trim of an inode can have a significant effect on
> +	 * fragmentation even when a reasonable COW extent size hint is set.
> +	 * Therefore, we prefer to not process cowblocks unless they are clean
> +	 * and idle. We can never process a cowblocks inode that is dirty or has
> +	 * in-flight I/O under any circumstances, because outstanding writeback
> +	 * or dio expects targeted COW fork blocks exist through write
> +	 * completion where they can be remapped into the data fork.
> +	 *
> +	 * Therefore, the heuristic used here is to never process inodes
> +	 * currently opened for write from background (i.e. non-sync) scans. For
> +	 * sync scans, use the pagecache/dio state of the inode to ensure we
> +	 * never free COW fork blocks out from under pending I/O.
>  	 */
> +	if (!sync && inode_is_open_for_write(VFS_I(ip)))
> +		return false;
>  	if ((VFS_I(ip)->i_state & I_DIRTY_PAGES) ||
>  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_DIRTY) ||
>  	    mapping_tagged(VFS_I(ip)->i_mapping, PAGECACHE_TAG_WRITEBACK) ||
> @@ -1298,7 +1313,7 @@ xfs_inode_free_cowblocks(
>  	if (!xfs_iflags_test(ip, XFS_ICOWBLOCKS))
>  		return 0;
>  
> -	if (!xfs_prep_free_cowblocks(ip))
> +	if (!xfs_prep_free_cowblocks(ip, icw))
>  		return 0;
>  
>  	if (!xfs_icwalk_match(ip, icw))
> @@ -1327,7 +1342,7 @@ xfs_inode_free_cowblocks(
>  	 * Check again, nobody else should be able to dirty blocks or change
>  	 * the reflink iflag now that we have the first two locks held.
>  	 */
> -	if (xfs_prep_free_cowblocks(ip))
> +	if (xfs_prep_free_cowblocks(ip, icw))
>  		ret = xfs_reflink_cancel_cow_range(ip, 0, NULLFILEOFF, false);
>  	return ret;
>  }
> -- 
> 2.45.0
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] xfs: skip background cowblock trims on inodes open for write
  2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
                   ` (2 preceding siblings ...)
  2024-09-30 17:09 ` Darrick J. Wong
@ 2024-10-11  7:42 ` Carlos Maiolino
  3 siblings, 0 replies; 7+ messages in thread
From: Carlos Maiolino @ 2024-10-11  7:42 UTC (permalink / raw)
  To: linux-xfs, Brian Foster; +Cc: Darrick Wong

On Tue, 03 Sep 2024 08:47:13 -0400, Brian Foster wrote:
> The background blockgc scanner runs on a 5m interval by default and
> trims preallocation (post-eof and cow fork) from inodes that are
> otherwise idle. Idle effectively means that iolock can be acquired
> without blocking and that the inode has no dirty pagecache or I/O in
> flight.
> 
> This simple mechanism and heuristic has worked fairly well for
> post-eof speculative preallocations. Support for reflink and COW
> fork preallocations came sometime later and plugged into the same
> mechanism, with similar heuristics. Some recent testing has shown
> that COW fork preallocation may be notably more sensitive to blockgc
> processing than post-eof preallocation, however.
> 
> [...]

Applied to for-next, thanks!

[1/1] xfs: skip background cowblock trims on inodes open for write
      commit: 90a71daaf73f5d39bb0cbb3c7ab6af942fe6233e
[2/2] xfs: don't free cowblocks from under dirty pagecache on unshare
      commit: 4390f019ad7866c3791c3d768d2ff185d89e8ebe

Best regards,
-- 
Carlos Maiolino <cem@kernel.org>


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-10-11  7:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-03 12:47 [PATCH v2] xfs: skip background cowblock trims on inodes open for write Brian Foster
2024-09-06 11:40 ` [PATCH 2/1] xfs: don't free cowblocks from under dirty pagecache on unshare Brian Foster
2024-09-17 18:31   ` Darrick J. Wong
2024-09-18 12:22     ` Brian Foster
2024-09-17 18:24 ` [PATCH v2] xfs: skip background cowblock trims on inodes open for write Darrick J. Wong
2024-09-30 17:09 ` Darrick J. Wong
2024-10-11  7:42 ` Carlos Maiolino

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox