[PATCHSET 5.16-rcX 0/2] xfs: fix data corruption when cycling ro/rw mounts

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET 5.16-rcX 0/2] xfs: fix data corruption when cycling ro/rw mounts
@ 2021-12-07 18:35 Darrick J. Wong
  2021-12-07 18:35 ` [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly Darrick J. Wong
  2021-12-07 18:35 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  0 siblings, 2 replies; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-07 18:35 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, wen.gang.wang

Hi all,

As part of a large customer escalation, I've been combing through the
XFS copy on write code to try to find sources of (mostly) silent data
corruption.  I found a nasty problem in the remount code, wherein a ro
remount can race with file reader threads and fail to clean out cached
inode COW forks.  A subsequent rw remount calls the COW staging extent
recovery code, which frees the space but does not update the records in
the cached inode COW forks.  This leads to massive fs corruption.

The first patch in this series is the critical fix for the race
condition.  The second patch is defensive in that it moves the COW
staging extent recovery so that it always happens at mount time to
protect us against future screwups.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=remount-fixes-5.16
---
 fs/xfs/xfs_mount.c   |   37 ++++++++++++++++++++++++++++---------
 fs/xfs/xfs_reflink.c |    4 +++-
 fs/xfs/xfs_super.c   |   23 +++++++++++------------
 3 files changed, 42 insertions(+), 22 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly
  2021-12-07 18:35 [PATCHSET 5.16-rcX 0/2] xfs: fix data corruption when cycling ro/rw mounts Darrick J. Wong
@ 2021-12-07 18:35 ` Darrick J. Wong
  2021-12-07 21:33   ` Dave Chinner
  2021-12-07 18:35 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-07 18:35 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, wen.gang.wang

From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
synchronously, which causes xfs_icwalk to return to inodes that were
skipped because the blockgc code couldn't take the IOLOCK.  This is safe
to do here because the VFS has already prohibited new writer threads.

Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_super.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e21459f9923a..0c07a4aef3b9 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1765,7 +1765,10 @@ static int
 xfs_remount_ro(
 	struct xfs_mount	*mp)
 {
-	int error;
+	struct xfs_icwalk	icw = {
+		.icw_flags	= XFS_ICWALK_FLAG_SYNC,
+	};
+	int			error;

 	/*
 	 * Cancel background eofb scanning so it cannot race with the final
@@ -1773,8 +1776,13 @@ xfs_remount_ro(
 	 */
 	xfs_blockgc_stop(mp);

-	/* Get rid of any leftover CoW reservations... */
-	error = xfs_blockgc_free_space(mp, NULL);
+	/*
+	 * Clean out all remaining COW staging extents.  This extra step is
+	 * done synchronously because the background blockgc worker could have
+	 * raced with a reader thread and failed to grab an IOLOCK.  In that
+	 * case, the inode could still have post-eof and COW blocks.
+	 */
+	error = xfs_blockgc_free_space(mp, &icw);
 	if (error) {
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 		return error;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-07 18:35 [PATCHSET 5.16-rcX 0/2] xfs: fix data corruption when cycling ro/rw mounts Darrick J. Wong
  2021-12-07 18:35 ` [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly Darrick J. Wong
@ 2021-12-07 18:35 ` Darrick J. Wong
  2021-12-07 22:21   ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-07 18:35 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, wen.gang.wang

From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_mount.c   |   37 ++++++++++++++++++++++++++++---------
 fs/xfs/xfs_reflink.c |    4 +++-
 fs/xfs/xfs_super.c   |    9 ---------
 3 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 359109b6f0d3..064ff89a4557 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -918,6 +918,34 @@ xfs_mountfs(
 		xfs_qm_mount_quotas(mp);
 	}

+	/*
+	 * Recover any CoW staging blocks that are still referenced by the
+	 * ondisk refcount metadata.  The ondisk metadata does not track which
+	 * inode created the staging extent, which means that we don't have an
+	 * easy means to figure out if a given staging extent is referenced by
+	 * a cached inode or is a leftover from a previous unclean shutdown,
+	 * short of scanning the entire inode cache to construct a bitmap of
+	 * actually stale extents.
+	 *
+	 * During mount, we know that zero files have been exposed to user
+	 * modifications, which means that there cannot possibly be any live
+	 * staging extents.  Therefore, it is safe to free them all right now,
+	 * even if we're performing a readonly mount.
+	 *
+	 * This cannot be deferred this to rw remount time if we're performing
+	 * a readonly mount (as XFS once did) until there's an interlock with
+	 * cached inodes.
+	 */
+	if (!xfs_has_norecovery(mp)) {
+		error = xfs_reflink_recover_cow(mp);
+		if (error) {
+			xfs_err(mp,
+	"Error %d recovering leftover CoW allocations.", error);
+			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+			goto out_quota;
+		}
+	}
+
 	/*
 	 * Now we are mounted, reserve a small amount of unused space for
 	 * privileged transactions. This is needed so that transaction
@@ -936,15 +964,6 @@ xfs_mountfs(
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");

-		/* Recover any CoW blocks that never got remapped. */
-		error = xfs_reflink_recover_cow(mp);
-		if (error) {
-			xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-			goto out_quota;
-		}
-
 		/* Reserve AG blocks for future btree expansion. */
 		error = xfs_fs_reserve_ag_blocks(mp);
 		if (error && error != -ENOSPC)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cb0edb1d68ef..a571489ef0bd 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -749,7 +749,9 @@ xfs_reflink_end_cow(
 }

 /*
- * Free leftover CoW reservations that didn't get cleaned out.
+ * Free leftover CoW reservations that didn't get cleaned out.  This function
+ * does not coordinate with cached inode COW forks, which means that callers
+ * must ensure there are no COW staging extents attached to any cached inodes.
  */
 int
 xfs_reflink_recover_cow(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 0c07a4aef3b9..4649e7429264 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1739,15 +1739,6 @@ xfs_remount_rw(
 	 */
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-
-	/* Recover any CoW blocks that never got remapped. */
-	error = xfs_reflink_recover_cow(mp);
-	if (error) {
-		xfs_err(mp,
-			"Error %d recovering leftover CoW allocations.", error);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return error;
-	}
 	xfs_blockgc_start(mp);

 	/* Create the per-AG metadata reservation pool .*/

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly
  2021-12-07 18:35 ` [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly Darrick J. Wong
@ 2021-12-07 21:33   ` Dave Chinner
  2021-12-08  1:38     ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2021-12-07 21:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, wen.gang.wang

On Tue, Dec 07, 2021 at 10:35:45AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
> 
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
> 
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
> 
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
> 
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
> synchronously, which causes xfs_icwalk to return to inodes that were
> skipped because the blockgc code couldn't take the IOLOCK.  This is safe
> to do here because the VFS has already prohibited new writer threads.
> 
> Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_super.c |   14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)

Looks good, I went through the analysis yesterday when you mentioned
it on #xfs. Minor nit below, otherwise:

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e21459f9923a..0c07a4aef3b9 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1765,7 +1765,10 @@ static int
>  xfs_remount_ro(
>  	struct xfs_mount	*mp)
>  {
> -	int error;
> +	struct xfs_icwalk	icw = {
> +		.icw_flags	= XFS_ICWALK_FLAG_SYNC,
> +	};
> +	int			error;
>  
>  	/*
>  	 * Cancel background eofb scanning so it cannot race with the final
> @@ -1773,8 +1776,13 @@ xfs_remount_ro(
>  	 */
>  	xfs_blockgc_stop(mp);
>  
> -	/* Get rid of any leftover CoW reservations... */
> -	error = xfs_blockgc_free_space(mp, NULL);
> +	/*
> +	 * Clean out all remaining COW staging extents.  This extra step is
> +	 * done synchronously because the background blockgc worker could have
> +	 * raced with a reader thread and failed to grab an IOLOCK.  In that
> +	 * case, the inode could still have post-eof and COW blocks.
> +	 */

Rather than describe how inodes might be skipped here, the
constraint we are operating under should be described. That is:

	/*
	 * We need to clear out all remaining COW staging extents so
	 * that we don't leave inodes requiring modifications during
	 * inactivation and reclaim on a read-only mount. We must
	 * check and process every inode currently in memory, hence
	 * this requires a synchronous inode cache scan to be
	 * executed.
	 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-07 18:35 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
@ 2021-12-07 22:21   ` Dave Chinner
  2021-12-08  1:50     ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2021-12-07 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, wen.gang.wang

On Tue, Dec 07, 2021 at 10:35:51AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
> 
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
> 
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
> 
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
> 
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
> 
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
> 
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.
> 
> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

The mechanism looks fine, comments below.

> ---
>  fs/xfs/xfs_mount.c   |   37 ++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_reflink.c |    4 +++-
>  fs/xfs/xfs_super.c   |    9 ---------
>  3 files changed, 31 insertions(+), 19 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 359109b6f0d3..064ff89a4557 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -918,6 +918,34 @@ xfs_mountfs(
>  		xfs_qm_mount_quotas(mp);
>  	}
>  
> +	/*
> +	 * Recover any CoW staging blocks that are still referenced by the
> +	 * ondisk refcount metadata.  The ondisk metadata does not track which
> +	 * inode created the staging extent, which means that we don't have an
> +	 * easy means to figure out if a given staging extent is referenced by
> +	 * a cached inode or is a leftover from a previous unclean shutdown,
> +	 * short of scanning the entire inode cache to construct a bitmap of
> +	 * actually stale extents.

This really isn't commentary that is need for recovery - it should
be in the comment above xfs_reflink_recover_cow().

> +	 *
> +	 * During mount, we know that zero files have been exposed to user
> +	 * modifications, which means that there cannot possibly be any live
> +	 * staging extents.  Therefore, it is safe to free them all right now,
> +	 * even if we're performing a readonly mount.
> +	 *
> +	 * This cannot be deferred this to rw remount time if we're performing
> +	 * a readonly mount (as XFS once did) until there's an interlock with
> +	 * cached inodes.
> +	 */

I'm not sure this last bit is necessary here - it's largely covered by
the commit message and the comment added to xfs_reflink_end_cow().

It seems to me that the comment here can be reduced to:

	/*
	 * Recover any CoW staging blocks that are still referenced by the
	 * ondisk refcount metadata.  During mount there cannot be any live
	 * staging extents as we have not run any user modifications.
	 * Therefore, it is safe to free them all right now, even on a
	 * read-only mount.
	 */

And the rest of the stuff about live staging extents vs on disk metadata state,
ro->rw remounts, etc all goes into an more complete explanation of the
limitations of xfs_reflink_recover_cow() in the comment above
xfs_reflink_recover_cow()....

> +	if (!xfs_has_norecovery(mp)) {
> +		error = xfs_reflink_recover_cow(mp);
> +		if (error) {
> +			xfs_err(mp,
> +	"Error %d recovering leftover CoW allocations.", error);
> +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +			goto out_quota;
> +		}
> +	}

This strikes me as something xfs_log_mount_finish() should do, as
it's a post-replay recovery operation and should be done before
we push the AIL and force the log at the end of log recovery....

> +
>  	/*
>  	 * Now we are mounted, reserve a small amount of unused space for
>  	 * privileged transactions. This is needed so that transaction
> @@ -936,15 +964,6 @@ xfs_mountfs(
>  			xfs_warn(mp,
>  	"Unable to allocate reserve blocks. Continuing without reserve pool.");
>  
> -		/* Recover any CoW blocks that never got remapped. */
> -		error = xfs_reflink_recover_cow(mp);
> -		if (error) {
> -			xfs_err(mp,
> -	"Error %d recovering leftover CoW allocations.", error);
> -			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -			goto out_quota;
> -		}
> -
>  		/* Reserve AG blocks for future btree expansion. */
>  		error = xfs_fs_reserve_ag_blocks(mp);
>  		if (error && error != -ENOSPC)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index cb0edb1d68ef..a571489ef0bd 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -749,7 +749,9 @@ xfs_reflink_end_cow(
>  }
>  
>  /*
> - * Free leftover CoW reservations that didn't get cleaned out.
> + * Free leftover CoW reservations that didn't get cleaned out.  This function
> + * does not coordinate with cached inode COW forks, which means that callers
> + * must ensure there are no COW staging extents attached to any cached inodes.
>   */

Side note - I've noticed a lot of new comments are in the form of "X does Y,
which means A needs to do B". Speaking for myself, I find it much easier to
understand comments in the "rule before reason" form. i.e. "A needs to do B
because X does Y".

The rule that we need to follow (A needs to do B) is the important thing
readers need to understand and so it should be the primary object in the
sentence/paragraph. They don't necessarily need to understand exactly
why that rule needs to be followed, but they need to know about the rule.
Putting the rule after a chunk of complex reasoning means the importance/clarity
of the rule is lost/less obvious.

Rewritten in "rule before reason" form:

"Callers must ensure there are no COW staging extents attached to any cached
inodes as this function does not co-ordinate with cached inode COW forks."

Now it's obvious that until there is an interlock between
xfs_reflink_recover_cow() and cached inodes, this rule needs to be followed.
There's also no need to add commentary to say "this rule needs to be followed
until there's an interlock with cached inodes" because that's obvious from the
rule and the reason for the rule...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly
  2021-12-07 21:33   ` Dave Chinner
@ 2021-12-08  1:38     ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-08  1:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, wen.gang.wang

On Wed, Dec 08, 2021 at 08:33:16AM +1100, Dave Chinner wrote:
> On Tue, Dec 07, 2021 at 10:35:45AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > As part of multiple customer escalations due to file data corruption
> > after copy on write operations, I wrote some fstests that use fsstress
> > to hammer on COW to shake things loose.  Regrettably, I caught some
> > filesystem shutdowns due to incorrect rmap operations with the following
> > loop:
> > 
> > mount <filesystem>				# (0)
> > fsstress <run only readonly ops> &		# (1)
> > while true; do
> > 	fsstress <run all ops>
> > 	mount -o remount,ro			# (2)
> > 	fsstress <run only readonly ops>
> > 	mount -o remount,rw			# (3)
> > done
> > 
> > When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> > call xfs_blockgc_stop to walk the inode cache to free all the COW
> > extents, but the blockgc mechanism races with (1)'s reader threads to
> > take IOLOCKs and loses, which means that it doesn't clean them all out.
> > Call such a file (A).
> > 
> > When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> > walks the ondisk refcount btree and frees any COW extent that it finds.
> > This function does not check the inode cache, which means that incore
> > COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> > one of those former COW extents are allocated and mapped into another
> > file (B) and someone triggers a COW to the stale reservation in (A), A's
> > dirty data will be written into (B) and once that's done, those blocks
> > will be transferred to (A)'s data fork without bumping the refcount.
> > 
> > The results are catastrophic -- file (B) and the refcount btree are now
> > corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
> > synchronously, which causes xfs_icwalk to return to inodes that were
> > skipped because the blockgc code couldn't take the IOLOCK.  This is safe
> > to do here because the VFS has already prohibited new writer threads.
> > 
> > Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_super.c |   14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> Looks good, I went through the analysis yesterday when you mentioned
> it on #xfs. Minor nit below, otherwise:
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Thanks for the review!

> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e21459f9923a..0c07a4aef3b9 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1765,7 +1765,10 @@ static int
> >  xfs_remount_ro(
> >  	struct xfs_mount	*mp)
> >  {
> > -	int error;
> > +	struct xfs_icwalk	icw = {
> > +		.icw_flags	= XFS_ICWALK_FLAG_SYNC,
> > +	};
> > +	int			error;
> >  
> >  	/*
> >  	 * Cancel background eofb scanning so it cannot race with the final
> > @@ -1773,8 +1776,13 @@ xfs_remount_ro(
> >  	 */
> >  	xfs_blockgc_stop(mp);
> >  
> > -	/* Get rid of any leftover CoW reservations... */
> > -	error = xfs_blockgc_free_space(mp, NULL);
> > +	/*
> > +	 * Clean out all remaining COW staging extents.  This extra step is
> > +	 * done synchronously because the background blockgc worker could have
> > +	 * raced with a reader thread and failed to grab an IOLOCK.  In that
> > +	 * case, the inode could still have post-eof and COW blocks.
> > +	 */
> 
> Rather than describe how inodes might be skipped here, the
> constraint we are operating under should be described. That is:
> 
> 	/*
> 	 * We need to clear out all remaining COW staging extents so
> 	 * that we don't leave inodes requiring modifications during
> 	 * inactivation and reclaim on a read-only mount. We must
> 	 * check and process every inode currently in memory, hence
> 	 * this requires a synchronous inode cache scan to be
> 	 * executed.
> 	 */

I will shorten this to:

	/*
	 * Clear out all remaining COW staging extents and speculative
	 * post-EOF preallocations so that we don't leave inodes
	 * requiring inactivation cleanups during reclaim on a read-only
	 * mount.  We must process every cached inode, so this requires
	 * a synchronous cache scan.
	 */

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-07 22:21   ` Dave Chinner
@ 2021-12-08  1:50     ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-08  1:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, wen.gang.wang

On Wed, Dec 08, 2021 at 09:21:09AM +1100, Dave Chinner wrote:
> On Tue, Dec 07, 2021 at 10:35:51AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > As part of multiple customer escalations due to file data corruption
> > after copy on write operations, I wrote some fstests that use fsstress
> > to hammer on COW to shake things loose.  Regrettably, I caught some
> > filesystem shutdowns due to incorrect rmap operations with the following
> > loop:
> > 
> > mount <filesystem>				# (0)
> > fsstress <run only readonly ops> &		# (1)
> > while true; do
> > 	fsstress <run all ops>
> > 	mount -o remount,ro			# (2)
> > 	fsstress <run only readonly ops>
> > 	mount -o remount,rw			# (3)
> > done
> > 
> > When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> > call xfs_blockgc_stop to walk the inode cache to free all the COW
> > extents, but the blockgc mechanism races with (1)'s reader threads to
> > take IOLOCKs and loses, which means that it doesn't clean them all out.
> > Call such a file (A).
> > 
> > When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> > walks the ondisk refcount btree and frees any COW extent that it finds.
> > This function does not check the inode cache, which means that incore
> > COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> > one of those former COW extents are allocated and mapped into another
> > file (B) and someone triggers a COW to the stale reservation in (A), A's
> > dirty data will be written into (B) and once that's done, those blocks
> > will be transferred to (A)'s data fork without bumping the refcount.
> > 
> > The results are catastrophic -- file (B) and the refcount btree are now
> > corrupt.  In the first patch, we fixed the race condition in (2) so that
> > (A) will always flush the COW fork.  In this second patch, we move the
> > _recover_cow call to the initial mount call in (0) for safety.
> > 
> > As mentioned previously, xfs_reflink_recover_cow walks the refcount
> > btree looking for COW staging extents, and frees them.  This was
> > intended to be run at mount time (when we know there are no live inodes)
> > to clean up any leftover staging events that may have been left behind
> > during an unclean shutdown.  As a time "optimization" for readonly
> > mounts, we deferred this to the ro->rw transition, not realizing that
> > any failure to clean all COW forks during a rw->ro transition would
> > result in catastrophic corruption.
> > 
> > Therefore, remove this optimization and only run the recovery routine
> > when we're guaranteed not to have any COW staging extents anywhere,
> > which means we always run this at mount time.
> > 
> > Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
> The mechanism looks fine, comments below.
> 
> > ---
> >  fs/xfs/xfs_mount.c   |   37 ++++++++++++++++++++++++++++---------
> >  fs/xfs/xfs_reflink.c |    4 +++-
> >  fs/xfs/xfs_super.c   |    9 ---------
> >  3 files changed, 31 insertions(+), 19 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 359109b6f0d3..064ff89a4557 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -918,6 +918,34 @@ xfs_mountfs(
> >  		xfs_qm_mount_quotas(mp);
> >  	}
> >  
> > +	/*
> > +	 * Recover any CoW staging blocks that are still referenced by the
> > +	 * ondisk refcount metadata.  The ondisk metadata does not track which
> > +	 * inode created the staging extent, which means that we don't have an
> > +	 * easy means to figure out if a given staging extent is referenced by
> > +	 * a cached inode or is a leftover from a previous unclean shutdown,
> > +	 * short of scanning the entire inode cache to construct a bitmap of
> > +	 * actually stale extents.
> 
> This really isn't commentary that is need for recovery - it should
> be in the comment above xfs_reflink_recover_cow().

<nod>

> > +	 *
> > +	 * During mount, we know that zero files have been exposed to user
> > +	 * modifications, which means that there cannot possibly be any live
> > +	 * staging extents.  Therefore, it is safe to free them all right now,
> > +	 * even if we're performing a readonly mount.
> > +	 *
> > +	 * This cannot be deferred this to rw remount time if we're performing
> > +	 * a readonly mount (as XFS once did) until there's an interlock with
> > +	 * cached inodes.
> > +	 */
> 
> I'm not sure this last bit is necessary here - it's largely covered by
> the commit message and the comment added to xfs_reflink_end_cow().
> 
> It seems to me that the comment here can be reduced to:
> 
> 	/*
> 	 * Recover any CoW staging blocks that are still referenced by the
> 	 * ondisk refcount metadata.  During mount there cannot be any live
> 	 * staging extents as we have not run any user modifications.
> 	 * Therefore, it is safe to free them all right now, even on a
> 	 * read-only mount.
> 	 */
> 
> And the rest of the stuff about live staging extents vs on disk metadata state,
> ro->rw remounts, etc all goes into an more complete explanation of the
> limitations of xfs_reflink_recover_cow() in the comment above
> xfs_reflink_recover_cow()....

Changed, though I'll tweak the second sentence to end with "..as we have
not permitted any user modifications."

> > +	if (!xfs_has_norecovery(mp)) {
> > +		error = xfs_reflink_recover_cow(mp);
> > +		if (error) {
> > +			xfs_err(mp,
> > +	"Error %d recovering leftover CoW allocations.", error);
> > +			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +			goto out_quota;
> > +		}
> > +	}
> 
> This strikes me as something xfs_log_mount_finish() should do, as
> it's a post-replay recovery operation and should be done before
> we push the AIL and force the log at the end of log recovery....

Oof, yes.  The refcount btree /requires/ that there be a per-AG
reservation waiting for it, which means that it's current location
after xfs_reserve_blocks and before xfs_fs_reserve_ag_blocks is wrong.

> > +
> >  	/*
> >  	 * Now we are mounted, reserve a small amount of unused space for
> >  	 * privileged transactions. This is needed so that transaction
> > @@ -936,15 +964,6 @@ xfs_mountfs(
> >  			xfs_warn(mp,
> >  	"Unable to allocate reserve blocks. Continuing without reserve pool.");
> >  
> > -		/* Recover any CoW blocks that never got remapped. */
> > -		error = xfs_reflink_recover_cow(mp);
> > -		if (error) {
> > -			xfs_err(mp,
> > -	"Error %d recovering leftover CoW allocations.", error);
> > -			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > -			goto out_quota;
> > -		}
> > -
> >  		/* Reserve AG blocks for future btree expansion. */
> >  		error = xfs_fs_reserve_ag_blocks(mp);
> >  		if (error && error != -ENOSPC)
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index cb0edb1d68ef..a571489ef0bd 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -749,7 +749,9 @@ xfs_reflink_end_cow(
> >  }
> >  
> >  /*
> > - * Free leftover CoW reservations that didn't get cleaned out.
> > + * Free leftover CoW reservations that didn't get cleaned out.  This function
> > + * does not coordinate with cached inode COW forks, which means that callers
> > + * must ensure there are no COW staging extents attached to any cached inodes.
> >   */
> 
> Side note - I've noticed a lot of new comments are in the form of "X does Y,
> which means A needs to do B". Speaking for myself, I find it much easier to
> understand comments in the "rule before reason" form. i.e. "A needs to do B
> because X does Y".
> 
> The rule that we need to follow (A needs to do B) is the important thing
> readers need to understand and so it should be the primary object in the
> sentence/paragraph. They don't necessarily need to understand exactly
> why that rule needs to be followed, but they need to know about the rule.
> Putting the rule after a chunk of complex reasoning means the importance/clarity
> of the rule is lost/less obvious.
> 
> Rewritten in "rule before reason" form:
> 
> "Callers must ensure there are no COW staging extents attached to any cached
> inodes as this function does not co-ordinate with cached inode COW forks."
> 
> Now it's obvious that until there is an interlock between
> xfs_reflink_recover_cow() and cached inodes, this rule needs to be followed.
> There's also no need to add commentary to say "this rule needs to be followed
> until there's an interlock with cached inodes" because that's obvious from the
> rule and the reason for the rule...

<nod> I'll try to keep that in mind for subsequent patches, since these
suggestions significantly shortened the comments.  The comment now
reads:

/*
 * Free all CoW staging blocks that are still referenced by the ondisk
 * refcount metadata.  The ondisk metadata does not track which inode
 * created the staging extent, so callers must ensure that there are no
 * cached inodes with live CoW staging extents.
 */


--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-08 23:15 [PATCHSET V2 for-5.16 0/2] xfs: fix data corruption when cycling ro/rw mounts Darrick J. Wong
@ 2021-12-08 23:15 ` Darrick J. Wong
  2021-12-09  5:41   ` Dave Chinner
  2021-12-09 14:44   ` Chandan Babu R
  0 siblings, 2 replies; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-08 23:15 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, david, wen.gang.wang

From: Darrick J. Wong <djwong@kernel.org>

As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.  While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
 fs/xfs/xfs_mount.c   |   10 ----------
 fs/xfs/xfs_reflink.c |    5 ++++-
 fs/xfs/xfs_super.c   |    9 ---------
 4 files changed, 26 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 89fec9a18c34..c17344fc1275 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_inode.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_trans.h"
@@ -20,6 +21,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_sb.h"
 #include "xfs_health.h"
+#include "xfs_reflink.h"

 struct kmem_cache	*xfs_log_ticket_cache;

@@ -847,9 +849,28 @@ xfs_log_mount_finish(
 	/* Make sure the log is dead if we're returning failure. */
 	ASSERT(!error || xlog_is_shutdown(log));

-	return error;
+	if (error)
+		return error;
+
+	/*
+	 * Recover any CoW staging blocks that are still referenced by the
+	 * ondisk refcount metadata.  During mount there cannot be any live
+	 * staging extents as we have not permitted any user modifications.
+	 * Therefore, it is safe to free them all right now, even on a
+	 * read-only mount.
+	 */
+	error = xfs_reflink_recover_cow(mp);
+	if (error) {
+		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
+				error);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		return error;
+	}
+
+	return 0;
 }

+
 /*
  * The mount has failed. Cancel the recovery if it hasn't completed and destroy
  * the log.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 359109b6f0d3..bed73e8002a5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -936,15 +936,6 @@ xfs_mountfs(
 			xfs_warn(mp,
 	"Unable to allocate reserve blocks. Continuing without reserve pool.");

-		/* Recover any CoW blocks that never got remapped. */
-		error = xfs_reflink_recover_cow(mp);
-		if (error) {
-			xfs_err(mp,
-	"Error %d recovering leftover CoW allocations.", error);
-			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-			goto out_quota;
-		}
-
 		/* Reserve AG blocks for future btree expansion. */
 		error = xfs_fs_reserve_ag_blocks(mp);
 		if (error && error != -ENOSPC)
@@ -955,7 +946,6 @@ xfs_mountfs(

  out_agresv:
 	xfs_fs_unreserve_ag_blocks(mp);
- out_quota:
 	xfs_qm_unmount_quotas(mp);
  out_rtunmount:
 	xfs_rtunmount_inodes(mp);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cb0edb1d68ef..8b6c7163f684 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -749,7 +749,10 @@ xfs_reflink_end_cow(
 }

 /*
- * Free leftover CoW reservations that didn't get cleaned out.
+ * Free all CoW staging blocks that are still referenced by the ondisk refcount
+ * metadata.  The ondisk metadata does not track which inode created the
+ * staging extent, so callers must ensure that there are no cached inodes with
+ * live CoW staging extents.
  */
 int
 xfs_reflink_recover_cow(
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 778b57b1f020..c7ac486ca5d3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1739,15 +1739,6 @@ xfs_remount_rw(
 	 */
 	xfs_restore_resvblks(mp);
 	xfs_log_work_queue(mp);
-
-	/* Recover any CoW blocks that never got remapped. */
-	error = xfs_reflink_recover_cow(mp);
-	if (error) {
-		xfs_err(mp,
-			"Error %d recovering leftover CoW allocations.", error);
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		return error;
-	}
 	xfs_blockgc_start(mp);

 	/* Create the per-AG metadata reservation pool .*/

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-08 23:15 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
@ 2021-12-09  5:41   ` Dave Chinner
  2021-12-11  1:24     ` Darrick J. Wong
  2021-12-09 14:44   ` Chandan Babu R
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2021-12-09  5:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, wen.gang.wang

On Wed, Dec 08, 2021 at 03:15:16PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
> 
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
> 
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
> 
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
> 
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
> 
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
> 
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.  While we're at it, move
> the callsite to xfs_log_mount_finish because any refcount btree
> expansion (however unlikely given that we're removing records from the
> right side of the index) must be fed by a per-AG reservation, which
> doesn't exist in its current location.
> 
> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
>  fs/xfs/xfs_mount.c   |   10 ----------
>  fs/xfs/xfs_reflink.c |    5 ++++-
>  fs/xfs/xfs_super.c   |    9 ---------
>  4 files changed, 26 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 89fec9a18c34..c17344fc1275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_inode.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
> @@ -20,6 +21,7 @@
>  #include "xfs_sysfs.h"
>  #include "xfs_sb.h"
>  #include "xfs_health.h"
> +#include "xfs_reflink.h"
>  
>  struct kmem_cache	*xfs_log_ticket_cache;
>  
> @@ -847,9 +849,28 @@ xfs_log_mount_finish(
>  	/* Make sure the log is dead if we're returning failure. */
>  	ASSERT(!error || xlog_is_shutdown(log));
>  
> -	return error;
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Recover any CoW staging blocks that are still referenced by the
> +	 * ondisk refcount metadata.  During mount there cannot be any live
> +	 * staging extents as we have not permitted any user modifications.
> +	 * Therefore, it is safe to free them all right now, even on a
> +	 * read-only mount.
> +	 */
> +	error = xfs_reflink_recover_cow(mp);
> +	if (error) {
> +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> +				error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		return error;
> +	}
> +
> +	return 0;
>  }

THis is after we've emitted an "Ending recovery ...." log message.
I kinda expected you to move this up to just after the
evict_inodes() call before we force the log, push the AIL, drain
the buffers used during recovery and potentially turn the
"filesystem is read-only" flag back on.

i.e. if this is a recovery operation, it should be done before we
finish recovery....

Other than that, the change is fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-08 23:15 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
  2021-12-09  5:41   ` Dave Chinner
@ 2021-12-09 14:44   ` Chandan Babu R
  1 sibling, 0 replies; 11+ messages in thread
From: Chandan Babu R @ 2021-12-09 14:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, wen.gang.wang

On 09 Dec 2021 at 04:45, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> As part of multiple customer escalations due to file data corruption
> after copy on write operations, I wrote some fstests that use fsstress
> to hammer on COW to shake things loose.  Regrettably, I caught some
> filesystem shutdowns due to incorrect rmap operations with the following
> loop:
>
> mount <filesystem>				# (0)
> fsstress <run only readonly ops> &		# (1)
> while true; do
> 	fsstress <run all ops>
> 	mount -o remount,ro			# (2)
> 	fsstress <run only readonly ops>
> 	mount -o remount,rw			# (3)
> done
>
> When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> call xfs_blockgc_stop to walk the inode cache to free all the COW
> extents, but the blockgc mechanism races with (1)'s reader threads to
> take IOLOCKs and loses, which means that it doesn't clean them all out.
> Call such a file (A).
>
> When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> walks the ondisk refcount btree and frees any COW extent that it finds.
> This function does not check the inode cache, which means that incore
> COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> one of those former COW extents are allocated and mapped into another
> file (B) and someone triggers a COW to the stale reservation in (A), A's
> dirty data will be written into (B) and once that's done, those blocks
> will be transferred to (A)'s data fork without bumping the refcount.
>
> The results are catastrophic -- file (B) and the refcount btree are now
> corrupt.  In the first patch, we fixed the race condition in (2) so that
> (A) will always flush the COW fork.  In this second patch, we move the
> _recover_cow call to the initial mount call in (0) for safety.
>
> As mentioned previously, xfs_reflink_recover_cow walks the refcount
> btree looking for COW staging extents, and frees them.  This was
> intended to be run at mount time (when we know there are no live inodes)
> to clean up any leftover staging events that may have been left behind
> during an unclean shutdown.  As a time "optimization" for readonly
> mounts, we deferred this to the ro->rw transition, not realizing that
> any failure to clean all COW forks during a rw->ro transition would
> result in catastrophic corruption.
>
> Therefore, remove this optimization and only run the recovery routine
> when we're guaranteed not to have any COW staging extents anywhere,
> which means we always run this at mount time.  While we're at it, move
> the callsite to xfs_log_mount_finish because any refcount btree
> expansion (however unlikely given that we're removing records from the
> right side of the index) must be fed by a per-AG reservation, which
> doesn't exist in its current location.
>

With Dave's review comments addressed,

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
>  fs/xfs/xfs_mount.c   |   10 ----------
>  fs/xfs/xfs_reflink.c |    5 ++++-
>  fs/xfs/xfs_super.c   |    9 ---------
>  4 files changed, 26 insertions(+), 21 deletions(-)
>
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 89fec9a18c34..c17344fc1275 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_inode.h"
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
> @@ -20,6 +21,7 @@
>  #include "xfs_sysfs.h"
>  #include "xfs_sb.h"
>  #include "xfs_health.h"
> +#include "xfs_reflink.h"
>  
>  struct kmem_cache	*xfs_log_ticket_cache;
>  
> @@ -847,9 +849,28 @@ xfs_log_mount_finish(
>  	/* Make sure the log is dead if we're returning failure. */
>  	ASSERT(!error || xlog_is_shutdown(log));
>  
> -	return error;
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Recover any CoW staging blocks that are still referenced by the
> +	 * ondisk refcount metadata.  During mount there cannot be any live
> +	 * staging extents as we have not permitted any user modifications.
> +	 * Therefore, it is safe to free them all right now, even on a
> +	 * read-only mount.
> +	 */
> +	error = xfs_reflink_recover_cow(mp);
> +	if (error) {
> +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> +				error);
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		return error;
> +	}
> +
> +	return 0;
>  }
>  
> +
>  /*
>   * The mount has failed. Cancel the recovery if it hasn't completed and destroy
>   * the log.
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 359109b6f0d3..bed73e8002a5 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -936,15 +936,6 @@ xfs_mountfs(
>  			xfs_warn(mp,
>  	"Unable to allocate reserve blocks. Continuing without reserve pool.");
>  
> -		/* Recover any CoW blocks that never got remapped. */
> -		error = xfs_reflink_recover_cow(mp);
> -		if (error) {
> -			xfs_err(mp,
> -	"Error %d recovering leftover CoW allocations.", error);
> -			xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -			goto out_quota;
> -		}
> -
>  		/* Reserve AG blocks for future btree expansion. */
>  		error = xfs_fs_reserve_ag_blocks(mp);
>  		if (error && error != -ENOSPC)
> @@ -955,7 +946,6 @@ xfs_mountfs(
>  
>   out_agresv:
>  	xfs_fs_unreserve_ag_blocks(mp);
> - out_quota:
>  	xfs_qm_unmount_quotas(mp);
>   out_rtunmount:
>  	xfs_rtunmount_inodes(mp);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index cb0edb1d68ef..8b6c7163f684 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -749,7 +749,10 @@ xfs_reflink_end_cow(
>  }
>  
>  /*
> - * Free leftover CoW reservations that didn't get cleaned out.
> + * Free all CoW staging blocks that are still referenced by the ondisk refcount
> + * metadata.  The ondisk metadata does not track which inode created the
> + * staging extent, so callers must ensure that there are no cached inodes with
> + * live CoW staging extents.
>   */
>  int
>  xfs_reflink_recover_cow(
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 778b57b1f020..c7ac486ca5d3 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1739,15 +1739,6 @@ xfs_remount_rw(
>  	 */
>  	xfs_restore_resvblks(mp);
>  	xfs_log_work_queue(mp);
> -
> -	/* Recover any CoW blocks that never got remapped. */
> -	error = xfs_reflink_recover_cow(mp);
> -	if (error) {
> -		xfs_err(mp,
> -			"Error %d recovering leftover CoW allocations.", error);
> -		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -		return error;
> -	}
>  	xfs_blockgc_start(mp);
>  
>  	/* Create the per-AG metadata reservation pool .*/


-- 
chandan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents
  2021-12-09  5:41   ` Dave Chinner
@ 2021-12-11  1:24     ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2021-12-11  1:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, wen.gang.wang

On Thu, Dec 09, 2021 at 04:41:49PM +1100, Dave Chinner wrote:
> On Wed, Dec 08, 2021 at 03:15:16PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > As part of multiple customer escalations due to file data corruption
> > after copy on write operations, I wrote some fstests that use fsstress
> > to hammer on COW to shake things loose.  Regrettably, I caught some
> > filesystem shutdowns due to incorrect rmap operations with the following
> > loop:
> > 
> > mount <filesystem>				# (0)
> > fsstress <run only readonly ops> &		# (1)
> > while true; do
> > 	fsstress <run all ops>
> > 	mount -o remount,ro			# (2)
> > 	fsstress <run only readonly ops>
> > 	mount -o remount,rw			# (3)
> > done
> > 
> > When (2) happens, notice that (1) is still running.  xfs_remount_ro will
> > call xfs_blockgc_stop to walk the inode cache to free all the COW
> > extents, but the blockgc mechanism races with (1)'s reader threads to
> > take IOLOCKs and loses, which means that it doesn't clean them all out.
> > Call such a file (A).
> > 
> > When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
> > walks the ondisk refcount btree and frees any COW extent that it finds.
> > This function does not check the inode cache, which means that incore
> > COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
> > one of those former COW extents are allocated and mapped into another
> > file (B) and someone triggers a COW to the stale reservation in (A), A's
> > dirty data will be written into (B) and once that's done, those blocks
> > will be transferred to (A)'s data fork without bumping the refcount.
> > 
> > The results are catastrophic -- file (B) and the refcount btree are now
> > corrupt.  In the first patch, we fixed the race condition in (2) so that
> > (A) will always flush the COW fork.  In this second patch, we move the
> > _recover_cow call to the initial mount call in (0) for safety.
> > 
> > As mentioned previously, xfs_reflink_recover_cow walks the refcount
> > btree looking for COW staging extents, and frees them.  This was
> > intended to be run at mount time (when we know there are no live inodes)
> > to clean up any leftover staging events that may have been left behind
> > during an unclean shutdown.  As a time "optimization" for readonly
> > mounts, we deferred this to the ro->rw transition, not realizing that
> > any failure to clean all COW forks during a rw->ro transition would
> > result in catastrophic corruption.
> > 
> > Therefore, remove this optimization and only run the recovery routine
> > when we're guaranteed not to have any COW staging extents anywhere,
> > which means we always run this at mount time.  While we're at it, move
> > the callsite to xfs_log_mount_finish because any refcount btree
> > expansion (however unlikely given that we're removing records from the
> > right side of the index) must be fed by a per-AG reservation, which
> > doesn't exist in its current location.
> > 
> > Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_log.c     |   23 ++++++++++++++++++++++-
> >  fs/xfs/xfs_mount.c   |   10 ----------
> >  fs/xfs/xfs_reflink.c |    5 ++++-
> >  fs/xfs/xfs_super.c   |    9 ---------
> >  4 files changed, 26 insertions(+), 21 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 89fec9a18c34..c17344fc1275 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -10,6 +10,7 @@
> >  #include "xfs_log_format.h"
> >  #include "xfs_trans_resv.h"
> >  #include "xfs_mount.h"
> > +#include "xfs_inode.h"
> >  #include "xfs_errortag.h"
> >  #include "xfs_error.h"
> >  #include "xfs_trans.h"
> > @@ -20,6 +21,7 @@
> >  #include "xfs_sysfs.h"
> >  #include "xfs_sb.h"
> >  #include "xfs_health.h"
> > +#include "xfs_reflink.h"
> >  
> >  struct kmem_cache	*xfs_log_ticket_cache;
> >  
> > @@ -847,9 +849,28 @@ xfs_log_mount_finish(
> >  	/* Make sure the log is dead if we're returning failure. */
> >  	ASSERT(!error || xlog_is_shutdown(log));
> >  
> > -	return error;
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Recover any CoW staging blocks that are still referenced by the
> > +	 * ondisk refcount metadata.  During mount there cannot be any live
> > +	 * staging extents as we have not permitted any user modifications.
> > +	 * Therefore, it is safe to free them all right now, even on a
> > +	 * read-only mount.
> > +	 */
> > +	error = xfs_reflink_recover_cow(mp);
> > +	if (error) {
> > +		xfs_err(mp, "Error %d recovering leftover CoW allocations.",
> > +				error);
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +		return error;
> > +	}
> > +
> > +	return 0;
> >  }
> 
> THis is after we've emitted an "Ending recovery ...." log message.
> I kinda expected you to move this up to just after the
> evict_inodes() call before we force the log, push the AIL, drain
> the buffers used during recovery and potentially turn the
> "filesystem is read-only" flag back on.
> 
> i.e. if this is a recovery operation, it should be done before we
> finish recovery....
> 
> Other than that, the change is fine.

It's a recovery function, albeit one that we always run during mount,
even if the log didn't require recovery.  That's a behavior change,
which is beyond the scope of a fix patch.  Not to mention it causes a
hang in xfs/434, which means now I have to withdraw this until I sort
out why we stall forever in xfs_buftarg_drain, and ftrace isn't helpful
in this regard.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-12-11  1:24 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-12-07 18:35 [PATCHSET 5.16-rcX 0/2] xfs: fix data corruption when cycling ro/rw mounts Darrick J. Wong
2021-12-07 18:35 ` [PATCH 1/2] xfs: remove all COW fork extents when remounting readonly Darrick J. Wong
2021-12-07 21:33   ` Dave Chinner
2021-12-08  1:38     ` Darrick J. Wong
2021-12-07 18:35 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
2021-12-07 22:21   ` Dave Chinner
2021-12-08  1:50     ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2021-12-08 23:15 [PATCHSET V2 for-5.16 0/2] xfs: fix data corruption when cycling ro/rw mounts Darrick J. Wong
2021-12-08 23:15 ` [PATCH 2/2] xfs: only run COW extent recovery when there are no live extents Darrick J. Wong
2021-12-09  5:41   ` Dave Chinner
2021-12-11  1:24     ` Darrick J. Wong
2021-12-09 14:44   ` Chandan Babu R

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox