[PATCH 0/3] xfs, iomap: fix writeback failure handling

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] xfs, iomap: fix writeback failure handling
@ 2023-02-14  5:51 Dave Chinner
  2023-02-14  5:51 ` [PATCH 1/3] xfs: report block map corruption errors to the health tracking system Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-14  5:51 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

We just had a report of a WARN in the XFS writeback code where
delayed allocation was not finding a delayed allocation extent in
the extent tree here:

https://bugzilla.kernel.org/show_bug.cgi?id=217030

Turns out that this is a regression that resulted from removing the
dirty page invalidation on writeback error behaviour that XFS had
for many, many years. Essentially, if we are not invalidating the
dirty cached data on error, we should not be invalidating the
delalloc extent that backs the dirty data. Bad things happen when we
do that.....

This series of patches first adds Darrick's code to mark inodes as
unhealthy when bad extent maps or corruption during allocation is
detected.

The second patch expands on this sickness detection to
cover delalloc conversion failures due to corruption detected during
allocation. It then uses this sickness to trigger removal of the
unconvertable delalloc extents after the VFS has discarded the
cached data during inode reclaim, rather than throwing warnings and
assert failures due to stray unconverted delalloc extents. Those
will still happen if the inode is healthy, hence the need for
ensuring we mark inodes sick correctly.

The last patch then removes xfs_discard_folio() as all it does is
punch the delalloc extent incorrectly. Given that there are now no
other users of ->discard_folio(), that gets removed too.

This has run for a couple of hours with the original reproducer
code, whereas without these patches a current 6.2-rc7 kernel fails
in seconds. No fstests regressions have been seen either, with both
1kB and 4kB block size auto group tests runs now completed.

-Dave.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] xfs: report block map corruption errors to the health tracking system
  2023-02-14  5:51 [PATCH 0/3] xfs, iomap: fix writeback failure handling Dave Chinner
@ 2023-02-14  5:51 ` Dave Chinner
  2023-02-14  8:03   ` Christoph Hellwig
  2023-02-14  5:51 ` [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists Dave Chinner
  2023-02-14  5:51 ` [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-02-14  5:51 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: "Darrick J. Wong" <djwong@kernel.org>

Whenever we encounter a corrupt block mapping, we should report that to
the health monitoring system for later reporting.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[dgc: open coded xfs_metadata_is_sick() macro]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c   | 35 +++++++++++++++++++++++++++++------
 fs/xfs/libxfs/xfs_health.h |  1 +
 fs/xfs/xfs_health.c        | 26 ++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.c         | 15 ++++++++++++---
 fs/xfs/xfs_reflink.c       |  6 +++++-
 5 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c8c65387136c..958e4bb2e51e 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -36,6 +36,7 @@
 #include "xfs_refcount.h"
 #include "xfs_icache.h"
 #include "xfs_iomap.h"
+#include "xfs_health.h"
 
 struct kmem_cache		*xfs_bmap_intent_cache;
 
@@ -971,6 +972,7 @@ xfs_bmap_add_attrfork_local(
 
 	/* should only be called for types that support local format data */
 	ASSERT(0);
+	xfs_bmap_mark_sick(ip, XFS_ATTR_FORK);
 	return -EFSCORRUPTED;
 }
 
@@ -1126,6 +1128,7 @@ xfs_iread_bmbt_block(
 				(unsigned long long)ip->i_ino);
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, block,
 				sizeof(*block), __this_address);
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -1141,6 +1144,7 @@ xfs_iread_bmbt_block(
 			xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 					"xfs_iread_extents(2)", frp,
 					sizeof(*frp), fa);
+			xfs_bmap_mark_sick(ip, whichfork);
 			return -EFSCORRUPTED;
 		}
 		xfs_iext_insert(ip, &ir->icur, &new,
@@ -1189,6 +1193,8 @@ xfs_iread_extents(
 	ASSERT(ir.loaded == xfs_iext_count(ifp));
 	return 0;
 out:
+	if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))
+		xfs_bmap_mark_sick(ip, whichfork);
 	xfs_iext_destroy(ifp);
 	return error;
 }
@@ -1268,6 +1274,7 @@ xfs_bmap_last_before(
 		break;
 	default:
 		ASSERT(0);
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -3879,12 +3886,16 @@ xfs_bmapi_read(
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK | XFS_BMAPI_ENTIRE)));
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED|XFS_ILOCK_EXCL));
 
-	if (WARN_ON_ONCE(!ifp))
+	if (WARN_ON_ONCE(!ifp)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
-	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT))
+	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
@@ -4365,6 +4376,7 @@ xfs_bmapi_write(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -4592,9 +4604,11 @@ xfs_bmapi_convert_delalloc(
 	error = -ENOSPC;
 	if (WARN_ON_ONCE(bma.blkno == NULLFSBLOCK))
 		goto out_finish;
-	error = -EFSCORRUPTED;
-	if (WARN_ON_ONCE(!xfs_valid_startblock(ip, bma.got.br_startblock)))
+	if (WARN_ON_ONCE(!xfs_valid_startblock(ip, bma.got.br_startblock))) {
+		xfs_bmap_mark_sick(ip, whichfork);
+		error = -EFSCORRUPTED;
 		goto out_finish;
+	}
 
 	XFS_STATS_ADD(mp, xs_xstrat_bytes, XFS_FSB_TO_B(mp, bma.length));
 	XFS_STATS_INC(mp, xs_xstrat_quick);
@@ -4653,6 +4667,7 @@ xfs_bmapi_remap(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -5291,8 +5306,10 @@ __xfs_bunmapi(
 	whichfork = xfs_bmapi_whichfork(flags);
 	ASSERT(whichfork != XFS_COW_FORK);
 	ifp = xfs_ifork_ptr(ip, whichfork);
-	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)))
+	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp))) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
+	}
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
@@ -5762,6 +5779,7 @@ xfs_bmap_collapse_extents(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -5877,6 +5895,7 @@ xfs_bmap_insert_extents(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -5980,6 +5999,7 @@ xfs_bmap_split_extent(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(ifp)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, whichfork);
 		return -EFSCORRUPTED;
 	}
 
@@ -6161,8 +6181,10 @@ xfs_bmap_finish_one(
 			bmap->br_startoff, bmap->br_blockcount,
 			bmap->br_state);
 
-	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK))
+	if (WARN_ON_ONCE(bi->bi_whichfork != XFS_DATA_FORK)) {
+		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
 		return -EFSCORRUPTED;
+	}
 
 	if (XFS_TEST_ERROR(false, tp->t_mountp,
 			XFS_ERRTAG_BMAP_FINISH_ONE))
@@ -6180,6 +6202,7 @@ xfs_bmap_finish_one(
 		break;
 	default:
 		ASSERT(0);
+		xfs_bmap_mark_sick(bi->bi_owner, bi->bi_whichfork);
 		error = -EFSCORRUPTED;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 99e796256c5d..b6bfa3b17b1e 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -120,6 +120,7 @@ void xfs_inode_measure_sickness(struct xfs_inode *ip, unsigned int *sick,
 		unsigned int *checked);
 
 void xfs_health_unmount(struct xfs_mount *mp);
+void xfs_bmap_mark_sick(struct xfs_inode *ip, int whichfork);
 
 /* Now some helpers. */
 
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 72a075bb2c10..9887fb3b9b0f 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -393,3 +393,29 @@ xfs_bulkstat_health(
 			bs->bs_sick |= m->ioctl_mask;
 	}
 }
+
+/* Mark a block mapping sick. */
+void
+xfs_bmap_mark_sick(
+	struct xfs_inode	*ip,
+	int			whichfork)
+{
+	unsigned int		mask;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		mask = XFS_SICK_INO_BMBTD;
+		break;
+	case XFS_ATTR_FORK:
+		mask = XFS_SICK_INO_BMBTA;
+		break;
+	case XFS_COW_FORK:
+		mask = XFS_SICK_INO_BMBTC;
+		break;
+	default:
+		ASSERT(0);
+		return;
+	}
+
+	xfs_inode_mark_sick(ip, mask);
+}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index fc1946f80a4a..c2ba03281daf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,6 +27,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "xfs_health.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -45,6 +46,7 @@ xfs_alert_fsblock_zero(
 		(unsigned long long)imap->br_startoff,
 		(unsigned long long)imap->br_blockcount,
 		imap->br_state);
+	xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 	return -EFSCORRUPTED;
 }
 
@@ -99,8 +101,10 @@ xfs_bmbt_to_iomap(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 
-	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
+	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		return xfs_alert_fsblock_zero(ip, imap);
+	}
 
 	if (imap->br_startblock == HOLESTARTBLOCK) {
 		iomap->addr = IOMAP_NULL_ADDR;
@@ -325,8 +329,10 @@ xfs_iomap_write_direct(
 		goto out_unlock;
 	}
 
-	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock)))
+	if (unlikely(!xfs_valid_startblock(ip, imap->br_startblock))) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		error = xfs_alert_fsblock_zero(ip, imap);
+	}
 
 out_unlock:
 	*seq = xfs_iomap_inode_sequence(ip, 0);
@@ -639,8 +645,10 @@ xfs_iomap_write_unwritten(
 		if (error)
 			return error;
 
-		if (unlikely(!xfs_valid_startblock(ip, imap.br_startblock)))
+		if (unlikely(!xfs_valid_startblock(ip, imap.br_startblock))) {
+			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 			return xfs_alert_fsblock_zero(ip, &imap);
+		}
 
 		if ((numblks_fsb = imap.br_blockcount) == 0) {
 			/*
@@ -986,6 +994,7 @@ xfs_buffered_write_iomap_begin(
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
+		xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 		error = -EFSCORRUPTED;
 		goto out_unlock;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 5535778a98f9..55604bbd25a4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -29,6 +29,7 @@
 #include "xfs_iomap.h"
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
+#include "xfs_health.h"
 
 /*
  * Copy on Write of Shared Blocks
@@ -1223,8 +1224,10 @@ xfs_reflink_remap_extent(
 	 * extent if they're both holes or both the same physical extent.
 	 */
 	if (dmap->br_startblock == smap.br_startblock) {
-		if (dmap->br_state != smap.br_state)
+		if (dmap->br_state != smap.br_state) {
+			xfs_bmap_mark_sick(ip, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
+		}
 		goto out_cancel;
 	}
 
@@ -1387,6 +1390,7 @@ xfs_reflink_remap_blocks(
 		ASSERT(nimaps == 1 && imap.br_startoff == srcoff);
 		if (imap.br_startblock == DELAYSTARTBLOCK) {
 			ASSERT(imap.br_startblock != DELAYSTARTBLOCK);
+			xfs_bmap_mark_sick(src, XFS_DATA_FORK);
 			error = -EFSCORRUPTED;
 			break;
 		}
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: report block map corruption errors to the health tracking system
  2023-02-14  5:51 ` [PATCH 1/3] xfs: report block map corruption errors to the health tracking system Dave Chinner
@ 2023-02-14  8:03   ` Christoph Hellwig
  2023-02-14 22:21     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2023-02-14  8:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 04:51:12PM +1100, Dave Chinner wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Whenever we encounter a corrupt block mapping, we should report that to
> the health monitoring system for later reporting.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> [dgc: open coded xfs_metadata_is_sick() macro]
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Just curious:  this is probably from a bigger series, which one is
that?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] xfs: report block map corruption errors to the health tracking system
  2023-02-14  8:03   ` Christoph Hellwig
@ 2023-02-14 22:21     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-14 22:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 12:03:18AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 14, 2023 at 04:51:12PM +1100, Dave Chinner wrote:
> > From: "Darrick J. Wong" <djwong@kernel.org>
> > 
> > Whenever we encounter a corrupt block mapping, we should report that to
> > the health monitoring system for later reporting.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > [dgc: open coded xfs_metadata_is_sick() macro]
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Just curious:  this is probably from a bigger series, which one is
> that?

[14/2/23 10:36] <djwong> branch @ https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists
  2023-02-14  5:51 [PATCH 0/3] xfs, iomap: fix writeback failure handling Dave Chinner
  2023-02-14  5:51 ` [PATCH 1/3] xfs: report block map corruption errors to the health tracking system Dave Chinner
@ 2023-02-14  5:51 ` Dave Chinner
  2023-02-14  8:13   ` Christoph Hellwig
  2023-02-14  5:51 ` [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2023-02-14  5:51 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

If we fail delayed allocation conversion because we cannot allocate
blocks, we end up in the situation where the inode extent list is
effectively corrupt and unresolvable. Whilst we have dirty data in
memory that we cannot allocate space for, we cannot write that data
back to disk. Unmounting a filesystem in this state results in data
loss.

In situations where we end up with a corrupt extent list in memory,
we can also be asked to convert a delayed region that does not have
a delalloc extent backing it. This should be considered a
corruption, too, not a "try again later" error.

These conversion failures result in the inode being sick and needing
repair, but we don't mark all the conditions that can lead to bmap
sickness being marked. Make sure that the error conditions that
indicate corruption are properly marked.

Further, if we trip over these corruptions conditions, we then have
to reclaim an inode that has unresolvable delayed allocation extents
attached to the inode. This generally happens at unmount and inode
inactivation will fire assert failures because we've left stray
delayed allocation extents behind on the inode. Hence we need to
ensure that we only trigger the stale delalloc extent checks if the
inode is fully healthy.

Even then, this may not be enough, because the inactivation code
assumes that there will be no stray delayed extents unless the
filesystem is shut down. If the inode is unhealthy, we need to
ensure we clean up delayed allocation extents within EOF because
the VFS has already tossed the data away. Hence there's no longer
any data over the delalloc extents to write back, so we need to also
toss the delayed allocation extents to ensure we release the space
reservation the delalloc extent holds. Failure to punch delalloc
extents in this case results in assert failures during unmount when
the delalloc block counter is torn down.

This all needs to be in place before the next patch which resolves a
bug in the iomap code that discards delalloc extents backing dirty
pages on writeback error without discarding the dirty data. Hence we
need to be able to handle delalloc extents in inode cleanup sanely,
rather than rely on incorrectly punching the delalloc extents on the
first writeback error that occurs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 13 ++++++++++---
 fs/xfs/xfs_icache.c      |  4 +++-
 fs/xfs/xfs_inode.c       | 10 ++++++++++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 958e4bb2e51e..fb718a5825d5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4553,8 +4553,12 @@ xfs_bmapi_convert_delalloc(
 		 * should only happen for the COW fork, where another thread
 		 * might have moved the extent to the data fork in the meantime.
 		 */
-		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
-		error = -EAGAIN;
+		if (whichfork != XFS_COW_FORK) {
+			xfs_bmap_mark_sick(ip, whichfork);
+			error = -EFSCORRUPTED;
+		} else {
+			error = -EAGAIN;
+		}
 		goto out_trans_cancel;
 	}

@@ -4598,8 +4602,11 @@ xfs_bmapi_convert_delalloc(
 		bma.prev.br_startoff = NULLFILEOFF;

 	error = xfs_bmapi_allocate(&bma);
-	if (error)
+	if (error) {
+		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))
+			xfs_bmap_mark_sick(ip, whichfork);
 		goto out_finish;
+	}

 	error = -ENOSPC;
 	if (WARN_ON_ONCE(bma.blkno == NULLFSBLOCK))
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index ddeaccc04aec..4354b6639dec 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -24,6 +24,7 @@
 #include "xfs_ialloc.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"

 #include <linux/iversion.h>

@@ -1810,7 +1811,8 @@ xfs_inodegc_set_reclaimable(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_perag	*pag;

-	if (!xfs_is_shutdown(mp) && ip->i_delayed_blks) {
+	if (ip->i_delayed_blks && xfs_inode_is_healthy(ip) &&
+	    !xfs_is_shutdown(mp)) {
 		xfs_check_delalloc(ip, XFS_DATA_FORK);
 		xfs_check_delalloc(ip, XFS_COW_FORK);
 		ASSERT(0);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d354ea2b74f9..06f1229ef628 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -37,6 +37,7 @@
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
+#include "xfs_health.h"

 struct kmem_cache *xfs_inode_cache;

@@ -1738,6 +1739,15 @@ xfs_inactive(
 		if (xfs_can_free_eofblocks(ip, true))
 			xfs_free_eofblocks(ip);

+		/*
+		 * If the inode is sick, then it might have delalloc extents
+		 * within EOF that we were unable to convert. We have to punch
+		 * them out here to release the reservation as there is no
+		 * longer any data to write back into the delalloc range now.
+		 */
+		if (!xfs_inode_is_healthy(ip))
+			xfs_bmap_punch_delalloc_range(ip, 0,
+						i_size_read(VFS_I(ip)));
 		goto out;
 	}

-- 
2.39.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists
  2023-02-14  5:51 ` [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists Dave Chinner
@ 2023-02-14  8:13   ` Christoph Hellwig
  2023-02-14 22:26     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2023-02-14  8:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 04:51:13PM +1100, Dave Chinner wrote:
> index 958e4bb2e51e..fb718a5825d5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4553,8 +4553,12 @@ xfs_bmapi_convert_delalloc(
>  		 * should only happen for the COW fork, where another thread
>  		 * might have moved the extent to the data fork in the meantime.
>  		 */
> -		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
> -		error = -EAGAIN;
> +		if (whichfork != XFS_COW_FORK) {
> +			xfs_bmap_mark_sick(ip, whichfork);
> +			error = -EFSCORRUPTED;
> +		} else {
> +			error = -EAGAIN;
> +		}

The comment above should probably be expanded a bit on what this means
for a non-cow fork extent and how we'll handle it later.

> +	if (error) {
> +		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))

Nit: no need for the inner braces.

>  
> +		/*
> +		 * If the inode is sick, then it might have delalloc extents
> +		 * within EOF that we were unable to convert. We have to punch
> +		 * them out here to release the reservation as there is no
> +		 * longer any data to write back into the delalloc range now.
> +		 */
> +		if (!xfs_inode_is_healthy(ip))
> +			xfs_bmap_punch_delalloc_range(ip, 0,
> +						i_size_read(VFS_I(ip)));

Is i_size_read the right check here?  The delalloc extent could extend
past i_size if i_size is not block aligned.  Can't we just simply pass
(xfs_off_t)-1 here?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists
  2023-02-14  8:13   ` Christoph Hellwig
@ 2023-02-14 22:26     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-14 22:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 12:13:19AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 14, 2023 at 04:51:13PM +1100, Dave Chinner wrote:
> > index 958e4bb2e51e..fb718a5825d5 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -4553,8 +4553,12 @@ xfs_bmapi_convert_delalloc(
> >  		 * should only happen for the COW fork, where another thread
> >  		 * might have moved the extent to the data fork in the meantime.
> >  		 */
> > -		WARN_ON_ONCE(whichfork != XFS_COW_FORK);
> > -		error = -EAGAIN;
> > +		if (whichfork != XFS_COW_FORK) {
> > +			xfs_bmap_mark_sick(ip, whichfork);
> > +			error = -EFSCORRUPTED;
> > +		} else {
> > +			error = -EAGAIN;
> > +		}
> 
> The comment above should probably be expanded a bit on what this means
> for a non-cow fork extent and how we'll handle it later.
> 
> > +	if (error) {
> > +		if ((error == -EFSCORRUPTED) || (error == -EFSBADCRC))
> 
> Nit: no need for the inner braces.
> 
> >  
> > +		/*
> > +		 * If the inode is sick, then it might have delalloc extents
> > +		 * within EOF that we were unable to convert. We have to punch
> > +		 * them out here to release the reservation as there is no
> > +		 * longer any data to write back into the delalloc range now.
> > +		 */
> > +		if (!xfs_inode_is_healthy(ip))
> > +			xfs_bmap_punch_delalloc_range(ip, 0,
> > +						i_size_read(VFS_I(ip)));
> 
> Is i_size_read the right check here?  The delalloc extent could extend
> past i_size if i_size is not block aligned.  Can't we just simply pass
> (xfs_off_t)-1 here?

Probably, we just killed all the delalloc blocks beyond eof via
xfs_free_eofblocks() in the line above this, so I didn't seem
necessary to try to punch blocks beyond EOF for this case. Easy
enough to do to be safe, just need a comment update to go with
it....

Cheers,

Dave.
> 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14  5:51 [PATCH 0/3] xfs, iomap: fix writeback failure handling Dave Chinner
  2023-02-14  5:51 ` [PATCH 1/3] xfs: report block map corruption errors to the health tracking system Dave Chinner
  2023-02-14  5:51 ` [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists Dave Chinner
@ 2023-02-14  5:51 ` Dave Chinner
  2023-02-14  8:14   ` Christoph Hellwig
  2023-02-14 18:10   ` Brian Foster
  2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-14  5:51 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

From: Dave Chinner <dchinner@redhat.com>

Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
after writeback errors") XFS and iomap have been retaining dirty
folios in memory after a writeback error. XFS no longer invalidates
the folio, and iomap no longer clears the folio uptodate state.

However, iomap is still been calling ->discard_folio on error, and
XFS is still punching the delayed allocation range backing the dirty
folio.

This is incorrect behaviour. The folio remains dirty and up to date,
meaning that another writeback will be attempted in the near future.
THis means that XFS is still going to have to allocate space for it
during writeback, and that means it still needs to have a delayed
allocation reservation and extent backing the dirty folio.

Failure to retain the delalloc extent (because xfs_discard_folio()
punched it out) means that the next writeback attempt does not find
an extent over the range of the write in ->map_blocks(), and
xfs_map_blocks() triggers a WARN_ON() because it should never land
in a hole for a data fork writeback request. This looks like:

[  647.356969] ------------[ cut here ]------------
[  647.359277] WARNING: CPU: 14 PID: 21913 at fs/xfs/libxfs/xfs_bmap.c:4510 xfs_bmapi_convert_delalloc+0x221/0x4e0
[  647.364551] Modules linked in:
[  647.366294] CPU: 14 PID: 21913 Comm: test_delalloc_c Not tainted 6.2.0-rc7-dgc+ #1754
[  647.370356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  647.374781] RIP: 0010:xfs_bmapi_convert_delalloc+0x221/0x4e0
[  647.377807] Code: e9 7d fe ff ff 80 bf 54 01 00 00 00 0f 84 68 fe ff ff 48 8d 47 70 48 89 04 24 e9 63 fe ff ff 83 fd 02 41 be f5 ff ff ff 74 a5 <0f> 0b eb a0
[  647.387242] RSP: 0018:ffffc9000aa677a8 EFLAGS: 00010293
[  647.389837] RAX: 0000000000000000 RBX: ffff88825bc4da00 RCX: 0000000000000000
[  647.393371] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88825bc4da40
[  647.396546] RBP: 0000000000000000 R08: ffffc9000aa67810 R09: ffffc9000aa67850
[  647.400186] R10: ffff88825bc4da00 R11: ffff888800a9aaac R12: ffff888101707000
[  647.403484] R13: ffffc9000aa677e0 R14: 00000000fffffff5 R15: 0000000000000004
[  647.406251] FS:  00007ff35ec24640(0000) GS:ffff88883ed00000(0000) knlGS:0000000000000000
[  647.410089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  647.413225] CR2: 00007f7292cbc5d0 CR3: 0000000807d0e004 CR4: 0000000000060ee0
[  647.416917] Call Trace:
[  647.418080]  <TASK>
[  647.419291]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[  647.421400]  xfs_map_blocks+0x1b7/0x590
[  647.422951]  iomap_do_writepage+0x1f1/0x7d0
[  647.424607]  ? __mod_lruvec_page_state+0x93/0x140
[  647.426419]  write_cache_pages+0x17b/0x4f0
[  647.428079]  ? iomap_read_end_io+0x2c0/0x2c0
[  647.429839]  iomap_writepages+0x1c/0x40
[  647.431377]  xfs_vm_writepages+0x79/0xb0
[  647.432826]  do_writepages+0xbd/0x1a0
[  647.434207]  ? obj_cgroup_release+0x73/0xb0
[  647.435769]  ? drain_obj_stock+0x130/0x290
[  647.437273]  ? avc_has_perm+0x8a/0x1a0
[  647.438746]  ? avc_has_perm_noaudit+0x8c/0x100
[  647.440223]  __filemap_fdatawrite_range+0x8e/0xa0
[  647.441960]  filemap_write_and_wait_range+0x3d/0xa0
[  647.444258]  __iomap_dio_rw+0x181/0x790
[  647.445960]  ? __schedule+0x385/0xa20
[  647.447829]  iomap_dio_rw+0xe/0x30
[  647.449284]  xfs_file_dio_write_aligned+0x97/0x150
[  647.451332]  ? selinux_file_permission+0x107/0x150
[  647.453299]  xfs_file_write_iter+0xd2/0x120
[  647.455238]  vfs_write+0x20d/0x3d0
[  647.456768]  ksys_write+0x69/0xf0
[  647.458067]  do_syscall_64+0x34/0x80
[  647.459488]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  647.461529] RIP: 0033:0x7ff3651406e9
[  647.463119] Code: 48 8d 3d 2a a1 0c 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f8
[  647.470563] RSP: 002b:00007ff35ec23df8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  647.473465] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff3651406e9
[  647.476278] RDX: 0000000000001400 RSI: 0000000020000000 RDI: 0000000000000005
[  647.478895] RBP: 00007ff35ec23e20 R08: 0000000000000005 R09: 0000000000000000
[  647.481568] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe533d8d4e
[  647.483751] R13: 00007ffe533d8d4f R14: 0000000000000000 R15: 00007ff35ec24640
[  647.486168]  </TASK>
[  647.487142] ---[ end trace 0000000000000000 ]---

Punching delalloc extents out from under dirty cached pages is wrong
and broken. We can't remove the delalloc extent until the page is
either removed from memory (i.e. invaliated) or writeback succeeds
in converting the delalloc extent to a real extent and writeback can
clean the page.

Hence we remove xfs_discard_folio() because it is only punching
delalloc blocks from under dirty pages now. With that removal,
nothing else uses ->discard_folio(), so we remove that from the
iomap infrastructure as well.

Reported-by: pengfei.xu@intel.com
Fixes: e9c3a8e820ed ("iomap: don't invalidate folios after writeback errors")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 16 +++-------------
 fs/xfs/xfs_aops.c      | 35 -----------------------------------
 include/linux/iomap.h  |  6 ------
 3 files changed, 3 insertions(+), 54 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 356193e44cf0..502fa2d41097 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1635,19 +1635,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	 * completion to mark the error state of the pages under writeback
 	 * appropriately.
 	 */
-	if (unlikely(error)) {
-		/*
-		 * Let the filesystem know what portion of the current page
-		 * failed to map. If the page hasn't been added to ioend, it
-		 * won't be affected by I/O completion and we must unlock it
-		 * now.
-		 */
-		if (wpc->ops->discard_folio)
-			wpc->ops->discard_folio(folio, pos);
-		if (!count) {
-			folio_unlock(folio);
-			goto done;
-		}
+	if (unlikely(error && !count)) {
+		folio_unlock(folio);
+		goto done;
 	}
 
 	folio_start_writeback(folio);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 41734202796f..3f0dae5ca9c2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -448,44 +448,9 @@ xfs_prepare_ioend(
 	return status;
 }
 
-/*
- * If the page has delalloc blocks on it, we need to punch them out before we
- * invalidate the page.  If we don't, we leave a stale delalloc mapping on the
- * inode that can trip up a later direct I/O read operation on the same region.
- *
- * We prevent this by truncating away the delalloc regions on the page.  Because
- * they are delalloc, we can do this without needing a transaction. Indeed - if
- * we get ENOSPC errors, we have to be able to do this truncation without a
- * transaction as there is no space left for block reservation (typically why we
- * see a ENOSPC in writeback).
- */
-static void
-xfs_discard_folio(
-	struct folio		*folio,
-	loff_t			pos)
-{
-	struct xfs_inode	*ip = XFS_I(folio->mapping->host);
-	struct xfs_mount	*mp = ip->i_mount;
-	int			error;
-
-	if (xfs_is_shutdown(mp))
-		return;
-
-	xfs_alert_ratelimited(mp,
-		"page discard on page "PTR_FMT", inode 0x%llx, pos %llu.",
-			folio, ip->i_ino, pos);
-
-	error = xfs_bmap_punch_delalloc_range(ip, pos,
-			round_up(pos, folio_size(folio)));
-
-	if (error && !xfs_is_shutdown(mp))
-		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
-}
-
 static const struct iomap_writeback_ops xfs_writeback_ops = {
 	.map_blocks		= xfs_map_blocks,
 	.prepare_ioend		= xfs_prepare_ioend,
-	.discard_folio		= xfs_discard_folio,
 };
 
 STATIC int
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0983dfc9a203..681e26a86791 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -310,12 +310,6 @@ struct iomap_writeback_ops {
 	 * conversions.
 	 */
 	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
-
-	/*
-	 * Optional, allows the file system to discard state on a page where
-	 * we failed to submit any I/O.
-	 */
-	void (*discard_folio)(struct folio *folio, loff_t pos);
 };
 
 struct iomap_writepage_ctx {
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14  5:51 ` [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it Dave Chinner
@ 2023-02-14  8:14   ` Christoph Hellwig
  2023-02-14 18:10   ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2023-02-14  8:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14  5:51 ` [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it Dave Chinner
  2023-02-14  8:14   ` Christoph Hellwig
@ 2023-02-14 18:10   ` Brian Foster
  2023-02-14 22:20     ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2023-02-14 18:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 04:51:14PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
> after writeback errors") XFS and iomap have been retaining dirty
> folios in memory after a writeback error. XFS no longer invalidates
> the folio, and iomap no longer clears the folio uptodate state.
> 
> However, iomap is still been calling ->discard_folio on error, and
> XFS is still punching the delayed allocation range backing the dirty
> folio.
> 
> This is incorrect behaviour. The folio remains dirty and up to date,
> meaning that another writeback will be attempted in the near future.
> THis means that XFS is still going to have to allocate space for it
> during writeback, and that means it still needs to have a delayed
> allocation reservation and extent backing the dirty folio.
> 

Hmm.. I don't think that is correct. It looks like the previous patch
removes the invalidation, but writeback clears the dirty bit before
calling into the fs and we're not doing anything to redirty the folio,
so there's no guarantee of subsequent writeback. As of that patch we
presumably leave around a !dirty,uptodate folio without backing storage
(due to the discard call as you've pointed out). I would hope/think the
!dirty state would mean a redirty reallocates delalloc for the folio,
but that's not immediately clear to me.

Regardless, I can see how this prevents this sort of error in the
scenario where writeback fails due to corruption, but I don't see how it
doesn't just break error handling of writeback failures not associated
with corruption. I.e., a delalloc folio is allocated/dirtied, writeback
fails due to some random/transient error, delalloc is left around on a
!dirty page (i.e. stale), and reclaim eventually comes around and
results in the usual block accounting corruption associated with stale
delalloc blocks. This is easy enough to test/reproduce (just tried it
via error injection to delalloc conversion) that I'm kind of surprised
fstests doesn't uncover it. :/

> Failure to retain the delalloc extent (because xfs_discard_folio()
> punched it out) means that the next writeback attempt does not find
> an extent over the range of the write in ->map_blocks(), and
> xfs_map_blocks() triggers a WARN_ON() because it should never land
> in a hole for a data fork writeback request. This looks like:
> 

I'm not sure this warning makes a lot of sense either given most of this
should occur around the folio lock. Looking back at the code and the
error report for this, the same error injection used above on a 5k write
to a bsize=1k fs actually shows the punch remove fsb offsets 0-5 on a
writeback failure, so it does appear to be punching too much out. The
cause appears to be that the end offset is calculated in
xfs_discard_folio() by rounding up the start offset to 4k (folio size).
If pos == 0, this results in passing end_fsb == 0 to the punch code,
which xfs_iext_lookup_extent_before() then changes to fsb == 5 because
that's the last block of the delalloc extent that covers fsb 0.

I've not reproduced the warning shown below, but I do see the side
effect of losing data at fsb 5 if the first page conversion fails. This
is silent because iomap now sees a hole and just skips the page. I
suspect the warning results from a combination of this problem and
racing writeback contexts as you've described in the commit log.

Brian

> [  647.356969] ------------[ cut here ]------------
> [  647.359277] WARNING: CPU: 14 PID: 21913 at fs/xfs/libxfs/xfs_bmap.c:4510 xfs_bmapi_convert_delalloc+0x221/0x4e0
> [  647.364551] Modules linked in:
> [  647.366294] CPU: 14 PID: 21913 Comm: test_delalloc_c Not tainted 6.2.0-rc7-dgc+ #1754
> [  647.370356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [  647.374781] RIP: 0010:xfs_bmapi_convert_delalloc+0x221/0x4e0
> [  647.377807] Code: e9 7d fe ff ff 80 bf 54 01 00 00 00 0f 84 68 fe ff ff 48 8d 47 70 48 89 04 24 e9 63 fe ff ff 83 fd 02 41 be f5 ff ff ff 74 a5 <0f> 0b eb a0
> [  647.387242] RSP: 0018:ffffc9000aa677a8 EFLAGS: 00010293
> [  647.389837] RAX: 0000000000000000 RBX: ffff88825bc4da00 RCX: 0000000000000000
> [  647.393371] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88825bc4da40
> [  647.396546] RBP: 0000000000000000 R08: ffffc9000aa67810 R09: ffffc9000aa67850
> [  647.400186] R10: ffff88825bc4da00 R11: ffff888800a9aaac R12: ffff888101707000
> [  647.403484] R13: ffffc9000aa677e0 R14: 00000000fffffff5 R15: 0000000000000004
> [  647.406251] FS:  00007ff35ec24640(0000) GS:ffff88883ed00000(0000) knlGS:0000000000000000
> [  647.410089] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  647.413225] CR2: 00007f7292cbc5d0 CR3: 0000000807d0e004 CR4: 0000000000060ee0
> [  647.416917] Call Trace:
> [  647.418080]  <TASK>
> [  647.419291]  ? _raw_spin_unlock_irqrestore+0xe/0x30
> [  647.421400]  xfs_map_blocks+0x1b7/0x590
> [  647.422951]  iomap_do_writepage+0x1f1/0x7d0
> [  647.424607]  ? __mod_lruvec_page_state+0x93/0x140
> [  647.426419]  write_cache_pages+0x17b/0x4f0
> [  647.428079]  ? iomap_read_end_io+0x2c0/0x2c0
> [  647.429839]  iomap_writepages+0x1c/0x40
> [  647.431377]  xfs_vm_writepages+0x79/0xb0
> [  647.432826]  do_writepages+0xbd/0x1a0
> [  647.434207]  ? obj_cgroup_release+0x73/0xb0
> [  647.435769]  ? drain_obj_stock+0x130/0x290
> [  647.437273]  ? avc_has_perm+0x8a/0x1a0
> [  647.438746]  ? avc_has_perm_noaudit+0x8c/0x100
> [  647.440223]  __filemap_fdatawrite_range+0x8e/0xa0
> [  647.441960]  filemap_write_and_wait_range+0x3d/0xa0
> [  647.444258]  __iomap_dio_rw+0x181/0x790
> [  647.445960]  ? __schedule+0x385/0xa20
> [  647.447829]  iomap_dio_rw+0xe/0x30
> [  647.449284]  xfs_file_dio_write_aligned+0x97/0x150
> [  647.451332]  ? selinux_file_permission+0x107/0x150
> [  647.453299]  xfs_file_write_iter+0xd2/0x120
> [  647.455238]  vfs_write+0x20d/0x3d0
> [  647.456768]  ksys_write+0x69/0xf0
> [  647.458067]  do_syscall_64+0x34/0x80
> [  647.459488]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [  647.461529] RIP: 0033:0x7ff3651406e9
> [  647.463119] Code: 48 8d 3d 2a a1 0c 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f8
> [  647.470563] RSP: 002b:00007ff35ec23df8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  647.473465] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff3651406e9
> [  647.476278] RDX: 0000000000001400 RSI: 0000000020000000 RDI: 0000000000000005
> [  647.478895] RBP: 00007ff35ec23e20 R08: 0000000000000005 R09: 0000000000000000
> [  647.481568] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe533d8d4e
> [  647.483751] R13: 00007ffe533d8d4f R14: 0000000000000000 R15: 00007ff35ec24640
> [  647.486168]  </TASK>
> [  647.487142] ---[ end trace 0000000000000000 ]---
> 
> Punching delalloc extents out from under dirty cached pages is wrong
> and broken. We can't remove the delalloc extent until the page is
> either removed from memory (i.e. invaliated) or writeback succeeds
> in converting the delalloc extent to a real extent and writeback can
> clean the page.
> 
> Hence we remove xfs_discard_folio() because it is only punching
> delalloc blocks from under dirty pages now. With that removal,
> nothing else uses ->discard_folio(), so we remove that from the
> iomap infrastructure as well.
> 
> Reported-by: pengfei.xu@intel.com
> Fixes: e9c3a8e820ed ("iomap: don't invalidate folios after writeback errors")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 16 +++-------------
>  fs/xfs/xfs_aops.c      | 35 -----------------------------------
>  include/linux/iomap.h  |  6 ------
>  3 files changed, 3 insertions(+), 54 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 356193e44cf0..502fa2d41097 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1635,19 +1635,9 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	 * completion to mark the error state of the pages under writeback
>  	 * appropriately.
>  	 */
> -	if (unlikely(error)) {
> -		/*
> -		 * Let the filesystem know what portion of the current page
> -		 * failed to map. If the page hasn't been added to ioend, it
> -		 * won't be affected by I/O completion and we must unlock it
> -		 * now.
> -		 */
> -		if (wpc->ops->discard_folio)
> -			wpc->ops->discard_folio(folio, pos);
> -		if (!count) {
> -			folio_unlock(folio);
> -			goto done;
> -		}
> +	if (unlikely(error && !count)) {
> +		folio_unlock(folio);
> +		goto done;
>  	}
>  
>  	folio_start_writeback(folio);
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 41734202796f..3f0dae5ca9c2 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -448,44 +448,9 @@ xfs_prepare_ioend(
>  	return status;
>  }
>  
> -/*
> - * If the page has delalloc blocks on it, we need to punch them out before we
> - * invalidate the page.  If we don't, we leave a stale delalloc mapping on the
> - * inode that can trip up a later direct I/O read operation on the same region.
> - *
> - * We prevent this by truncating away the delalloc regions on the page.  Because
> - * they are delalloc, we can do this without needing a transaction. Indeed - if
> - * we get ENOSPC errors, we have to be able to do this truncation without a
> - * transaction as there is no space left for block reservation (typically why we
> - * see a ENOSPC in writeback).
> - */
> -static void
> -xfs_discard_folio(
> -	struct folio		*folio,
> -	loff_t			pos)
> -{
> -	struct xfs_inode	*ip = XFS_I(folio->mapping->host);
> -	struct xfs_mount	*mp = ip->i_mount;
> -	int			error;
> -
> -	if (xfs_is_shutdown(mp))
> -		return;
> -
> -	xfs_alert_ratelimited(mp,
> -		"page discard on page "PTR_FMT", inode 0x%llx, pos %llu.",
> -			folio, ip->i_ino, pos);
> -
> -	error = xfs_bmap_punch_delalloc_range(ip, pos,
> -			round_up(pos, folio_size(folio)));
> -
> -	if (error && !xfs_is_shutdown(mp))
> -		xfs_alert(mp, "page discard unable to remove delalloc mapping.");
> -}
> -
>  static const struct iomap_writeback_ops xfs_writeback_ops = {
>  	.map_blocks		= xfs_map_blocks,
>  	.prepare_ioend		= xfs_prepare_ioend,
> -	.discard_folio		= xfs_discard_folio,
>  };
>  
>  STATIC int
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0983dfc9a203..681e26a86791 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -310,12 +310,6 @@ struct iomap_writeback_ops {
>  	 * conversions.
>  	 */
>  	int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
> -
> -	/*
> -	 * Optional, allows the file system to discard state on a page where
> -	 * we failed to submit any I/O.
> -	 */
> -	void (*discard_folio)(struct folio *folio, loff_t pos);
>  };
>  
>  struct iomap_writepage_ctx {
> -- 
> 2.39.0
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14 18:10   ` Brian Foster
@ 2023-02-14 22:20     ` Dave Chinner
  2023-02-15  1:26       ` Dave Chinner
  2023-02-15 15:25       ` Brian Foster
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-14 22:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Tue, Feb 14, 2023 at 01:10:05PM -0500, Brian Foster wrote:
> On Tue, Feb 14, 2023 at 04:51:14PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
> > after writeback errors") XFS and iomap have been retaining dirty
> > folios in memory after a writeback error. XFS no longer invalidates
> > the folio, and iomap no longer clears the folio uptodate state.
> > 
> > However, iomap is still been calling ->discard_folio on error, and
> > XFS is still punching the delayed allocation range backing the dirty
> > folio.
> > 
> > This is incorrect behaviour. The folio remains dirty and up to date,
> > meaning that another writeback will be attempted in the near future.
> > THis means that XFS is still going to have to allocate space for it
> > during writeback, and that means it still needs to have a delayed
> > allocation reservation and extent backing the dirty folio.
> > 
> 
> Hmm.. I don't think that is correct. It looks like the previous patch
> removes the invalidation, but writeback clears the dirty bit before
> calling into the fs and we're not doing anything to redirty the folio,
> so there's no guarantee of subsequent writeback.

Ah, right, I got confused with iomap_do_writepage() which redirties
folios it performs no action on. The case that is being tripped here
is "count == 0" which means no action has actually been taken on the
folio and it is not submitted for writeback. We don't mark the folio
with an error on submission failure like we do for errors reported
to IO completion, so the folio is just left in it's current state
in the cache.

> Regardless, I can see how this prevents this sort of error in the
> scenario where writeback fails due to corruption, but I don't see how it
> doesn't just break error handling of writeback failures not associated
> with corruption.

What other cases in XFS do we have that cause mapping failure? We
can't get ENOSPC here because of delalloc reservations. We can't get
ENOMEM because all the memory allocations are blocking. That just
leaves IO errors reading metadata, or structure corruption when
parsing and modifying on-disk metadata.  I can't think (off the top
of my head) of any other type of error we can get returned from
allocation - what sort of non-corruption errors were you thinking
of here?

> fails due to some random/transient error, delalloc is left around on a
> !dirty page (i.e. stale), and reclaim eventually comes around and
> results in the usual block accounting corruption associated with stale
> delalloc blocks.

The first patches in the series fix those issues. If we get stray
delalloc extents on a healthy inode, then it will still trigger all
the warnings/asserts that we have now. But if the inode has been
marked sick by a corruption based allocation failure, it will clean
up in reclaim without leaking anything or throwing any new warnings.

> This is easy enough to test/reproduce (just tried it
> via error injection to delalloc conversion) that I'm kind of surprised
> fstests doesn't uncover it. :/

> > Failure to retain the delalloc extent (because xfs_discard_folio()
> > punched it out) means that the next writeback attempt does not find
> > an extent over the range of the write in ->map_blocks(), and
> > xfs_map_blocks() triggers a WARN_ON() because it should never land
> > in a hole for a data fork writeback request. This looks like:
> > 
> 
> I'm not sure this warning makes a lot of sense either given most of this
> should occur around the folio lock. Looking back at the code and the
> error report for this, the same error injection used above on a 5k write
> to a bsize=1k fs actually shows the punch remove fsb offsets 0-5 on a
> writeback failure, so it does appear to be punching too much out.  The
> cause appears to be that the end offset is calculated in
> xfs_discard_folio() by rounding up the start offset to 4k (folio size).
> If pos == 0, this results in passing end_fsb == 0 to the punch code,
> which xfs_iext_lookup_extent_before() then changes to fsb == 5 because
> that's the last block of the delalloc extent that covers fsb 0.

And that is the bug I could not see in commit 7348b322332d ("xfs:
xfs_bmap_punch_delalloc_range() should take a byte range") which is
what this warning was bisected down to. Thank you for identifying
the reason the bisect landed on that commit. Have you written a
fix to test out you reasoning that you can post?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14 22:20     ` Dave Chinner
@ 2023-02-15  1:26       ` Dave Chinner
  2023-02-15 15:25       ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-15  1:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Wed, Feb 15, 2023 at 09:20:00AM +1100, Dave Chinner wrote:
> On Tue, Feb 14, 2023 at 01:10:05PM -0500, Brian Foster wrote:
> > On Tue, Feb 14, 2023 at 04:51:14PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
> > > after writeback errors") XFS and iomap have been retaining dirty
> > > folios in memory after a writeback error. XFS no longer invalidates
> > > the folio, and iomap no longer clears the folio uptodate state.
> > > 
> > > However, iomap is still been calling ->discard_folio on error, and
> > > XFS is still punching the delayed allocation range backing the dirty
> > > folio.
> > > 
> > > This is incorrect behaviour. The folio remains dirty and up to date,
> > > meaning that another writeback will be attempted in the near future.
> > > THis means that XFS is still going to have to allocate space for it
> > > during writeback, and that means it still needs to have a delayed
> > > allocation reservation and extent backing the dirty folio.
> > > 
> > 
> > Hmm.. I don't think that is correct. It looks like the previous patch
> > removes the invalidation, but writeback clears the dirty bit before
> > calling into the fs and we're not doing anything to redirty the folio,
> > so there's no guarantee of subsequent writeback.
> 
> Ah, right, I got confused with iomap_do_writepage() which redirties
> folios it performs no action on. The case that is being tripped here
> is "count == 0" which means no action has actually been taken on the
> folio and it is not submitted for writeback. We don't mark the folio
> with an error on submission failure like we do for errors reported
> to IO completion, so the folio is just left in it's current state
> in the cache.

OK, so after thinking on this for a little while, and then asking
the question on #xfs:

[15/2/23 09:39] <dchinner> so, if we don't start writeback on a page
on mapping failure, should we be redirtying it?

I think the direction this patchset is heading towards is the
correct direction. The discussion that followed pretty much leads to
needing to redirty the folio on any submission failure so that the
VFS infrastructure will try to write the data again in future. I've
included the full log of the discussion below so there is a record
of in the lore archives.

I also think that redirtying the page is the right thing to do when
we consider that we are going to be trying to fix corruptions
online, without users even needing to know a corruption was
encountered. In this case, we need to keep the folio dirty so that
once we've repaired the metadata corruption the user data will be
written back.

This also points out another aspect where health status should be
taken into account. When we select an AG for allocation, we should
check first that it is healthy before trying to allocate from it.
This would allow writeback to fail the first time because the AG
selected was corrupt, but on the second VFS attempt to write it back
it won't select the AG we already know is corrupt and hence may well
succeed in allocating the space needed to perform writeback.

It's these sorts of conditions that lead me to think that this
patchset is going in the right direction for XFS - we just need to
ensure that the folio we failed to submit bios for (even on mixed
folio writeback submission success/failure) is redirtied so that
future writeback attempts will be made.

Hence I think all this patchset needs is an additional patch that
adds a call to folio_redirty_for_writeback() when mapping failures
occur. We may need some additional fixes to ensure these dirty pages
are discarded at unmount if they are persistent/unrecoverable
failures, but this seems to be the right approach for the failure
handling behaviour we are trying to acheive now and into the
future...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[15/2/23 09:39] <dchinner> so, if we don't start writeback on a page on mapping failure, should we be redirtying it?
[15/2/23 09:43] <willy> i think so.  otherwise we're pretending to the pagecache that we wrote it
[15/2/23 09:54] <djwong> (this was the subject a UEK5 bug 3 months ago)
[15/2/23 09:54] <djwong> (albeit with buffer heads mixed in for insanity maximization)
[15/2/23 10:20] <dchinner> willy: ok, so what happens if we have multiple blocks per page, and we map some blocks to a bio bio before we get a mapping failure?
[15/2/23 10:20] <dchinner> we currently mark the folio and under writeback and submit the folio
[15/2/23 10:20] <dchinner> *submit the bio
[15/2/23 10:21] <dchinner> so after the IO the folio ends up clean even though there is some data on it that was not written back
[15/2/23 10:21] <willy> i think you still need to redirty it because some of it hasn't been written back
[15/2/23 10:23] <dchinner> ok, so we'd need to do teh redirtying before we set the page for writeback?
[15/2/23 10:23] <dchinner> *folio
[15/2/23 10:24] <dchinner> because folio_start_writeback() will clear the PAGECACHE_TAG_DIRTY if the folio is clean when it is moved to writeback state?
[15/2/23 10:24] <willy> i don't think so.  the folio can be both dirty and writeback at the same time, and i think you want that, because you don't want to restart the writeback until the bio you submitted has finished
[15/2/23 10:25] <dchinner> write_cache_pages() handles trying to write pages currently under writeback
[15/2/23 10:26] <dchinner> (it either waits on it or skips it depending on wbc->sync_mode)
[15/2/23 10:26] <willy> makes sense
[15/2/23 10:27] <willy> yes, you should call folio_redirty_for_writepage, no matter whether you've called folio_start_writeback() or not
[15/2/23 10:29] <dchinner> ok
[15/2/23 10:30] <dchinner> that then means we really do need to get rid of ->discard_folio, because we need to keep the delalloc mappings behind the folio so that the next attempt to write the page will still have space reserved for it
[15/2/23 10:30] <willy> I'm pretty sure I would agree with you if I understood XFS well enough to have an opinion
[15/2/23 10:31] <dchinner> heh
[15/2/23 10:38] <djwong> uhhh :)
[15/2/23 10:38] <djwong> if we're going to redirty the folios, then yes, i generally think we should leave the delalloc extents
[15/2/23 10:39] <djwong> this redirtying -- this is only for the case that getting writeback mappings to construct bios fails, right?
[15/2/23 10:39] <willy> if we _don't_ redirty the folios, then the VM thinks they're clean and will drop them under memory pressure instead of trying to write them out again
[15/2/23 10:39] <djwong> or is it for handling the bios coming back with errors set?
[15/2/23 10:39] <willy> this is submission path errors
[15/2/23 10:54] <dchinner> submission path (iomap_writepage_map())

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-14 22:20     ` Dave Chinner
  2023-02-15  1:26       ` Dave Chinner
@ 2023-02-15 15:25       ` Brian Foster
  2023-02-15 23:03         ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2023-02-15 15:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Wed, Feb 15, 2023 at 09:20:00AM +1100, Dave Chinner wrote:
> On Tue, Feb 14, 2023 at 01:10:05PM -0500, Brian Foster wrote:
> > On Tue, Feb 14, 2023 at 04:51:14PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
> > > after writeback errors") XFS and iomap have been retaining dirty
> > > folios in memory after a writeback error. XFS no longer invalidates
> > > the folio, and iomap no longer clears the folio uptodate state.
> > > 
> > > However, iomap is still been calling ->discard_folio on error, and
> > > XFS is still punching the delayed allocation range backing the dirty
> > > folio.
> > > 
> > > This is incorrect behaviour. The folio remains dirty and up to date,
> > > meaning that another writeback will be attempted in the near future.
> > > THis means that XFS is still going to have to allocate space for it
> > > during writeback, and that means it still needs to have a delayed
> > > allocation reservation and extent backing the dirty folio.
> > > 
> > 
> > Hmm.. I don't think that is correct. It looks like the previous patch
> > removes the invalidation, but writeback clears the dirty bit before
> > calling into the fs and we're not doing anything to redirty the folio,
> > so there's no guarantee of subsequent writeback.
> 
> Ah, right, I got confused with iomap_do_writepage() which redirties
> folios it performs no action on. The case that is being tripped here
> is "count == 0" which means no action has actually been taken on the
> folio and it is not submitted for writeback. We don't mark the folio
> with an error on submission failure like we do for errors reported
> to IO completion, so the folio is just left in it's current state
> in the cache.
> 
> > Regardless, I can see how this prevents this sort of error in the
> > scenario where writeback fails due to corruption, but I don't see how it
> > doesn't just break error handling of writeback failures not associated
> > with corruption.
> 
> What other cases in XFS do we have that cause mapping failure? We
> can't get ENOSPC here because of delalloc reservations. We can't get
> ENOMEM because all the memory allocations are blocking. That just
> leaves IO errors reading metadata, or structure corruption when
> parsing and modifying on-disk metadata.  I can't think (off the top
> of my head) of any other type of error we can get returned from
> allocation - what sort of non-corruption errors were you thinking
> of here?
> 
> > fails due to some random/transient error, delalloc is left around on a
> > !dirty page (i.e. stale), and reclaim eventually comes around and
> > results in the usual block accounting corruption associated with stale
> > delalloc blocks.
> 
> The first patches in the series fix those issues. If we get stray
> delalloc extents on a healthy inode, then it will still trigger all
> the warnings/asserts that we have now. But if the inode has been
> marked sick by a corruption based allocation failure, it will clean
> up in reclaim without leaking anything or throwing any new warnings.
> 

Those warnings/asserts that exist now indicate something is wrong and
that free space accounting is likely about to become corrupted, because
an otherwise clean inode is being reclaimed with stale delalloc blocks.

I see there's an error injection knob (XFS_ERRTAG_REDUCE_MAX_IEXTENTS)
tied to the max extent count checking stuff in the delalloc conversion
path. You should be able to add some (10+) extents to a file and then
turn that thing all the way up to induce a (delalloc conversion)
writeback failure and see exactly what I'm talking about [1].

Brian

[1] The following occurs with this patch, but not on mainline because the
purpose of ->discard_folio() is to prevent it.

(/mnt/file has 10+ preexisting extents beyond the 0-5k range)

# echo 1 > /sys/fs/xfs/vdb1/errortag/reduce_max_iextents
# xfs_io -fc "pwrite 0 5k" -c fsync /mnt/file
wrote 5120/5120 bytes at offset 0
5 KiB, 5 ops; 0.0000 sec (52.503 MiB/sec and 53763.4409 ops/sec)
fsync: File too large
# umount /mnt/
#
Message from syslogd@localhost at Feb 15 09:47:41 ...                                                                                                           kernel:XFS: Assertion failed: 0, file: fs/xfs/xfs_icache.c, line: 1818

Message from syslogd@localhost at Feb 15 09:47:41 ...
 kernel:XFS: Assertion failed: xfs_is_shutdown(mp) || percpu_counter_sum(&mp->m_delalloc_blks) == 0, file: fs/xfs/xfs_super.c, line: 1068
#
# xfs_repair -n /dev/vdb1 
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 20960174, counted 20960186
...

> > This is easy enough to test/reproduce (just tried it
> > via error injection to delalloc conversion) that I'm kind of surprised
> > fstests doesn't uncover it. :/
> 
> > > Failure to retain the delalloc extent (because xfs_discard_folio()
> > > punched it out) means that the next writeback attempt does not find
> > > an extent over the range of the write in ->map_blocks(), and
> > > xfs_map_blocks() triggers a WARN_ON() because it should never land
> > > in a hole for a data fork writeback request. This looks like:
> > > 
> > 
> > I'm not sure this warning makes a lot of sense either given most of this
> > should occur around the folio lock. Looking back at the code and the
> > error report for this, the same error injection used above on a 5k write
> > to a bsize=1k fs actually shows the punch remove fsb offsets 0-5 on a
> > writeback failure, so it does appear to be punching too much out.  The
> > cause appears to be that the end offset is calculated in
> > xfs_discard_folio() by rounding up the start offset to 4k (folio size).
> > If pos == 0, this results in passing end_fsb == 0 to the punch code,
> > which xfs_iext_lookup_extent_before() then changes to fsb == 5 because
> > that's the last block of the delalloc extent that covers fsb 0.
> 
> And that is the bug I could not see in commit 7348b322332d ("xfs:
> xfs_bmap_punch_delalloc_range() should take a byte range") which is
> what this warning was bisected down to. Thank you for identifying
> the reason the bisect landed on that commit. Have you written a
> fix to test out you reasoning that you can post?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it
  2023-02-15 15:25       ` Brian Foster
@ 2023-02-15 23:03         ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2023-02-15 23:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, linux-fsdevel

On Wed, Feb 15, 2023 at 10:25:43AM -0500, Brian Foster wrote:
> On Wed, Feb 15, 2023 at 09:20:00AM +1100, Dave Chinner wrote:
> > On Tue, Feb 14, 2023 at 01:10:05PM -0500, Brian Foster wrote:
> > > On Tue, Feb 14, 2023 at 04:51:14PM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Ever since commit e9c3a8e820ed ("iomap: don't invalidate folios
> > > > after writeback errors") XFS and iomap have been retaining dirty
> > > > folios in memory after a writeback error. XFS no longer invalidates
> > > > the folio, and iomap no longer clears the folio uptodate state.
> > > > 
> > > > However, iomap is still been calling ->discard_folio on error, and
> > > > XFS is still punching the delayed allocation range backing the dirty
> > > > folio.
> > > > 
> > > > This is incorrect behaviour. The folio remains dirty and up to date,
> > > > meaning that another writeback will be attempted in the near future.
> > > > THis means that XFS is still going to have to allocate space for it
> > > > during writeback, and that means it still needs to have a delayed
> > > > allocation reservation and extent backing the dirty folio.
> > > > 
> > > 
> > > Hmm.. I don't think that is correct. It looks like the previous patch
> > > removes the invalidation, but writeback clears the dirty bit before
> > > calling into the fs and we're not doing anything to redirty the folio,
> > > so there's no guarantee of subsequent writeback.
> > 
> > Ah, right, I got confused with iomap_do_writepage() which redirties
> > folios it performs no action on. The case that is being tripped here
> > is "count == 0" which means no action has actually been taken on the
> > folio and it is not submitted for writeback. We don't mark the folio
> > with an error on submission failure like we do for errors reported
> > to IO completion, so the folio is just left in it's current state
> > in the cache.
> > 
> > > Regardless, I can see how this prevents this sort of error in the
> > > scenario where writeback fails due to corruption, but I don't see how it
> > > doesn't just break error handling of writeback failures not associated
> > > with corruption.
> > 
> > What other cases in XFS do we have that cause mapping failure? We
> > can't get ENOSPC here because of delalloc reservations. We can't get
> > ENOMEM because all the memory allocations are blocking. That just
> > leaves IO errors reading metadata, or structure corruption when
> > parsing and modifying on-disk metadata.  I can't think (off the top
> > of my head) of any other type of error we can get returned from
> > allocation - what sort of non-corruption errors were you thinking
> > of here?
> > 
> > > fails due to some random/transient error, delalloc is left around on a
> > > !dirty page (i.e. stale), and reclaim eventually comes around and
> > > results in the usual block accounting corruption associated with stale
> > > delalloc blocks.
> > 
> > The first patches in the series fix those issues. If we get stray
> > delalloc extents on a healthy inode, then it will still trigger all
> > the warnings/asserts that we have now. But if the inode has been
> > marked sick by a corruption based allocation failure, it will clean
> > up in reclaim without leaking anything or throwing any new warnings.
> > 
> 
> Those warnings/asserts that exist now indicate something is wrong and
> that free space accounting is likely about to become corrupted, because
> an otherwise clean inode is being reclaimed with stale delalloc blocks.

Well, yes.

> I see there's an error injection knob (XFS_ERRTAG_REDUCE_MAX_IEXTENTS)
> tied to the max extent count checking stuff in the delalloc conversion
> path. You should be able to add some (10+) extents to a file and then
> turn that thing all the way up to induce a (delalloc conversion)
> writeback failure and see exactly what I'm talking about [1].
> 
> Brian
> 
> [1] The following occurs with this patch, but not on mainline because the
> purpose of ->discard_folio() is to prevent it.

A non-corruption related writeback error has resulted in those debug
checks triggering correctly. This demonstrates the debug checks are
still working as intended. :)

Hence this isn't an argument against removing ->discard_folio(), this is
merely a demonstration that the current patch series needs more work.

Indeed, if the folio gets redirtied here instead of left clean as
we've already talked about, a future writeback may, in fact, succeed
and this specific problem goes away. We know how this retry
mechanism works - it's exactly what we do with metadata write
failures. Further, changing the behaviour of failure handling here
is exactly what we have the configurable error handling
infrastructure for. It's also why the "fail on unmount"
functionality exists, too.

That is, if we get to the point that "fail on unmount" triggers for
metadata we cannot write back due to persistent errors, we should
also perform the same trigger for data we cannot write back due to
persistent writeback allocation failures. In which case, any
allocation error should mark the inode sick and the unconverted
delalloc extents get cleaned up correctly by the final inode reclaim
pass.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-02-15 23:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-14  5:51 [PATCH 0/3] xfs, iomap: fix writeback failure handling Dave Chinner
2023-02-14  5:51 ` [PATCH 1/3] xfs: report block map corruption errors to the health tracking system Dave Chinner
2023-02-14  8:03   ` Christoph Hellwig
2023-02-14 22:21     ` Dave Chinner
2023-02-14  5:51 ` [PATCH 2/3] xfs: failed delalloc conversion results in bad extent lists Dave Chinner
2023-02-14  8:13   ` Christoph Hellwig
2023-02-14 22:26     ` Dave Chinner
2023-02-14  5:51 ` [PATCH 3/3] xfs, iomap: ->discard_folio() is broken so remove it Dave Chinner
2023-02-14  8:14   ` Christoph Hellwig
2023-02-14 18:10   ` Brian Foster
2023-02-14 22:20     ` Dave Chinner
2023-02-15  1:26       ` Dave Chinner
2023-02-15 15:25       ` Brian Foster
2023-02-15 23:03         ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).