[PATCH 0/8 v3] xfs: various fixes for 6.5

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/8 v3] xfs: various fixes for 6.5
@ 2023-06-27 22:44 Dave Chinner
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is an update of the fixes patchset I sent here:

https://lore.kernel.org/linux-xfs/20230620002021.1038067-1-david@fromorbit.com/

There are new patches in the series - patch two is a new patch to
fix a potential issue in the non-blocking busy extent flush code
that Chandan noticed where btree block freeing could potentially
trip over busy extents and return -EAGAIN that isn't handled.
Patches 7 and 8 are new patches to this series; they are outstanding
standalone bug fixes that need review, so I've included them here,
too.

Original cover letter (with patch numbers updated) follows.

-Dave.

--

This is a wrap up of version 3 of all the fixes I have recently
pushed out for review.

The first patch fixes a AIL ordering problem identified when testing
patches 3-5 in this series. This patch only addresses the AIL ordering
problem that was found, it does not address any other corner cases
in intent recovery that may be exposed by patches 3-5.

Patches 3-5 allow partial handling of multi-extent EFIs during
runtime operation and during journal recovery. This is needed as
we attempt to process all the extents in the EFI in a single
transaction and we can deadlock during AGFL fixup if the transaction
context holds the only free space in the AG in a busy extent.

This patchset uncovered a problem where log recovery would run off
the end of the list of intents to recover and into intents that
require deferred processing. This was caused by the ordering issue
fixed in patch 1.

This patchset does not attempt to address the end of intent recovery
detection issues - this raises other issues with the intent recovery
beyond loop termination. Solving those issues requires more thought,
and the problem can largely be avoided by the first patch in the
series. As it is, CUI recovery has been vulnerable to these intent
recovery loop termination issues for several years but we don't have
any evidence that systems have tripped over this so the urgency to
fix the loop termination fully isn't as high as fixing the AIL bulk
insertion ordering problem that exposed it.

Finally, patch 6 addresses journal geometry validation issues. It
makes all geometry mismatches hard failures for V4 and V5
filesystems, except for the log size being too small on V4
filesystems. This addresses the problem on V4 filesystems where we'd
fail to perform ithe remaining validation on the geometry once we'd
detected that the log was too small or too large.

This all passed fstests on v4 and v5 filesystems without
regressions.

---
Version 3:
- patch 2
  - new patch to defer block freeing for inobt and refcountbt
    blocks. This is to close a problem Chandan found during review
    of "xfs: don't block in busy flushing when freeing extents" in
    the V2 series.
- patch 7
  - pulled in AGF length bounds chekcing fixes patch.
  - rearranged slightly for better error discrimination
  - comments added
  - minor syntax and comment fixes
- patch 8
  - new bug fix for a memory leak regression discovered by Coverity
    during xfsprogs scan.

Version 2:
- patch 1
  - rewritten commit message
- patch 2
  - uint32_t alloc_flag conversion pushed all the way down into
    xfs_extent_busy_flush
- patch 3
  - Updated comment for xfs_efd_from_efi()
  - fixed whitespace damage
  - check error return from xfs_free_extent_later()
- patch 5
  - update error message for max LSU size check
  - fix whitespace damage
  - clean up error handling in xfs_log_mount() changes

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28  6:03   ` Christoph Hellwig
                     ` (2 more replies)
  2023-06-27 22:44 ` [PATCH 2/8] xfs: use deferred frees for btree block freeing Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

XFS has strict metadata ordering requirements. One of the things it
does is maintain the commit order of items from transaction commit
through the CIL and into the AIL. That is, if a transaction logs
item A before item B in a modification, then they will be inserted
into the CIL in the order {A, B}. These items are then written into
the iclog during checkpointing in the order {A, B}. When the
checkpoint commits, they are supposed to be inserted into the AIL in
the order {A, B}, and when they are pushed from the AIL, they are
pushed in the order {A, B}.

If we crash, log recovery then replays the two items from the
checkpoint in the order {A, B}, resulting in the objects the items
apply to being queued for writeback at the end of the checkpoint
in the order {A, B}. This means recovery behaves the same way as the
runtime code.

In places, we have subtle dependencies on this ordering being
maintained. One of this place is performing intent recovery from the
log. It assumes that recovering an intent will result in a
non-intent object being the first thing that is modified in the
recovery transaction, and so when the transaction commits and the
journal flushes, the first object inserted into the AIL beyond the
intent recovery range will be a non-intent item.  It uses the
transistion from intent items to non-intent items to stop the
recovery pass.

A recent log recovery issue indicated that an intent was appearing
as the first item in the AIL beyond the recovery range, hence
breaking the end of recovery detection that exists.

Tracing indicated insertion of the items into the AIL was apparently
occurring in the right order (the intent was last in the commit item
list), but the intent was appearing first in the AIL. IOWs, the
order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
insertion was reversing the order of the items in the batch of items
being inserted.

Lucky for us, all the items fed to bulk insertion have the same LSN,
so the reversal of order does not affect the log head/tail tracking
that is based on the contents of the AIL. It only impacts on code
that has implicit, subtle dependencies on object order, and AFAICT
only the intent recovery loop is impacted by it.

Make sure bulk AIL insertion does not reorder items incorrectly.

Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..1098452e7f95 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
 			trace_xfs_ail_insert(lip, 0, lsn);
 		}
 		lip->li_lsn = lsn;
-		list_add(&lip->li_ail, &tmp);
+		list_add_tail(&lip->li_ail, &tmp);
 	}

 	if (!list_empty(&tmp))
-- 
2.40.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/8] xfs: use deferred frees for btree block freeing
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28 17:46   ` Darrick J. Wong
  2023-06-29  7:52   ` Chandan Babu R
  2023-06-27 22:44 ` [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Btrees that aren't freespace management trees use the normal extent
allocation and freeing routines for their blocks. Hence when a btree
block is freed, a direct call to xfs_free_extent() is made and the
extent is immediately freed. This puts the entire free space
management btrees under this path, so we are stacking btrees on
btrees in the call stack. The inobt, finobt and refcount btrees
all do this.

However, the bmap btree does not do this - it calls
xfs_free_extent_later() to defer the extent free operation via an
XEFI and hence it gets processed in deferred operation processing
during the commit of the primary transaction (i.e. via intent
chaining).

We need to change xfs_free_extent() to behave in a non-blocking
manner so that we can avoid deadlocks with busy extents near ENOSPC
in transactions that free multiple extents. Inserting or removing a
record from a btree can cause a multi-level tree merge operation and
that will free multiple blocks from the btree in a single
transaction. i.e. we can call xfs_free_extent() multiple times, and
hence the btree manipulation transaction is vulnerable to this busy
extent deadlock vector.

To fix this, convert all the remaining callers of xfs_free_extent()
to use xfs_free_extent_later() to queue XEFIs and hence defer
processing of the extent frees to a context that can be safely
restarted if a deadlock condition is detected.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ag.c             | 2 +-
 fs/xfs/libxfs/xfs_alloc.c          | 4 ++++
 fs/xfs/libxfs/xfs_alloc.h          | 8 +++++---
 fs/xfs/libxfs/xfs_bmap.c           | 8 +++++---
 fs/xfs/libxfs/xfs_bmap_btree.c     | 3 ++-
 fs/xfs/libxfs/xfs_ialloc.c         | 8 ++++----
 fs/xfs/libxfs/xfs_ialloc_btree.c   | 3 +--
 fs/xfs/libxfs/xfs_refcount.c       | 9 ++++++---
 fs/xfs/libxfs/xfs_refcount_btree.c | 8 +-------
 fs/xfs/xfs_extfree_item.c          | 3 ++-
 fs/xfs/xfs_reflink.c               | 3 ++-
 11 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index ee84835ebc66..e9cc481b4ddf 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -985,7 +985,7 @@ xfs_ag_shrink_space(
 			goto resv_err;
 
 		err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
-				true);
+				XFS_AG_RESV_NONE, true);
 		if (err2)
 			goto resv_err;
 
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c20fe99405d8..cc3f7b905ea1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2449,6 +2449,7 @@ xfs_defer_agfl_block(
 	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
 	xefi->xefi_blockcount = 1;
 	xefi->xefi_owner = oinfo->oi_owner;
+	xefi->xefi_type = XFS_AG_RESV_AGFL;
 
 	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
 		return -EFSCORRUPTED;
@@ -2470,6 +2471,7 @@ __xfs_free_extent_later(
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
 	const struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type		type,
 	bool				skip_discard)
 {
 	struct xfs_extent_free_item	*xefi;
@@ -2490,6 +2492,7 @@ __xfs_free_extent_later(
 	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
 #endif
 	ASSERT(xfs_extfree_item_cache != NULL);
+	ASSERT(type != XFS_AG_RESV_AGFL);
 
 	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
 		return -EFSCORRUPTED;
@@ -2498,6 +2501,7 @@ __xfs_free_extent_later(
 			       GFP_KERNEL | __GFP_NOFAIL);
 	xefi->xefi_startblock = bno;
 	xefi->xefi_blockcount = (xfs_extlen_t)len;
+	xefi->xefi_type = type;
 	if (skip_discard)
 		xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
 	if (oinfo) {
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 85ac470be0da..121faf1e11ad 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -232,7 +232,7 @@ xfs_buf_to_agfl_bno(
 
 int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
 		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
-		bool skip_discard);
+		enum xfs_ag_resv_type type, bool skip_discard);
 
 /*
  * List of extents to be free "later".
@@ -245,6 +245,7 @@ struct xfs_extent_free_item {
 	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
 	struct xfs_perag	*xefi_pag;
 	unsigned int		xefi_flags;
+	enum xfs_ag_resv_type	xefi_type;
 };
 
 void xfs_extent_free_get_group(struct xfs_mount *mp,
@@ -259,9 +260,10 @@ xfs_free_extent_later(
 	struct xfs_trans		*tp,
 	xfs_fsblock_t			bno,
 	xfs_filblks_t			len,
-	const struct xfs_owner_info	*oinfo)
+	const struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type		type)
 {
-	return __xfs_free_extent_later(tp, bno, len, oinfo, false);
+	return __xfs_free_extent_later(tp, bno, len, oinfo, type, false);
 }
 
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index fef35696adb7..30c931b38853 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -574,7 +574,8 @@ xfs_bmap_btree_to_extents(
 		return error;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
-	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo);
+	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo,
+			XFS_AG_RESV_NONE);
 	if (error)
 		return error;
 
@@ -5236,8 +5237,9 @@ xfs_bmap_del_extent_real(
 		} else {
 			error = __xfs_free_extent_later(tp, del->br_startblock,
 					del->br_blockcount, NULL,
-					(bflags & XFS_BMAPI_NODISCARD) ||
-					del->br_state == XFS_EXT_UNWRITTEN);
+					XFS_AG_RESV_NONE,
+					((bflags & XFS_BMAPI_NODISCARD) ||
+					del->br_state == XFS_EXT_UNWRITTEN));
 			if (error)
 				goto done;
 		}
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 36564ae3084f..bf3f1b36fdd2 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -271,7 +271,8 @@ xfs_bmbt_free_block(
 	int			error;
 
 	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork);
-	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo);
+	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo,
+			XFS_AG_RESV_NONE);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 34600f94c2f4..1e5fafbc0cdb 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -1853,8 +1853,8 @@ xfs_difree_inode_chunk(
 		/* not sparse, calculate extent info directly */
 		return xfs_free_extent_later(tp,
 				XFS_AGB_TO_FSB(mp, agno, sagbno),
-				M_IGEO(mp)->ialloc_blks,
-				&XFS_RMAP_OINFO_INODES);
+				M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES,
+				XFS_AG_RESV_NONE);
 	}
 
 	/* holemask is only 16-bits (fits in an unsigned long) */
@@ -1899,8 +1899,8 @@ xfs_difree_inode_chunk(
 		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
 		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
 		error = xfs_free_extent_later(tp,
-				XFS_AGB_TO_FSB(mp, agno, agbno),
-				contigblk, &XFS_RMAP_OINFO_INODES);
+				XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
+				&XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE);
 		if (error)
 			return error;
 
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 5a945ae21b5d..9258f01c0015 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -160,8 +160,7 @@ __xfs_inobt_free_block(
 
 	xfs_inobt_mod_blockcount(cur, -1);
 	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
-	return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
+	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
 			&XFS_RMAP_OINFO_INOBT, resv);
 }
 
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index b6e21433925c..70ab113c9cea 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -1152,7 +1152,8 @@ xfs_refcount_adjust_extents(
 						cur->bc_ag.pag->pag_agno,
 						tmp.rc_startblock);
 				error = xfs_free_extent_later(cur->bc_tp, fsbno,
-						  tmp.rc_blockcount, NULL);
+						  tmp.rc_blockcount, NULL,
+						  XFS_AG_RESV_NONE);
 				if (error)
 					goto out_error;
 			}
@@ -1213,7 +1214,8 @@ xfs_refcount_adjust_extents(
 					cur->bc_ag.pag->pag_agno,
 					ext.rc_startblock);
 			error = xfs_free_extent_later(cur->bc_tp, fsbno,
-					ext.rc_blockcount, NULL);
+					ext.rc_blockcount, NULL,
+					XFS_AG_RESV_NONE);
 			if (error)
 				goto out_error;
 		}
@@ -1981,7 +1983,8 @@ xfs_refcount_recover_cow_leftovers(
 
 		/* Free the block. */
 		error = xfs_free_extent_later(tp, fsb,
-				rr->rr_rrec.rc_blockcount, NULL);
+				rr->rr_rrec.rc_blockcount, NULL,
+				XFS_AG_RESV_NONE);
 		if (error)
 			goto out_trans;
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index d4afc5f4e6a5..5c3987d8dc24 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -106,19 +106,13 @@ xfs_refcountbt_free_block(
 	struct xfs_buf		*agbp = cur->bc_ag.agbp;
 	struct xfs_agf		*agf = agbp->b_addr;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
-	int			error;
 
 	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
 			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
 	be32_add_cpu(&agf->agf_refcount_blocks, -1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
-	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
-			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
+	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
 			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
-	if (error)
-		return error;
-
-	return error;
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..79e65bb6b0a2 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -365,7 +365,7 @@ xfs_trans_free_extent(
 			agbno, xefi->xefi_blockcount);
 
 	error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
-			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
+			xefi->xefi_blockcount, &oinfo, xefi->xefi_type,
 			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
 
 	/*
@@ -644,6 +644,7 @@ xfs_efi_item_recover(
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
 		struct xfs_extent_free_item	fake = {
 			.xefi_owner		= XFS_RMAP_OWN_UNKNOWN,
+			.xefi_type		= XFS_AG_RESV_NONE,
 		};
 		struct xfs_extent		*extp;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index abcc559f3c64..eb9102453aff 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -617,7 +617,8 @@ xfs_reflink_cancel_cow_blocks(
 					del.br_blockcount);
 
 			error = xfs_free_extent_later(*tpp, del.br_startblock,
-					  del.br_blockcount, NULL);
+					del.br_blockcount, NULL,
+					XFS_AG_RESV_NONE);
 			if (error)
 				break;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
  2023-06-27 22:44 ` [PATCH 2/8] xfs: use deferred frees for btree block freeing Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-29  9:44   ` Chandan Babu R
  2023-06-27 22:44 ` [PATCH 4/8] xfs: allow extent free intents to be retried Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To avoid blocking in xfs_extent_busy_flush() when freeing extents
and the only busy extents are held by the current transaction, we
need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
into xfs_extent_busy_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_alloc.h |  2 +-
 fs/xfs/xfs_extent_busy.c  |  3 +-
 fs/xfs/xfs_extent_busy.h  |  2 +-
 4 files changed, 56 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index cc3f7b905ea1..ba929f6b5c9b 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
  */
 STATIC int
 xfs_alloc_ag_vextent_near(
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
 	struct xfs_alloc_cur	acur = {};
 	int			error;		/* error code */
@@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
 		if (acur.busy) {
 			trace_xfs_alloc_near_busy(args);
 			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen);
+					      acur.busy_gen, alloc_flags);
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
  * and of the form k * prod + mod unless there's nothing that large.
  * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
  */
-STATIC int				/* error */
+static int
 xfs_alloc_ag_vextent_size(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
+	struct xfs_alloc_arg	*args,
+	uint32_t		alloc_flags)
 {
-	struct xfs_agf	*agf = args->agbp->b_addr;
-	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
-	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
-	int		error;		/* error result */
-	xfs_agblock_t	fbno;		/* start of found freespace */
-	xfs_extlen_t	flen;		/* length of found freespace */
-	int		i;		/* temp status variable */
-	xfs_agblock_t	rbno;		/* returned block number */
-	xfs_extlen_t	rlen;		/* length of returned extent */
-	bool		busy;
-	unsigned	busy_gen;
+	struct xfs_agf		*agf = args->agbp->b_addr;
+	struct xfs_btree_cur	*bno_cur;
+	struct xfs_btree_cur	*cnt_cur;
+	xfs_agblock_t		fbno;		/* start of found freespace */
+	xfs_extlen_t		flen;		/* length of found freespace */
+	xfs_agblock_t		rbno;		/* returned block number */
+	xfs_extlen_t		rlen;		/* length of returned extent */
+	bool			busy;
+	unsigned		busy_gen;
+	int			error;
+	int			i;
 
 restart:
 	/*
@@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp,
-							args->pag, busy_gen);
+				xfs_extent_busy_flush(args->mp, args->pag,
+						busy_gen, alloc_flags);
 				goto restart;
 			}
 		}
@@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
 		if (busy) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
+			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
+					alloc_flags);
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2572,7 +2575,7 @@ xfs_exact_minlen_extent_available(
 int			/* error */
 xfs_alloc_fix_freelist(
 	struct xfs_alloc_arg	*args,	/* allocation argument structure */
-	int			flags)	/* XFS_ALLOC_FLAG_... */
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*pag = args->pag;
@@ -2588,7 +2591,7 @@ xfs_alloc_fix_freelist(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 
 	if (!xfs_perag_initialised_agf(pag)) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2604,13 +2607,13 @@ xfs_alloc_fix_freelist(
 	 */
 	if (xfs_perag_prefers_metadata(pag) &&
 	    (args->datatype & XFS_ALLOC_USERDATA) &&
-	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
-		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
+	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
+		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
 		goto out_agbp_relse;
 	}
 
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags |
+	if (!xfs_alloc_space_available(args, need, alloc_flags |
 			XFS_ALLOC_FLAG_CHECK))
 		goto out_agbp_relse;
 
@@ -2619,7 +2622,7 @@ xfs_alloc_fix_freelist(
 	 * Can fail if we're not blocking on locks, and it's held.
 	 */
 	if (!agbp) {
-		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
+		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
 		if (error) {
 			/* Couldn't lock the AGF so skip this AG. */
 			if (error == -EAGAIN)
@@ -2634,7 +2637,7 @@ xfs_alloc_fix_freelist(
 
 	/* If there isn't enough total space or single-extent, reject it. */
 	need = xfs_alloc_min_freelist(mp, pag);
-	if (!xfs_alloc_space_available(args, need, flags))
+	if (!xfs_alloc_space_available(args, need, alloc_flags))
 		goto out_agbp_relse;
 
 #ifdef DEBUG
@@ -2672,11 +2675,12 @@ xfs_alloc_fix_freelist(
 	 */
 	memset(&targs, 0, sizeof(targs));
 	/* struct copy below */
-	if (flags & XFS_ALLOC_FLAG_NORMAP)
+	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
 		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	else
 		targs.oinfo = XFS_RMAP_OINFO_AG;
-	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
+	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
+			pag->pagf_flcount > need) {
 		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
 		if (error)
 			goto out_agbp_relse;
@@ -2704,7 +2708,7 @@ xfs_alloc_fix_freelist(
 		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
-		error = xfs_alloc_ag_vextent_size(&targs);
+		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
 		if (error)
 			goto out_agflbp_relse;
 
@@ -2714,7 +2718,7 @@ xfs_alloc_fix_freelist(
 		 * on a completely full ag.
 		 */
 		if (targs.agbno == NULLAGBLOCK) {
-			if (flags & XFS_ALLOC_FLAG_FREEING)
+			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
 				break;
 			goto out_agflbp_relse;
 		}
@@ -3230,7 +3234,7 @@ xfs_alloc_vextent_check_args(
 static int
 xfs_alloc_vextent_prepare_ag(
 	struct xfs_alloc_arg	*args,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	bool			need_pag = !args->pag;
 	int			error;
@@ -3239,7 +3243,7 @@ xfs_alloc_vextent_prepare_ag(
 		args->pag = xfs_perag_get(args->mp, args->agno);
 
 	args->agbp = NULL;
-	error = xfs_alloc_fix_freelist(args, flags);
+	error = xfs_alloc_fix_freelist(args, alloc_flags);
 	if (error) {
 		trace_xfs_alloc_vextent_nofix(args);
 		if (need_pag)
@@ -3361,6 +3365,7 @@ xfs_alloc_vextent_this_ag(
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	ASSERT(args->pag != NULL);
@@ -3379,9 +3384,9 @@ xfs_alloc_vextent_this_ag(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_size(args);
+		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
 }
@@ -3410,20 +3415,20 @@ xfs_alloc_vextent_iterate_ags(
 	xfs_agnumber_t		minimum_agno,
 	xfs_agnumber_t		start_agno,
 	xfs_agblock_t		target_agbno,
-	uint32_t		flags)
+	uint32_t		alloc_flags)
 {
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		restart_agno = minimum_agno;
 	xfs_agnumber_t		agno;
 	int			error = 0;
 
-	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
 		restart_agno = 0;
 restart:
 	for_each_perag_wrap_range(mp, start_agno, restart_agno,
 			mp->m_sb.sb_agcount, agno, args->pag) {
 		args->agno = agno;
-		error = xfs_alloc_vextent_prepare_ag(args, flags);
+		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 		if (error)
 			break;
 		if (!args->agbp) {
@@ -3437,10 +3442,10 @@ xfs_alloc_vextent_iterate_ags(
 		 */
 		if (args->agno == start_agno && target_agbno) {
 			args->agbno = target_agbno;
-			error = xfs_alloc_ag_vextent_near(args);
+			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 		} else {
 			args->agbno = 0;
-			error = xfs_alloc_ag_vextent_size(args);
+			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
 		}
 		break;
 	}
@@ -3457,8 +3462,8 @@ xfs_alloc_vextent_iterate_ags(
 	 * constraining flags by the caller, drop them and retry the allocation
 	 * without any constraints being set.
 	 */
-	if (flags) {
-		flags = 0;
+	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
+		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
 		restart_agno = minimum_agno;
 		goto restart;
 	}
@@ -3486,6 +3491,7 @@ xfs_alloc_vextent_start_ag(
 	xfs_agnumber_t		start_agno;
 	xfs_agnumber_t		rotorstep = xfs_rotorstep;
 	bool			bump_rotor = false;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3512,7 +3518,7 @@ xfs_alloc_vextent_start_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 
 	if (bump_rotor) {
 		if (args->agno == start_agno)
@@ -3539,6 +3545,7 @@ xfs_alloc_vextent_first_ag(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	xfs_agnumber_t		start_agno;
+	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
 	int			error;
 
 	ASSERT(args->pag == NULL);
@@ -3557,7 +3564,7 @@ xfs_alloc_vextent_first_ag(
 
 	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
 	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
-			XFS_FSB_TO_AGBNO(mp, target), 0);
+			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
 }
 
@@ -3610,6 +3617,7 @@ xfs_alloc_vextent_near_bno(
 	struct xfs_mount	*mp = args->mp;
 	xfs_agnumber_t		minimum_agno;
 	bool			needs_perag = args->pag == NULL;
+	uint32_t		alloc_flags = 0;
 	int			error;
 
 	if (!needs_perag)
@@ -3630,9 +3638,9 @@ xfs_alloc_vextent_near_bno(
 	if (needs_perag)
 		args->pag = xfs_perag_grab(mp, args->agno);
 
-	error = xfs_alloc_vextent_prepare_ag(args, 0);
+	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
 	if (!error && args->agbp)
-		error = xfs_alloc_ag_vextent_near(args);
+		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
 
 	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
 }
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 121faf1e11ad..50119ebaede9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
 		struct xfs_buf **bpp);
 int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
 			struct xfs_buf *, struct xfs_owner_info *);
-int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
+int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
 int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
 		struct xfs_buf **agbp);
 
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index f3d328e4a440..5f44936eae1c 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -571,7 +571,8 @@ void
 xfs_extent_busy_flush(
 	struct xfs_mount	*mp,
 	struct xfs_perag	*pag,
-	unsigned		busy_gen)
+	unsigned		busy_gen,
+	uint32_t		alloc_flags)
 {
 	DEFINE_WAIT		(wait);
 	int			error;
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 4a118131059f..7a82c689bfa4 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 
 void
 xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen);
+	unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/8] xfs: allow extent free intents to be retried
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (2 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28 17:48   ` Darrick J. Wong
  2023-06-29  9:50   ` Chandan Babu R
  2023-06-27 22:44 ` [PATCH 5/8] xfs: don't block in busy flushing when freeing extents Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Extent freeing neeeds to be able to avoid a busy extent deadlock
when the transaction itself holds the only busy extents in the
allocation group. This may occur if we have an EFI that contains
multiple extents to be freed, and the freeing the second intent
requires the space the first extent free released to expand the
AGFL. If we block on the busy extent at this point, we deadlock.

We hold a dirty transaction that contains a entire atomic extent
free operations within it, so if we can abort the extent free
operation and commit the progress that we've made, the busy extent
can be resolved by a log force. Hence we can restart the aborted
extent free with a new transaction and continue to make
progress without risking deadlocks.

To enable this, we need the EFI processing code to be able to handle
an -EAGAIN error to tell it to commit the current transaction and
retry again. This mechanism is already built into the defer ops
processing (used bythe refcount btree modification intents), so
there's relatively little handling we need to add to the EFI code to
enable this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_extfree_item.c | 73 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 79e65bb6b0a2..098420cbd4a0 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -336,6 +336,34 @@ xfs_trans_get_efd(
 	return efdp;
 }
 
+/*
+ * Fill the EFD with all extents from the EFI when we need to roll the
+ * transaction and continue with a new EFI.
+ *
+ * This simply copies all the extents in the EFI to the EFD rather than make
+ * assumptions about which extents in the EFI have already been processed. We
+ * currently keep the xefi list in the same order as the EFI extent list, but
+ * that may not always be the case. Copying everything avoids leaving a landmine
+ * were we fail to cancel all the extents in an EFI if the xefi list is
+ * processed in a different order to the extents in the EFI.
+ */
+static void
+xfs_efd_from_efi(
+	struct xfs_efd_log_item	*efdp)
+{
+	struct xfs_efi_log_item *efip = efdp->efd_efip;
+	uint                    i;
+
+	ASSERT(efip->efi_format.efi_nextents > 0);
+	ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents);
+
+	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
+	       efdp->efd_format.efd_extents[i] =
+		       efip->efi_format.efi_extents[i];
+	}
+	efdp->efd_next_extent = efip->efi_format.efi_nextents;
+}
+
 /*
  * Free an extent and log it to the EFD. Note that the transaction is marked
  * dirty regardless of whether the extent free succeeds or fails to support the
@@ -378,6 +406,17 @@ xfs_trans_free_extent(
 	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
 	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
 
+	/*
+	 * If we need a new transaction to make progress, the caller will log a
+	 * new EFI with the current contents. It will also log an EFD to cancel
+	 * the existing EFI, and so we need to copy all the unprocessed extents
+	 * in this EFI to the EFD so this works correctly.
+	 */
+	if (error == -EAGAIN) {
+		xfs_efd_from_efi(efdp);
+		return error;
+	}
+
 	next_extent = efdp->efd_next_extent;
 	ASSERT(next_extent < efdp->efd_format.efd_nextents);
 	extp = &(efdp->efd_format.efd_extents[next_extent]);
@@ -495,6 +534,13 @@ xfs_extent_free_finish_item(
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
 
+	/*
+	 * Don't free the XEFI if we need a new transaction to complete
+	 * processing of it.
+	 */
+	if (error == -EAGAIN)
+		return error;
+
 	xfs_extent_free_put_group(xefi);
 	kmem_cache_free(xfs_extfree_item_cache, xefi);
 	return error;
@@ -620,6 +666,7 @@ xfs_efi_item_recover(
 	struct xfs_trans		*tp;
 	int				i;
 	int				error = 0;
+	bool				requeue_only = false;
 
 	/*
 	 * First check the validity of the extents described by the
@@ -653,9 +700,29 @@ xfs_efi_item_recover(
 		fake.xefi_startblock = extp->ext_start;
 		fake.xefi_blockcount = extp->ext_len;
 
-		xfs_extent_free_get_group(mp, &fake);
-		error = xfs_trans_free_extent(tp, efdp, &fake);
-		xfs_extent_free_put_group(&fake);
+		if (!requeue_only) {
+			xfs_extent_free_get_group(mp, &fake);
+			error = xfs_trans_free_extent(tp, efdp, &fake);
+			xfs_extent_free_put_group(&fake);
+		}
+
+		/*
+		 * If we can't free the extent without potentially deadlocking,
+		 * requeue the rest of the extents to a new so that they get
+		 * run again later with a new transaction context.
+		 */
+		if (error == -EAGAIN || requeue_only) {
+			error = xfs_free_extent_later(tp, fake.xefi_startblock,
+					fake.xefi_blockcount,
+					&XFS_RMAP_OINFO_ANY_OWNER,
+					fake.xefi_type);
+			if (!error) {
+				requeue_only = true;
+				error = 0;
+				continue;
+			}
+		};
+
 		if (error == -EFSCORRUPTED)
 			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
 					extp, sizeof(*extp));
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/8] xfs: don't block in busy flushing when freeing extents
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (3 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 4/8] xfs: allow extent free intents to be retried Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-27 22:44 ` [PATCH 6/8] xfs: journal geometry is not properly bounds checked Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If the current transaction holds a busy extent and we are trying to
allocate a new extent to fix up the free list, we can deadlock if
the AG is entirely empty except for the busy extent held by the
transaction.

This can occur at runtime processing an XEFI with multiple extents
in this path:

__schedule+0x22f at ffffffff81f75e8f
schedule+0x46 at ffffffff81f76366
xfs_extent_busy_flush+0x69 at ffffffff81477d99
xfs_alloc_ag_vextent_size+0x16a at ffffffff8141711a
xfs_alloc_ag_vextent+0x19b at ffffffff81417edb
xfs_alloc_fix_freelist+0x22f at ffffffff8141896f
xfs_free_extent_fix_freelist+0x6a at ffffffff8141939a
__xfs_free_extent+0x99 at ffffffff81419499
xfs_trans_free_extent+0x3e at ffffffff814a6fee
xfs_extent_free_finish_item+0x24 at ffffffff814a70d4
xfs_defer_finish_noroll+0x1f7 at ffffffff81441407
xfs_defer_finish+0x11 at ffffffff814417e1
xfs_itruncate_extents_flags+0x13d at ffffffff8148b7dd
xfs_inactive_truncate+0xb9 at ffffffff8148bb89
xfs_inactive+0x227 at ffffffff8148c4f7
xfs_fs_destroy_inode+0xb8 at ffffffff81496898
destroy_inode+0x3b at ffffffff8127d2ab
do_unlinkat+0x1d1 at ffffffff81270df1
do_syscall_64+0x40 at ffffffff81f6b5f0
entry_SYSCALL_64_after_hwframe+0x44 at ffffffff8200007c

This can also happen in log recovery when processing an EFI
with multiple extents through this path:

context_switch() kernel/sched/core.c:3881
__schedule() kernel/sched/core.c:5111
schedule() kernel/sched/core.c:5186
xfs_extent_busy_flush() fs/xfs/xfs_extent_busy.c:598
xfs_alloc_ag_vextent_size() fs/xfs/libxfs/xfs_alloc.c:1641
xfs_alloc_ag_vextent() fs/xfs/libxfs/xfs_alloc.c:828
xfs_alloc_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:2362
xfs_free_extent_fix_freelist() fs/xfs/libxfs/xfs_alloc.c:3029
__xfs_free_extent() fs/xfs/libxfs/xfs_alloc.c:3067
xfs_trans_free_extent() fs/xfs/xfs_extfree_item.c:370
xfs_efi_recover() fs/xfs/xfs_extfree_item.c:626
xlog_recover_process_efi() fs/xfs/xfs_log_recover.c:4605
xlog_recover_process_intents() fs/xfs/xfs_log_recover.c:4893
xlog_recover_finish() fs/xfs/xfs_log_recover.c:5824
xfs_log_mount_finish() fs/xfs/xfs_log.c:764
xfs_mountfs() fs/xfs/xfs_mount.c:978
xfs_fs_fill_super() fs/xfs/xfs_super.c:1908
mount_bdev() fs/super.c:1417
xfs_fs_mount() fs/xfs/xfs_super.c:1985
legacy_get_tree() fs/fs_context.c:647
vfs_get_tree() fs/super.c:1547
do_new_mount() fs/namespace.c:2843
do_mount() fs/namespace.c:3163
ksys_mount() fs/namespace.c:3372
__do_sys_mount() fs/namespace.c:3386
__se_sys_mount() fs/namespace.c:3383
__x64_sys_mount() fs/namespace.c:3383
do_syscall_64() arch/x86/entry/common.c:296
entry_SYSCALL_64() arch/x86/entry/entry_64.S:180

To avoid this deadlock, we should not block in
xfs_extent_busy_flush() if we hold a busy extent in the current
transaction.

Now that the EFI processing code can handle requeuing a partially
completed EFI, we can detect this situation in
xfs_extent_busy_flush() and return -EAGAIN rather than going to
sleep forever. The -EAGAIN get propagated back out to the
xfs_trans_free_extent() context, where the EFD is populated and the
transaction is rolled, thereby moving the busy extents into the CIL.

At this point, we can retry the extent free operation again with a
clean transaction. If we hit the same "all free extents are busy"
situation when trying to fix up the free list, we can safely call
xfs_extent_busy_flush() and wait for the busy extents to resolve
and wake us. At this point, the allocation search can make progress
again and we can fix up the free list.

This deadlock was first reported by Chandan in mid-2021, but I
couldn't make myself understood during review, and didn't have time
to fix it myself.

It was reported again in March 2023, and again I have found myself
unable to explain the complexities of the solution needed during
review.

As such, I don't have hours more time to waste trying to get the
fix written the way it needs to be written, so I'm just doing it
myself. This patchset is largely based on Wengang Wang's last patch,
but with all the unnecessary stuff removed, split up into multiple
patches and cleaned up somewhat.

Reported-by: Chandan Babu R <chandanrlinux@gmail.com>
Reported-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c | 68 ++++++++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_alloc.h | 11 ++++---
 fs/xfs/xfs_extent_busy.c  | 33 ++++++++++++++++---
 fs/xfs/xfs_extent_busy.h  |  6 ++--
 4 files changed, 88 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index ba929f6b5c9b..1e72b91daff6 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1556,6 +1556,8 @@ xfs_alloc_ag_vextent_near(
 	if (args->agbno > args->max_agbno)
 		args->agbno = args->max_agbno;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	len = 0;
 
@@ -1611,9 +1613,20 @@ xfs_alloc_ag_vextent_near(
 	 */
 	if (!acur.len) {
 		if (acur.busy) {
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag,
-					      acur.busy_gen, alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					acur.busy_gen, alloc_flags);
+			if (error)
+				goto out;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
 			goto restart;
 		}
 		trace_xfs_alloc_size_neither(args);
@@ -1653,6 +1666,8 @@ xfs_alloc_ag_vextent_size(
 	int			error;
 	int			i;
 
+	/* Retry once quickly if we find busy extents before blocking. */
+	alloc_flags |= XFS_ALLOC_FLAG_TRYFLUSH;
 restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
@@ -1710,19 +1725,25 @@ xfs_alloc_ag_vextent_size(
 			error = xfs_btree_increment(cnt_cur, 0, &i);
 			if (error)
 				goto error0;
-			if (i == 0) {
-				/*
-				 * Our only valid extents must have been busy.
-				 * Make it unbusy by forcing the log out and
-				 * retrying.
-				 */
-				xfs_btree_del_cursor(cnt_cur,
-						     XFS_BTREE_NOERROR);
-				trace_xfs_alloc_size_busy(args);
-				xfs_extent_busy_flush(args->mp, args->pag,
-						busy_gen, alloc_flags);
-				goto restart;
-			}
+			if (i)
+				continue;
+
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
+			trace_xfs_alloc_size_busy(args);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			goto restart;
 		}
 	}
 
@@ -1802,10 +1823,21 @@ xfs_alloc_ag_vextent_size(
 	args->len = rlen;
 	if (rlen < args->minlen) {
 		if (busy) {
-			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			/*
+			 * Our only valid extents must have been busy. Flush and
+			 * retry the allocation again. If we get an -EAGAIN
+			 * error, we're being told that a deadlock was avoided
+			 * and the current transaction needs committing before
+			 * the allocation can be retried.
+			 */
 			trace_xfs_alloc_size_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
-					alloc_flags);
+			error = xfs_extent_busy_flush(args->tp, args->pag,
+					busy_gen, alloc_flags);
+			if (error)
+				goto error0;
+
+			alloc_flags &= ~XFS_ALLOC_FLAG_TRYFLUSH;
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			goto restart;
 		}
 		goto out_nominleft;
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 50119ebaede9..29e56fdf25e4 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -19,11 +19,12 @@ unsigned int xfs_agfl_size(struct xfs_mount *mp);
 /*
  * Flags for xfs_alloc_fix_freelist.
  */
-#define	XFS_ALLOC_FLAG_TRYLOCK	0x00000001  /* use trylock for buffer locking */
-#define	XFS_ALLOC_FLAG_FREEING	0x00000002  /* indicate caller is freeing extents*/
-#define	XFS_ALLOC_FLAG_NORMAP	0x00000004  /* don't modify the rmapbt */
-#define	XFS_ALLOC_FLAG_NOSHRINK	0x00000008  /* don't shrink the freelist */
-#define	XFS_ALLOC_FLAG_CHECK	0x00000010  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYLOCK	(1U << 0)  /* use trylock for buffer locking */
+#define	XFS_ALLOC_FLAG_FREEING	(1U << 1)  /* indicate caller is freeing extents*/
+#define	XFS_ALLOC_FLAG_NORMAP	(1U << 2)  /* don't modify the rmapbt */
+#define	XFS_ALLOC_FLAG_NOSHRINK	(1U << 3)  /* don't shrink the freelist */
+#define	XFS_ALLOC_FLAG_CHECK	(1U << 4)  /* test only, don't modify args */
+#define	XFS_ALLOC_FLAG_TRYFLUSH	(1U << 5)  /* don't wait in busy extent flush */
 
 /*
  * Argument structure for xfs_alloc routines.
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 5f44936eae1c..7c2fdc71e42d 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -566,10 +566,21 @@ xfs_extent_busy_clear(
 
 /*
  * Flush out all busy extents for this AG.
+ *
+ * If the current transaction is holding busy extents, the caller may not want
+ * to wait for committed busy extents to resolve. If we are being told just to
+ * try a flush or progress has been made since we last skipped a busy extent,
+ * return immediately to allow the caller to try again.
+ *
+ * If we are freeing extents, we might actually be holding the only free extents
+ * in the transaction busy list and the log force won't resolve that situation.
+ * In this case, we must return -EAGAIN to avoid a deadlock by informing the
+ * caller it needs to commit the busy extents it holds before retrying the
+ * extent free operation.
  */
-void
+int
 xfs_extent_busy_flush(
-	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
 	struct xfs_perag	*pag,
 	unsigned		busy_gen,
 	uint32_t		alloc_flags)
@@ -577,10 +588,23 @@ xfs_extent_busy_flush(
 	DEFINE_WAIT		(wait);
 	int			error;
 
-	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	error = xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 	if (error)
-		return;
+		return error;
 
+	/* Avoid deadlocks on uncommitted busy extents. */
+	if (!list_empty(&tp->t_busy)) {
+		if (alloc_flags & XFS_ALLOC_FLAG_TRYFLUSH)
+			return 0;
+
+		if (busy_gen != READ_ONCE(pag->pagb_gen))
+			return 0;
+
+		if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
+			return -EAGAIN;
+	}
+
+	/* Wait for committed busy extents to resolve. */
 	do {
 		prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
 		if  (busy_gen != READ_ONCE(pag->pagb_gen))
@@ -589,6 +613,7 @@ xfs_extent_busy_flush(
 	} while (1);
 
 	finish_wait(&pag->pagb_wait, &wait);
+	return 0;
 }
 
 void
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 7a82c689bfa4..c37bf87e6781 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -51,9 +51,9 @@ bool
 xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
 		xfs_extlen_t *len, unsigned *busy_gen);
 
-void
-xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
-	unsigned busy_gen, uint32_t alloc_flags);
+int
+xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_perag *pag,
+		unsigned busy_gen, uint32_t alloc_flags);
 
 void
 xfs_extent_busy_wait_all(struct xfs_mount *mp);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 6/8] xfs: journal geometry is not properly bounds checked
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (4 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 5/8] xfs: don't block in busy flushing when freeing extents Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28  6:08   ` Christoph Hellwig
  2023-06-28 17:50   ` Darrick J. Wong
  2023-06-27 22:44 ` [PATCH 7/8] xfs: AGF length has never been " Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

If the journal geometry results in a sector or log stripe unit
validation problem, it indicates that we cannot set the log up to
safely write to the the journal. In these cases, we must abort the
mount because the corruption needs external intervention to resolve.
Similarly, a journal that is too large cannot be written to safely,
either, so we shouldn't allow those geometries to mount, either.

If the log is too small, we risk having transaction reservations
overruning the available log space and the system hanging waiting
for space it can never provide. This is purely a runtime hang issue,
not a corruption issue as per the first cases listed above. We abort
mounts of the log is too small for V5 filesystems, but we must allow
v4 filesystems to mount because, historically, there was no log size
validity checking and so some systems may still be out there with
undersized logs.

The problem is that on V4 filesystems, when we discover a log
geometry problem, we skip all the remaining checks and then allow
the log to continue mounting. This mean that if one of the log size
checks fails, we skip the log stripe unit check. i.e. we allow the
mount because a "non-fatal" geometry is violated, and then fail to
check the hard fail geometries that should fail the mount.

Move all these fatal checks to the superblock verifier, and add a
new check for the two log sector size geometry variables having the
same values. This will prevent any attempt to mount a log that has
invalid or inconsistent geometries long before we attempt to mount
the log.

However, for the minimum log size checks, we can only do that once
we've setup up the log and calculated all the iclog sizes and
roundoffs. Hence this needs to remain in the log mount code after
the log has been initialised. It is also the only case where we
should allow a v4 filesystem to continue running, so leave that
handling in place, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_sb.c | 56 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_log.c       | 47 +++++++++++------------------------
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ba0f17bc1dc0..5e174685a77c 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -412,7 +412,6 @@ xfs_validate_sb_common(
 	    sbp->sb_inodelog < XFS_DINODE_MIN_LOG			||
 	    sbp->sb_inodelog > XFS_DINODE_MAX_LOG			||
 	    sbp->sb_inodesize != (1 << sbp->sb_inodelog)		||
-	    sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE			||
 	    sbp->sb_inopblock != howmany(sbp->sb_blocksize,sbp->sb_inodesize) ||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) < XFS_MIN_AG_BYTES	||
 	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) > XFS_MAX_AG_BYTES	||
@@ -430,6 +429,61 @@ xfs_validate_sb_common(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Logs that are too large are not supported at all. Reject them
+	 * outright. Logs that are too small are tolerated on v4 filesystems,
+	 * but we can only check that when mounting the log. Hence we skip
+	 * those checks here.
+	 */
+	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
+		xfs_notice(mp,
+		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
+			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);
+		return -EFSCORRUPTED;
+	}
+
+	if (XFS_FSB_TO_B(mp, sbp->sb_logblocks) > XFS_MAX_LOG_BYTES) {
+		xfs_warn(mp,
+		"log size 0x%llx bytes too large, maximum size is 0x%llx bytes",
+			 XFS_FSB_TO_B(mp, sbp->sb_logblocks),
+			 XFS_MAX_LOG_BYTES);
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Do not allow filesystems with corrupted log sector or stripe units to
+	 * be mounted. We cannot safely size the iclogs or write to the log if
+	 * the log stripe unit is not valid.
+	 */
+	if (sbp->sb_versionnum & XFS_SB_VERSION_SECTORBIT) {
+		if (sbp->sb_logsectsize != (1U << sbp->sb_logsectlog)) {
+			xfs_notice(mp,
+			"log sector size in bytes/log2 (0x%x/0x%x) must match",
+				sbp->sb_logsectsize, 1U << sbp->sb_logsectlog);
+			return -EFSCORRUPTED;
+		}
+	} else if (sbp->sb_logsectsize || sbp->sb_logsectlog) {
+		xfs_notice(mp,
+		"log sector size in bytes/log2 (0x%x/0x%x) are not zero",
+			sbp->sb_logsectsize, sbp->sb_logsectlog);
+		return -EFSCORRUPTED;
+	}
+
+	if (sbp->sb_logsunit > 1) {
+		if (sbp->sb_logsunit % sbp->sb_blocksize) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes must be a multiple of block size",
+				sbp->sb_logsunit);
+			return -EFSCORRUPTED;
+		}
+		if (sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE) {
+			xfs_notice(mp,
+		"log stripe unit 0x%x bytes over maximum size (0x%x bytes)",
+				sbp->sb_logsunit, XLOG_MAX_RECORD_BSIZE);
+			return -EFSCORRUPTED;
+		}
+	}
+
 	/* Validate the realtime geometry; stolen from xfs_repair */
 	if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
 	    sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) {
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fc61cc024023..79004d193e54 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -639,7 +639,6 @@ xfs_log_mount(
 	int		num_bblks)
 {
 	struct xlog	*log;
-	bool		fatal = xfs_has_crc(mp);
 	int		error = 0;
 	int		min_logfsbs;
 
@@ -663,53 +662,37 @@ xfs_log_mount(
 	mp->m_log = log;
 
 	/*
-	 * Validate the given log space and drop a critical message via syslog
-	 * if the log size is too small that would lead to some unexpected
-	 * situations in transaction log space reservation stage.
+	 * Now that we have set up the log and it's internal geometry
+	 * parameters, we can validate the given log space and drop a critical
+	 * message via syslog if the log size is too small. A log that is too
+	 * small can lead to unexpected situations in transaction log space
+	 * reservation stage. The superblock verifier has already validated all
+	 * the other log geometry constraints, so we don't have to check those
+	 * here.
 	 *
-	 * Note: we can't just reject the mount if the validation fails.  This
-	 * would mean that people would have to downgrade their kernel just to
-	 * remedy the situation as there is no way to grow the log (short of
-	 * black magic surgery with xfs_db).
+	 * Note: For v4 filesystems, we can't just reject the mount if the
+	 * validation fails.  This would mean that people would have to
+	 * downgrade their kernel just to remedy the situation as there is no
+	 * way to grow the log (short of black magic surgery with xfs_db).
 	 *
-	 * We can, however, reject mounts for CRC format filesystems, as the
+	 * We can, however, reject mounts for V5 format filesystems, as the
 	 * mkfs binary being used to make the filesystem should never create a
 	 * filesystem with a log that is too small.
 	 */
 	min_logfsbs = xfs_log_calc_minimum_size(mp);
-
 	if (mp->m_sb.sb_logblocks < min_logfsbs) {
 		xfs_warn(mp,
 		"Log size %d blocks too small, minimum size is %d blocks",
 			 mp->m_sb.sb_logblocks, min_logfsbs);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logblocks > XFS_MAX_LOG_BLOCKS) {
-		xfs_warn(mp,
-		"Log size %d blocks too large, maximum size is %lld blocks",
-			 mp->m_sb.sb_logblocks, XFS_MAX_LOG_BLOCKS);
-		error = -EINVAL;
-	} else if (XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks) > XFS_MAX_LOG_BYTES) {
-		xfs_warn(mp,
-		"log size %lld bytes too large, maximum size is %lld bytes",
-			 XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks),
-			 XFS_MAX_LOG_BYTES);
-		error = -EINVAL;
-	} else if (mp->m_sb.sb_logsunit > 1 &&
-		   mp->m_sb.sb_logsunit % mp->m_sb.sb_blocksize) {
-		xfs_warn(mp,
-		"log stripe unit %u bytes must be a multiple of block size",
-			 mp->m_sb.sb_logsunit);
-		error = -EINVAL;
-		fatal = true;
-	}
-	if (error) {
+
 		/*
 		 * Log check errors are always fatal on v5; or whenever bad
 		 * metadata leads to a crash.
 		 */
-		if (fatal) {
+		if (xfs_has_crc(mp)) {
 			xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!");
 			ASSERT(0);
+			error = -EINVAL;
 			goto out_free_log;
 		}
 		xfs_crit(mp, "Log size out of supported range.");
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 7/8] xfs: AGF length has never been bounds checked
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (5 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 6/8] xfs: journal geometry is not properly bounds checked Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28 17:52   ` Darrick J. Wong
  2023-06-27 22:44 ` [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Dave Chinner
  2023-06-29 19:42 ` [RFC PATCH 9/8] xfs: AGI length should be bounds checked Darrick J. Wong
  8 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AGF verifier does not check that the AGF length field is within
known good bounds. This has never been checked by runtime kernel
code (i.e. the lack of verification goes back to 1993) yet we assume
in many places that it is correct and verify other metdata against
it.

Add length verification to the AGF verifier. The length of the AGF
must be equal to the size of the AG specified in the superblock,
unless it is the last AG in the filesystem. In that case, it must be
less than or equal to sb->sb_agblocks and greater than
XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
allow to exist.

This requires a bit of rework of the verifier function. We want to
verify metadata before we use it to verify other metadata. Hence
we need to verify the AGF sequence numbers before using them to
verify the length of the AGF. Then we can verify the AGF length
before we verify AGFL fields. Then we can verifier other fields that
are bounds limited by the AGF length.

And, finally, by calculating agf_length only once into a local
variable, we can collapse repeated "if (xfs_has_foo() &&"
conditionaly checks into single checks. This makes the code much
easier to follow as all the checks for a given feature are obviously
in the same place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c | 86 +++++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 1e72b91daff6..7c86a69354fb 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2974,6 +2974,7 @@ xfs_agf_verify(
 {
 	struct xfs_mount	*mp = bp->b_mount;
 	struct xfs_agf		*agf = bp->b_addr;
+	uint32_t		agf_length = be32_to_cpu(agf->agf_length);
 
 	if (xfs_has_crc(mp)) {
 		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
@@ -2985,18 +2986,43 @@ xfs_agf_verify(
 	if (!xfs_verify_magic(bp, agf->agf_magicnum))
 		return __this_address;
 
-	if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
-	      be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
-	      be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
+	if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)))
 		return __this_address;
 
-	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks)
+	/*
+	 * Both agf_seqno and agf_length need to validated before anything else
+	 * block number related in the AGF or AGFL can be checked.
+	 *
+	 * During growfs operations, the perag is not fully initialised,
+	 * so we can't use it for any useful checking. growfs ensures we can't
+	 * use it by using uncached buffers that don't have the perag attached
+	 * so we can detect and avoid this problem.
+	 */
+	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
+		return __this_address;
+
+	/*
+	 * Only the last AGF in the filesytsem is allowed to be shorter
+	 * than the AG size recorded in the superblock.
+	 */
+	if (agf_length != mp->m_sb.sb_agblocks) {
+		if (be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1)
+			return __this_address;
+		if (agf_length < XFS_MIN_AG_BLOCKS)
+			return __this_address;
+		if (agf_length > mp->m_sb.sb_agblocks)
+			return __this_address;
+	}
+
+	if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp))
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
-	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_freeblks) > agf_length)
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
@@ -3007,38 +3033,28 @@ xfs_agf_verify(
 						mp->m_alloc_maxlevels)
 		return __this_address;
 
-	if (xfs_has_rmapbt(mp) &&
-	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
-	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
-						mp->m_rmap_maxlevels))
-		return __this_address;
-
-	if (xfs_has_rmapbt(mp) &&
-	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length))
-		return __this_address;
-
-	/*
-	 * during growfs operations, the perag is not fully initialised,
-	 * so we can't use it for any useful checking. growfs ensures we can't
-	 * use it by using uncached buffers that don't have the perag attached
-	 * so we can detect and avoid this problem.
-	 */
-	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
-		return __this_address;
-
 	if (xfs_has_lazysbcount(mp) &&
-	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_btreeblks) > agf_length)
 		return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    be32_to_cpu(agf->agf_refcount_blocks) >
-	    be32_to_cpu(agf->agf_length))
-		return __this_address;
+	if (xfs_has_rmapbt(mp)) {
+		if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length)
+			return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    (be32_to_cpu(agf->agf_refcount_level) < 1 ||
-	     be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels))
-		return __this_address;
+		if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
+		    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
+							mp->m_rmap_maxlevels)
+			return __this_address;
+	}
+
+	if (xfs_has_reflink(mp)) {
+		if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length)
+			return __this_address;
+
+		if (be32_to_cpu(agf->agf_refcount_level) < 1 ||
+		    be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)
+			return __this_address;
+	}
 
 	return NULL;
 }
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block()
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (6 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 7/8] xfs: AGF length has never been " Dave Chinner
@ 2023-06-27 22:44 ` Dave Chinner
  2023-06-28  6:09   ` Christoph Hellwig
  2023-06-28 17:52   ` Darrick J. Wong
  2023-06-29 19:42 ` [RFC PATCH 9/8] xfs: AGI length should be bounds checked Darrick J. Wong
  8 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-27 22:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Need to happen before we allocate and then leak the xefi. Found by
coverity via an xfsprogs libxfs scan.

Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7c86a69354fb..9919fdfe1d7e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2470,25 +2470,26 @@ static int
 xfs_defer_agfl_block(
 	struct xfs_trans		*tp,
 	xfs_agnumber_t			agno,
-	xfs_fsblock_t			agbno,
+	xfs_agblock_t			agbno,
 	struct xfs_owner_info		*oinfo)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_extent_free_item	*xefi;
+	xfs_fsblock_t			fsbno = XFS_AGB_TO_FSB(mp, agno, agbno);
 
 	ASSERT(xfs_extfree_item_cache != NULL);
 	ASSERT(oinfo != NULL);
 
+	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, fsbno)))
+		return -EFSCORRUPTED;
+
 	xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
 			       GFP_KERNEL | __GFP_NOFAIL);
-	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
+	xefi->xefi_startblock = fsbno;
 	xefi->xefi_blockcount = 1;
 	xefi->xefi_owner = oinfo->oi_owner;
 	xefi->xefi_type = XFS_AG_RESV_AGFL;
 
-	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
-		return -EFSCORRUPTED;
-
 	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
 
 	xfs_extent_free_get_group(mp, xefi);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
@ 2023-06-28  6:03   ` Christoph Hellwig
  2023-06-28  9:55   ` Chandan Babu R
  2023-06-28 17:46   ` Darrick J. Wong
  2 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2023-06-28  6:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] xfs: journal geometry is not properly bounds checked
  2023-06-27 22:44 ` [PATCH 6/8] xfs: journal geometry is not properly bounds checked Dave Chinner
@ 2023-06-28  6:08   ` Christoph Hellwig
  2023-06-28  6:38     ` Dave Chinner
  2023-06-28 17:50   ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2023-06-28  6:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
> +		xfs_notice(mp,
> +		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
> +			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);

Just a little nitpick, didn't we traditionally align the overly
long format strings with a single tab and not to the same line as
the xfs_notice/warn/etc?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block()
  2023-06-27 22:44 ` [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Dave Chinner
@ 2023-06-28  6:09   ` Christoph Hellwig
  2023-06-28 17:52   ` Darrick J. Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2023-06-28  6:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:12AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Need to happen before we allocate and then leak the xefi. Found by
> coverity via an xfsprogs libxfs scan.

... and also fixes the type of the agbno argument, which probably should
be here in the commit log.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] xfs: journal geometry is not properly bounds checked
  2023-06-28  6:08   ` Christoph Hellwig
@ 2023-06-28  6:38     ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-28  6:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Jun 27, 2023 at 11:08:46PM -0700, Christoph Hellwig wrote:
> > +	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
> > +		xfs_notice(mp,
> > +		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
> > +			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);
> 
> Just a little nitpick, didn't we traditionally align the overly
> long format strings with a single tab and not to the same line as
> the xfs_notice/warn/etc?

I don't think we have a particluarly formal definition of that.
 
I mean, just look at all the different variants in
xfs_sb_validate_common:

	if (xfs_sb_is_v5(sbp)) {
                if (sbp->sb_blocksize < XFS_MIN_CRC_BLOCKSIZE) {
                        xfs_notice(mp,
"Block size (%u bytes) too small for Version 5 superblock (minimum %d bytes)",
                                sbp->sb_blocksize, XFS_MIN_CRC_BLOCKSIZE);
                        return -EFSCORRUPTED;
                }

                /* V5 has a separate project quota inode */
                if (sbp->sb_qflags & (XFS_OQUOTA_ENFD | XFS_OQUOTA_CHKD)) {
                        xfs_notice(mp,
                           "Version 5 of Super block has XFS_OQUOTA bits.");
                        return -EFSCORRUPTED;
                }

                /*
                 * Full inode chunks must be aligned to inode chunk size when
                 * sparse inodes are enabled to support the sparse chunk
                 * allocation algorithm and prevent overlapping inode records.
                 */
                if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES) {
                        uint32_t        align;

                        align = XFS_INODES_PER_CHUNK * sbp->sb_inodesize
                                        >> sbp->sb_blocklog;
                        if (sbp->sb_inoalignmt != align) {
                                xfs_warn(mp,
"Inode block alignment (%u) must match chunk size (%u) for sparse inodes.",
                                         sbp->sb_inoalignmt, align);
                                return -EINVAL;
                        }
                }
        } else if (sbp->sb_qflags & (XFS_PQUOTA_ENFD | XFS_GQUOTA_ENFD |
                                XFS_PQUOTA_CHKD | XFS_GQUOTA_CHKD)) {
                        xfs_notice(mp,
"Superblock earlier than Version 5 has XFS_{P|G}QUOTA_{ENFD|CHKD} bits.");
                        return -EFSCORRUPTED;
        }

        if (unlikely(
            sbp->sb_logstart == 0 && mp->m_logdev_targp == mp->m_ddev_targp)) {
                xfs_warn(mp,
                "filesystem is marked as having an external log; "
                "specify logdev on the mount command line.");
                return -EINVAL;
        }

        if (unlikely(
            sbp->sb_logstart != 0 && mp->m_logdev_targp != mp->m_ddev_targp)) {
                xfs_warn(mp,
                "filesystem is marked as having an internal log; "
                "do not specify logdev on the mount command line.");
                return -EINVAL;
        }

There's 4-5 different variations in indenting in 6-7 warning
messages. And there are more variations in that function, too.

There's no consistent rule to follow, so I just set the indenting
such that the format string didn't run over 80 columns and didn't
think any further..

I think that if we really care, a separate cleanup patch to make all
the long format strings use consistent zero-indenting and the
variables one tab like this one:

                        xfs_notice(mp,
"Block size (%u bytes) too small for Version 5 superblock (minimum %d bytes)",
                                sbp->sb_blocksize, XFS_MIN_CRC_BLOCKSIZE);

Could be done. If I'm ever lacking for something to do...

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

_dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
  2023-06-28  6:03   ` Christoph Hellwig
@ 2023-06-28  9:55   ` Chandan Babu R
  2023-06-28 17:46   ` Darrick J. Wong
  2 siblings, 0 replies; 30+ messages in thread
From: Chandan Babu R @ 2023-06-28  9:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:05 AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> XFS has strict metadata ordering requirements. One of the things it
> does is maintain the commit order of items from transaction commit
> through the CIL and into the AIL. That is, if a transaction logs
> item A before item B in a modification, then they will be inserted
> into the CIL in the order {A, B}. These items are then written into
> the iclog during checkpointing in the order {A, B}. When the
> checkpoint commits, they are supposed to be inserted into the AIL in
> the order {A, B}, and when they are pushed from the AIL, they are
> pushed in the order {A, B}.
>
> If we crash, log recovery then replays the two items from the
> checkpoint in the order {A, B}, resulting in the objects the items
> apply to being queued for writeback at the end of the checkpoint
> in the order {A, B}. This means recovery behaves the same way as the
> runtime code.
>
> In places, we have subtle dependencies on this ordering being
> maintained. One of this place is performing intent recovery from the
> log. It assumes that recovering an intent will result in a
> non-intent object being the first thing that is modified in the
> recovery transaction, and so when the transaction commits and the
> journal flushes, the first object inserted into the AIL beyond the
> intent recovery range will be a non-intent item.  It uses the
> transistion from intent items to non-intent items to stop the
> recovery pass.
>
> A recent log recovery issue indicated that an intent was appearing
> as the first item in the AIL beyond the recovery range, hence
> breaking the end of recovery detection that exists.
>
> Tracing indicated insertion of the items into the AIL was apparently
> occurring in the right order (the intent was last in the commit item
> list), but the intent was appearing first in the AIL. IOWs, the
> order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
> insertion was reversing the order of the items in the batch of items
> being inserted.
>
> Lucky for us, all the items fed to bulk insertion have the same LSN,
> so the reversal of order does not affect the log head/tail tracking
> that is based on the contents of the AIL. It only impacts on code
> that has implicit, subtle dependencies on object order, and AFAICT
> only the intent recovery loop is impacted by it.
>
> Make sure bulk AIL insertion does not reorder items incorrectly.

Looks good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_trans_ail.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 7d4109af193e..1098452e7f95 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
>  			trace_xfs_ail_insert(lip, 0, lsn);
>  		}
>  		lip->li_lsn = lsn;
> -		list_add(&lip->li_ail, &tmp);
> +		list_add_tail(&lip->li_ail, &tmp);
>  	}
>  
>  	if (!list_empty(&tmp))


-- 
chandan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/8] xfs: use deferred frees for btree block freeing
  2023-06-27 22:44 ` [PATCH 2/8] xfs: use deferred frees for btree block freeing Dave Chinner
@ 2023-06-28 17:46   ` Darrick J. Wong
  2023-06-28 22:55     ` Dave Chinner
  2023-06-29  7:52   ` Chandan Babu R
  1 sibling, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:06AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Btrees that aren't freespace management trees use the normal extent
> allocation and freeing routines for their blocks. Hence when a btree
> block is freed, a direct call to xfs_free_extent() is made and the
> extent is immediately freed. This puts the entire free space
> management btrees under this path, so we are stacking btrees on
> btrees in the call stack. The inobt, finobt and refcount btrees
> all do this.
> 
> However, the bmap btree does not do this - it calls
> xfs_free_extent_later() to defer the extent free operation via an
> XEFI and hence it gets processed in deferred operation processing
> during the commit of the primary transaction (i.e. via intent
> chaining).
> 
> We need to change xfs_free_extent() to behave in a non-blocking
> manner so that we can avoid deadlocks with busy extents near ENOSPC
> in transactions that free multiple extents. Inserting or removing a
> record from a btree can cause a multi-level tree merge operation and
> that will free multiple blocks from the btree in a single
> transaction. i.e. we can call xfs_free_extent() multiple times, and
> hence the btree manipulation transaction is vulnerable to this busy
> extent deadlock vector.
> 
> To fix this, convert all the remaining callers of xfs_free_extent()
> to use xfs_free_extent_later() to queue XEFIs and hence defer
> processing of the extent frees to a context that can be safely
> restarted if a deadlock condition is detected.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ag.c             | 2 +-
>  fs/xfs/libxfs/xfs_alloc.c          | 4 ++++
>  fs/xfs/libxfs/xfs_alloc.h          | 8 +++++---
>  fs/xfs/libxfs/xfs_bmap.c           | 8 +++++---
>  fs/xfs/libxfs/xfs_bmap_btree.c     | 3 ++-
>  fs/xfs/libxfs/xfs_ialloc.c         | 8 ++++----
>  fs/xfs/libxfs/xfs_ialloc_btree.c   | 3 +--
>  fs/xfs/libxfs/xfs_refcount.c       | 9 ++++++---
>  fs/xfs/libxfs/xfs_refcount_btree.c | 8 +-------
>  fs/xfs/xfs_extfree_item.c          | 3 ++-
>  fs/xfs/xfs_reflink.c               | 3 ++-
>  11 files changed, 33 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index ee84835ebc66..e9cc481b4ddf 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -985,7 +985,7 @@ xfs_ag_shrink_space(
>  			goto resv_err;
>  
>  		err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
> -				true);
> +				XFS_AG_RESV_NONE, true);
>  		if (err2)
>  			goto resv_err;
>  
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index c20fe99405d8..cc3f7b905ea1 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2449,6 +2449,7 @@ xfs_defer_agfl_block(
>  	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
>  	xefi->xefi_blockcount = 1;
>  	xefi->xefi_owner = oinfo->oi_owner;
> +	xefi->xefi_type = XFS_AG_RESV_AGFL;
>  	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
>  		return -EFSCORRUPTED;
> @@ -2470,6 +2471,7 @@ __xfs_free_extent_later(
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
>  	const struct xfs_owner_info	*oinfo,
> +	enum xfs_ag_resv_type		type,
>  	bool				skip_discard)
>  {
>  	struct xfs_extent_free_item	*xefi;
> @@ -2490,6 +2492,7 @@ __xfs_free_extent_later(
>  	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
>  #endif
>  	ASSERT(xfs_extfree_item_cache != NULL);
> +	ASSERT(type != XFS_AG_RESV_AGFL);
>  
>  	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
>  		return -EFSCORRUPTED;
> @@ -2498,6 +2501,7 @@ __xfs_free_extent_later(
>  			       GFP_KERNEL | __GFP_NOFAIL);
>  	xefi->xefi_startblock = bno;
>  	xefi->xefi_blockcount = (xfs_extlen_t)len;
> +	xefi->xefi_type = type;
>  	if (skip_discard)
>  		xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
>  	if (oinfo) {
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 85ac470be0da..121faf1e11ad 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -232,7 +232,7 @@ xfs_buf_to_agfl_bno(
>  
>  int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
>  		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
> -		bool skip_discard);
> +		enum xfs_ag_resv_type type, bool skip_discard);
>  
>  /*
>   * List of extents to be free "later".
> @@ -245,6 +245,7 @@ struct xfs_extent_free_item {
>  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
>  	struct xfs_perag	*xefi_pag;
>  	unsigned int		xefi_flags;

/me is barely back from vacation, starting to process the ~1100 emails
by taking care of the obvious bugfixes first...

> +	enum xfs_ag_resv_type	xefi_type;

I got confused by 'xefi_type' until I remembered that
XFS_DEFER_OPS_TYPE_AGFL_FREE / XFS_DEFER_OPS_TYPE_FREE are stuffed in
the xfs_defer_pending structure, not the xefi itself.

Could this field be named xefi_agresv instead?

The rest of the logic in this patch looks correct and makes things
easier for the rt modernization patches, so I'll say

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

and change the name on commit, if that's ok?

--D

>  };
>  
>  void xfs_extent_free_get_group(struct xfs_mount *mp,
> @@ -259,9 +260,10 @@ xfs_free_extent_later(
>  	struct xfs_trans		*tp,
>  	xfs_fsblock_t			bno,
>  	xfs_filblks_t			len,
> -	const struct xfs_owner_info	*oinfo)
> +	const struct xfs_owner_info	*oinfo,
> +	enum xfs_ag_resv_type		type)
>  {
> -	return __xfs_free_extent_later(tp, bno, len, oinfo, false);
> +	return __xfs_free_extent_later(tp, bno, len, oinfo, type, false);
>  }
>  
>  
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index fef35696adb7..30c931b38853 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -574,7 +574,8 @@ xfs_bmap_btree_to_extents(
>  		return error;
>  
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, whichfork);
> -	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo);
> +	error = xfs_free_extent_later(cur->bc_tp, cbno, 1, &oinfo,
> +			XFS_AG_RESV_NONE);
>  	if (error)
>  		return error;
>  
> @@ -5236,8 +5237,9 @@ xfs_bmap_del_extent_real(
>  		} else {
>  			error = __xfs_free_extent_later(tp, del->br_startblock,
>  					del->br_blockcount, NULL,
> -					(bflags & XFS_BMAPI_NODISCARD) ||
> -					del->br_state == XFS_EXT_UNWRITTEN);
> +					XFS_AG_RESV_NONE,
> +					((bflags & XFS_BMAPI_NODISCARD) ||
> +					del->br_state == XFS_EXT_UNWRITTEN));
>  			if (error)
>  				goto done;
>  		}
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index 36564ae3084f..bf3f1b36fdd2 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -271,7 +271,8 @@ xfs_bmbt_free_block(
>  	int			error;
>  
>  	xfs_rmap_ino_bmbt_owner(&oinfo, ip->i_ino, cur->bc_ino.whichfork);
> -	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo);
> +	error = xfs_free_extent_later(cur->bc_tp, fsbno, 1, &oinfo,
> +			XFS_AG_RESV_NONE);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 34600f94c2f4..1e5fafbc0cdb 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -1853,8 +1853,8 @@ xfs_difree_inode_chunk(
>  		/* not sparse, calculate extent info directly */
>  		return xfs_free_extent_later(tp,
>  				XFS_AGB_TO_FSB(mp, agno, sagbno),
> -				M_IGEO(mp)->ialloc_blks,
> -				&XFS_RMAP_OINFO_INODES);
> +				M_IGEO(mp)->ialloc_blks, &XFS_RMAP_OINFO_INODES,
> +				XFS_AG_RESV_NONE);
>  	}
>  
>  	/* holemask is only 16-bits (fits in an unsigned long) */
> @@ -1899,8 +1899,8 @@ xfs_difree_inode_chunk(
>  		ASSERT(agbno % mp->m_sb.sb_spino_align == 0);
>  		ASSERT(contigblk % mp->m_sb.sb_spino_align == 0);
>  		error = xfs_free_extent_later(tp,
> -				XFS_AGB_TO_FSB(mp, agno, agbno),
> -				contigblk, &XFS_RMAP_OINFO_INODES);
> +				XFS_AGB_TO_FSB(mp, agno, agbno), contigblk,
> +				&XFS_RMAP_OINFO_INODES, XFS_AG_RESV_NONE);
>  		if (error)
>  			return error;
>  
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 5a945ae21b5d..9258f01c0015 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -160,8 +160,7 @@ __xfs_inobt_free_block(
>  
>  	xfs_inobt_mod_blockcount(cur, -1);
>  	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, xfs_buf_daddr(bp));
> -	return xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
> -			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
> +	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
>  			&XFS_RMAP_OINFO_INOBT, resv);
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index b6e21433925c..70ab113c9cea 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -1152,7 +1152,8 @@ xfs_refcount_adjust_extents(
>  						cur->bc_ag.pag->pag_agno,
>  						tmp.rc_startblock);
>  				error = xfs_free_extent_later(cur->bc_tp, fsbno,
> -						  tmp.rc_blockcount, NULL);
> +						  tmp.rc_blockcount, NULL,
> +						  XFS_AG_RESV_NONE);
>  				if (error)
>  					goto out_error;
>  			}
> @@ -1213,7 +1214,8 @@ xfs_refcount_adjust_extents(
>  					cur->bc_ag.pag->pag_agno,
>  					ext.rc_startblock);
>  			error = xfs_free_extent_later(cur->bc_tp, fsbno,
> -					ext.rc_blockcount, NULL);
> +					ext.rc_blockcount, NULL,
> +					XFS_AG_RESV_NONE);
>  			if (error)
>  				goto out_error;
>  		}
> @@ -1981,7 +1983,8 @@ xfs_refcount_recover_cow_leftovers(
>  
>  		/* Free the block. */
>  		error = xfs_free_extent_later(tp, fsb,
> -				rr->rr_rrec.rc_blockcount, NULL);
> +				rr->rr_rrec.rc_blockcount, NULL,
> +				XFS_AG_RESV_NONE);
>  		if (error)
>  			goto out_trans;
>  
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index d4afc5f4e6a5..5c3987d8dc24 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -106,19 +106,13 @@ xfs_refcountbt_free_block(
>  	struct xfs_buf		*agbp = cur->bc_ag.agbp;
>  	struct xfs_agf		*agf = agbp->b_addr;
>  	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, xfs_buf_daddr(bp));
> -	int			error;
>  
>  	trace_xfs_refcountbt_free_block(cur->bc_mp, cur->bc_ag.pag->pag_agno,
>  			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1);
>  	be32_add_cpu(&agf->agf_refcount_blocks, -1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_REFCOUNT_BLOCKS);
> -	error = xfs_free_extent(cur->bc_tp, cur->bc_ag.pag,
> -			XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno), 1,
> +	return xfs_free_extent_later(cur->bc_tp, fsbno, 1,
>  			&XFS_RMAP_OINFO_REFC, XFS_AG_RESV_METADATA);
> -	if (error)
> -		return error;
> -
> -	return error;
>  }
>  
>  STATIC int
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index f9e36b810663..79e65bb6b0a2 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -365,7 +365,7 @@ xfs_trans_free_extent(
>  			agbno, xefi->xefi_blockcount);
>  
>  	error = __xfs_free_extent(tp, xefi->xefi_pag, agbno,
> -			xefi->xefi_blockcount, &oinfo, XFS_AG_RESV_NONE,
> +			xefi->xefi_blockcount, &oinfo, xefi->xefi_type,
>  			xefi->xefi_flags & XFS_EFI_SKIP_DISCARD);
>  
>  	/*
> @@ -644,6 +644,7 @@ xfs_efi_item_recover(
>  	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
>  		struct xfs_extent_free_item	fake = {
>  			.xefi_owner		= XFS_RMAP_OWN_UNKNOWN,
> +			.xefi_type		= XFS_AG_RESV_NONE,
>  		};
>  		struct xfs_extent		*extp;
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index abcc559f3c64..eb9102453aff 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -617,7 +617,8 @@ xfs_reflink_cancel_cow_blocks(
>  					del.br_blockcount);
>  
>  			error = xfs_free_extent_later(*tpp, del.br_startblock,
> -					  del.br_blockcount, NULL);
> +					del.br_blockcount, NULL,
> +					XFS_AG_RESV_NONE);
>  			if (error)
>  				break;
>  
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion
  2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
  2023-06-28  6:03   ` Christoph Hellwig
  2023-06-28  9:55   ` Chandan Babu R
@ 2023-06-28 17:46   ` Darrick J. Wong
  2 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:05AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> XFS has strict metadata ordering requirements. One of the things it
> does is maintain the commit order of items from transaction commit
> through the CIL and into the AIL. That is, if a transaction logs
> item A before item B in a modification, then they will be inserted
> into the CIL in the order {A, B}. These items are then written into
> the iclog during checkpointing in the order {A, B}. When the
> checkpoint commits, they are supposed to be inserted into the AIL in
> the order {A, B}, and when they are pushed from the AIL, they are
> pushed in the order {A, B}.
> 
> If we crash, log recovery then replays the two items from the
> checkpoint in the order {A, B}, resulting in the objects the items
> apply to being queued for writeback at the end of the checkpoint
> in the order {A, B}. This means recovery behaves the same way as the
> runtime code.
> 
> In places, we have subtle dependencies on this ordering being
> maintained. One of this place is performing intent recovery from the
> log. It assumes that recovering an intent will result in a
> non-intent object being the first thing that is modified in the
> recovery transaction, and so when the transaction commits and the
> journal flushes, the first object inserted into the AIL beyond the
> intent recovery range will be a non-intent item.  It uses the
> transistion from intent items to non-intent items to stop the
> recovery pass.
> 
> A recent log recovery issue indicated that an intent was appearing
> as the first item in the AIL beyond the recovery range, hence
> breaking the end of recovery detection that exists.
> 
> Tracing indicated insertion of the items into the AIL was apparently
> occurring in the right order (the intent was last in the commit item
> list), but the intent was appearing first in the AIL. IOWs, the
> order of items in the AIL was {D,C,B,A}, not {A,B,C,D}, and bulk
> insertion was reversing the order of the items in the batch of items
> being inserted.
> 
> Lucky for us, all the items fed to bulk insertion have the same LSN,
> so the reversal of order does not affect the log head/tail tracking
> that is based on the contents of the AIL. It only impacts on code
> that has implicit, subtle dependencies on object order, and AFAICT
> only the intent recovery loop is impacted by it.
> 
> Make sure bulk AIL insertion does not reorder items incorrectly.
> 
> Fixes: 0e57f6a36f9b ("xfs: bulk AIL insertion during transaction commit")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Coulda sworn I already RVBd this but cannot find any evidence I actually
did.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_trans_ail.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 7d4109af193e..1098452e7f95 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -823,7 +823,7 @@ xfs_trans_ail_update_bulk(
>  			trace_xfs_ail_insert(lip, 0, lsn);
>  		}
>  		lip->li_lsn = lsn;
> -		list_add(&lip->li_ail, &tmp);
> +		list_add_tail(&lip->li_ail, &tmp);
>  	}
>  
>  	if (!list_empty(&tmp))
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] xfs: allow extent free intents to be retried
  2023-06-27 22:44 ` [PATCH 4/8] xfs: allow extent free intents to be retried Dave Chinner
@ 2023-06-28 17:48   ` Darrick J. Wong
  2023-06-28 22:57     ` Dave Chinner
  2023-06-29  9:50   ` Chandan Babu R
  1 sibling, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:08AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Extent freeing neeeds to be able to avoid a busy extent deadlock
> when the transaction itself holds the only busy extents in the
> allocation group. This may occur if we have an EFI that contains
> multiple extents to be freed, and the freeing the second intent
> requires the space the first extent free released to expand the
> AGFL. If we block on the busy extent at this point, we deadlock.
> 
> We hold a dirty transaction that contains a entire atomic extent
> free operations within it, so if we can abort the extent free
> operation and commit the progress that we've made, the busy extent
> can be resolved by a log force. Hence we can restart the aborted
> extent free with a new transaction and continue to make
> progress without risking deadlocks.
> 
> To enable this, we need the EFI processing code to be able to handle
> an -EAGAIN error to tell it to commit the current transaction and
> retry again. This mechanism is already built into the defer ops
> processing (used bythe refcount btree modification intents), so
> there's relatively little handling we need to add to the EFI code to
> enable this.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_extfree_item.c | 73 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 70 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 79e65bb6b0a2..098420cbd4a0 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -336,6 +336,34 @@ xfs_trans_get_efd(
>  	return efdp;
>  }
>  
> +/*
> + * Fill the EFD with all extents from the EFI when we need to roll the
> + * transaction and continue with a new EFI.
> + *
> + * This simply copies all the extents in the EFI to the EFD rather than make
> + * assumptions about which extents in the EFI have already been processed. We
> + * currently keep the xefi list in the same order as the EFI extent list, but
> + * that may not always be the case. Copying everything avoids leaving a landmine
> + * were we fail to cancel all the extents in an EFI if the xefi list is
> + * processed in a different order to the extents in the EFI.
> + */
> +static void
> +xfs_efd_from_efi(
> +	struct xfs_efd_log_item	*efdp)
> +{
> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> +	uint                    i;
> +
> +	ASSERT(efip->efi_format.efi_nextents > 0);
> +	ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents);
> +
> +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> +	       efdp->efd_format.efd_extents[i] =
> +		       efip->efi_format.efi_extents[i];
> +	}
> +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> +}
> +
>  /*
>   * Free an extent and log it to the EFD. Note that the transaction is marked
>   * dirty regardless of whether the extent free succeeds or fails to support the
> @@ -378,6 +406,17 @@ xfs_trans_free_extent(
>  	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>  	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
>  
> +	/*
> +	 * If we need a new transaction to make progress, the caller will log a
> +	 * new EFI with the current contents. It will also log an EFD to cancel
> +	 * the existing EFI, and so we need to copy all the unprocessed extents
> +	 * in this EFI to the EFD so this works correctly.
> +	 */
> +	if (error == -EAGAIN) {
> +		xfs_efd_from_efi(efdp);
> +		return error;
> +	}
> +
>  	next_extent = efdp->efd_next_extent;
>  	ASSERT(next_extent < efdp->efd_format.efd_nextents);
>  	extp = &(efdp->efd_format.efd_extents[next_extent]);
> @@ -495,6 +534,13 @@ xfs_extent_free_finish_item(
>  
>  	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>  
> +	/*
> +	 * Don't free the XEFI if we need a new transaction to complete
> +	 * processing of it.
> +	 */
> +	if (error == -EAGAIN)
> +		return error;
> +
>  	xfs_extent_free_put_group(xefi);
>  	kmem_cache_free(xfs_extfree_item_cache, xefi);
>  	return error;
> @@ -620,6 +666,7 @@ xfs_efi_item_recover(
>  	struct xfs_trans		*tp;
>  	int				i;
>  	int				error = 0;
> +	bool				requeue_only = false;
>  
>  	/*
>  	 * First check the validity of the extents described by the
> @@ -653,9 +700,29 @@ xfs_efi_item_recover(
>  		fake.xefi_startblock = extp->ext_start;
>  		fake.xefi_blockcount = extp->ext_len;
>  
> -		xfs_extent_free_get_group(mp, &fake);
> -		error = xfs_trans_free_extent(tp, efdp, &fake);
> -		xfs_extent_free_put_group(&fake);
> +		if (!requeue_only) {
> +			xfs_extent_free_get_group(mp, &fake);
> +			error = xfs_trans_free_extent(tp, efdp, &fake);
> +			xfs_extent_free_put_group(&fake);
> +		}
> +
> +		/*
> +		 * If we can't free the extent without potentially deadlocking,
> +		 * requeue the rest of the extents to a new so that they get
> +		 * run again later with a new transaction context.
> +		 */
> +		if (error == -EAGAIN || requeue_only) {
> +			error = xfs_free_extent_later(tp, fake.xefi_startblock,
> +					fake.xefi_blockcount,
> +					&XFS_RMAP_OINFO_ANY_OWNER,
> +					fake.xefi_type);
> +			if (!error) {
> +				requeue_only = true;
> +				error = 0;

@error is already zero so this is redundant, right?

If yes then I'll remove the line on commit,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> +				continue;
> +			}
> +		};
> +
>  		if (error == -EFSCORRUPTED)
>  			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  					extp, sizeof(*extp));
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/8] xfs: journal geometry is not properly bounds checked
  2023-06-27 22:44 ` [PATCH 6/8] xfs: journal geometry is not properly bounds checked Dave Chinner
  2023-06-28  6:08   ` Christoph Hellwig
@ 2023-06-28 17:50   ` Darrick J. Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If the journal geometry results in a sector or log stripe unit
> validation problem, it indicates that we cannot set the log up to
> safely write to the the journal. In these cases, we must abort the
> mount because the corruption needs external intervention to resolve.
> Similarly, a journal that is too large cannot be written to safely,
> either, so we shouldn't allow those geometries to mount, either.
> 
> If the log is too small, we risk having transaction reservations
> overruning the available log space and the system hanging waiting
> for space it can never provide. This is purely a runtime hang issue,
> not a corruption issue as per the first cases listed above. We abort
> mounts of the log is too small for V5 filesystems, but we must allow
> v4 filesystems to mount because, historically, there was no log size
> validity checking and so some systems may still be out there with
> undersized logs.
> 
> The problem is that on V4 filesystems, when we discover a log
> geometry problem, we skip all the remaining checks and then allow
> the log to continue mounting. This mean that if one of the log size
> checks fails, we skip the log stripe unit check. i.e. we allow the
> mount because a "non-fatal" geometry is violated, and then fail to
> check the hard fail geometries that should fail the mount.
> 
> Move all these fatal checks to the superblock verifier, and add a
> new check for the two log sector size geometry variables having the
> same values. This will prevent any attempt to mount a log that has
> invalid or inconsistent geometries long before we attempt to mount
> the log.
> 
> However, for the minimum log size checks, we can only do that once
> we've setup up the log and calculated all the iclog sizes and
> roundoffs. Hence this needs to remain in the log mount code after
> the log has been initialised. It is also the only case where we
> should allow a v4 filesystem to continue running, so leave that
> handling in place, too.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_sb.c | 56 +++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_log.c       | 47 +++++++++++------------------------
>  2 files changed, 70 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index ba0f17bc1dc0..5e174685a77c 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -412,7 +412,6 @@ xfs_validate_sb_common(
>  	    sbp->sb_inodelog < XFS_DINODE_MIN_LOG			||
>  	    sbp->sb_inodelog > XFS_DINODE_MAX_LOG			||
>  	    sbp->sb_inodesize != (1 << sbp->sb_inodelog)		||
> -	    sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE			||
>  	    sbp->sb_inopblock != howmany(sbp->sb_blocksize,sbp->sb_inodesize) ||
>  	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) < XFS_MIN_AG_BYTES	||
>  	    XFS_FSB_TO_B(mp, sbp->sb_agblocks) > XFS_MAX_AG_BYTES	||
> @@ -430,6 +429,61 @@ xfs_validate_sb_common(
>  		return -EFSCORRUPTED;
>  	}
>  
> +	/*
> +	 * Logs that are too large are not supported at all. Reject them
> +	 * outright. Logs that are too small are tolerated on v4 filesystems,
> +	 * but we can only check that when mounting the log. Hence we skip
> +	 * those checks here.
> +	 */
> +	if (sbp->sb_logblocks > XFS_MAX_LOG_BLOCKS) {
> +		xfs_notice(mp,
> +		"Log size 0x%x blocks too large, maximum size is 0x%llx blocks",
> +			 sbp->sb_logblocks, XFS_MAX_LOG_BLOCKS);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	if (XFS_FSB_TO_B(mp, sbp->sb_logblocks) > XFS_MAX_LOG_BYTES) {
> +		xfs_warn(mp,
> +		"log size 0x%llx bytes too large, maximum size is 0x%llx bytes",
> +			 XFS_FSB_TO_B(mp, sbp->sb_logblocks),
> +			 XFS_MAX_LOG_BYTES);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	/*
> +	 * Do not allow filesystems with corrupted log sector or stripe units to
> +	 * be mounted. We cannot safely size the iclogs or write to the log if
> +	 * the log stripe unit is not valid.
> +	 */
> +	if (sbp->sb_versionnum & XFS_SB_VERSION_SECTORBIT) {
> +		if (sbp->sb_logsectsize != (1U << sbp->sb_logsectlog)) {
> +			xfs_notice(mp,
> +			"log sector size in bytes/log2 (0x%x/0x%x) must match",
> +				sbp->sb_logsectsize, 1U << sbp->sb_logsectlog);
> +			return -EFSCORRUPTED;
> +		}
> +	} else if (sbp->sb_logsectsize || sbp->sb_logsectlog) {
> +		xfs_notice(mp,
> +		"log sector size in bytes/log2 (0x%x/0x%x) are not zero",
> +			sbp->sb_logsectsize, sbp->sb_logsectlog);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	if (sbp->sb_logsunit > 1) {
> +		if (sbp->sb_logsunit % sbp->sb_blocksize) {
> +			xfs_notice(mp,
> +		"log stripe unit 0x%x bytes must be a multiple of block size",
> +				sbp->sb_logsunit);
> +			return -EFSCORRUPTED;
> +		}
> +		if (sbp->sb_logsunit > XLOG_MAX_RECORD_BSIZE) {
> +			xfs_notice(mp,
> +		"log stripe unit 0x%x bytes over maximum size (0x%x bytes)",
> +				sbp->sb_logsunit, XLOG_MAX_RECORD_BSIZE);
> +			return -EFSCORRUPTED;
> +		}
> +	}
> +
>  	/* Validate the realtime geometry; stolen from xfs_repair */
>  	if (sbp->sb_rextsize * sbp->sb_blocksize > XFS_MAX_RTEXTSIZE ||
>  	    sbp->sb_rextsize * sbp->sb_blocksize < XFS_MIN_RTEXTSIZE) {
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fc61cc024023..79004d193e54 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -639,7 +639,6 @@ xfs_log_mount(
>  	int		num_bblks)
>  {
>  	struct xlog	*log;
> -	bool		fatal = xfs_has_crc(mp);
>  	int		error = 0;
>  	int		min_logfsbs;
>  
> @@ -663,53 +662,37 @@ xfs_log_mount(
>  	mp->m_log = log;
>  
>  	/*
> -	 * Validate the given log space and drop a critical message via syslog
> -	 * if the log size is too small that would lead to some unexpected
> -	 * situations in transaction log space reservation stage.
> +	 * Now that we have set up the log and it's internal geometry
> +	 * parameters, we can validate the given log space and drop a critical
> +	 * message via syslog if the log size is too small. A log that is too
> +	 * small can lead to unexpected situations in transaction log space
> +	 * reservation stage. The superblock verifier has already validated all
> +	 * the other log geometry constraints, so we don't have to check those
> +	 * here.
>  	 *
> -	 * Note: we can't just reject the mount if the validation fails.  This
> -	 * would mean that people would have to downgrade their kernel just to
> -	 * remedy the situation as there is no way to grow the log (short of
> -	 * black magic surgery with xfs_db).
> +	 * Note: For v4 filesystems, we can't just reject the mount if the
> +	 * validation fails.  This would mean that people would have to
> +	 * downgrade their kernel just to remedy the situation as there is no
> +	 * way to grow the log (short of black magic surgery with xfs_db).
>  	 *
> -	 * We can, however, reject mounts for CRC format filesystems, as the
> +	 * We can, however, reject mounts for V5 format filesystems, as the
>  	 * mkfs binary being used to make the filesystem should never create a
>  	 * filesystem with a log that is too small.
>  	 */
>  	min_logfsbs = xfs_log_calc_minimum_size(mp);
> -
>  	if (mp->m_sb.sb_logblocks < min_logfsbs) {
>  		xfs_warn(mp,
>  		"Log size %d blocks too small, minimum size is %d blocks",
>  			 mp->m_sb.sb_logblocks, min_logfsbs);
> -		error = -EINVAL;
> -	} else if (mp->m_sb.sb_logblocks > XFS_MAX_LOG_BLOCKS) {
> -		xfs_warn(mp,
> -		"Log size %d blocks too large, maximum size is %lld blocks",
> -			 mp->m_sb.sb_logblocks, XFS_MAX_LOG_BLOCKS);
> -		error = -EINVAL;
> -	} else if (XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks) > XFS_MAX_LOG_BYTES) {
> -		xfs_warn(mp,
> -		"log size %lld bytes too large, maximum size is %lld bytes",
> -			 XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks),
> -			 XFS_MAX_LOG_BYTES);
> -		error = -EINVAL;
> -	} else if (mp->m_sb.sb_logsunit > 1 &&
> -		   mp->m_sb.sb_logsunit % mp->m_sb.sb_blocksize) {
> -		xfs_warn(mp,
> -		"log stripe unit %u bytes must be a multiple of block size",
> -			 mp->m_sb.sb_logsunit);
> -		error = -EINVAL;
> -		fatal = true;
> -	}
> -	if (error) {
> +
>  		/*
>  		 * Log check errors are always fatal on v5; or whenever bad
>  		 * metadata leads to a crash.
>  		 */
> -		if (fatal) {
> +		if (xfs_has_crc(mp)) {
>  			xfs_crit(mp, "AAIEEE! Log failed size checks. Abort!");
>  			ASSERT(0);
> +			error = -EINVAL;
>  			goto out_free_log;
>  		}
>  		xfs_crit(mp, "Log size out of supported range.");
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/8] xfs: AGF length has never been bounds checked
  2023-06-27 22:44 ` [PATCH 7/8] xfs: AGF length has never been " Dave Chinner
@ 2023-06-28 17:52   ` Darrick J. Wong
  2023-06-29  2:09     ` [PATCH 7/8 V2] " Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:11AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AGF verifier does not check that the AGF length field is within
> known good bounds. This has never been checked by runtime kernel
> code (i.e. the lack of verification goes back to 1993) yet we assume
> in many places that it is correct and verify other metdata against
> it.
> 
> Add length verification to the AGF verifier. The length of the AGF
> must be equal to the size of the AG specified in the superblock,
> unless it is the last AG in the filesystem. In that case, it must be
> less than or equal to sb->sb_agblocks and greater than
> XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
> allow to exist.
> 
> This requires a bit of rework of the verifier function. We want to
> verify metadata before we use it to verify other metadata. Hence
> we need to verify the AGF sequence numbers before using them to
> verify the length of the AGF. Then we can verify the AGF length
> before we verify AGFL fields. Then we can verifier other fields that
> are bounds limited by the AGF length.
> 
> And, finally, by calculating agf_length only once into a local
> variable, we can collapse repeated "if (xfs_has_foo() &&"
> conditionaly checks into single checks. This makes the code much
> easier to follow as all the checks for a given feature are obviously
> in the same place.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

LGTM
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 86 +++++++++++++++++++++++----------------
>  1 file changed, 51 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 1e72b91daff6..7c86a69354fb 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2974,6 +2974,7 @@ xfs_agf_verify(
>  {
>  	struct xfs_mount	*mp = bp->b_mount;
>  	struct xfs_agf		*agf = bp->b_addr;
> +	uint32_t		agf_length = be32_to_cpu(agf->agf_length);
>  
>  	if (xfs_has_crc(mp)) {
>  		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
> @@ -2985,18 +2986,43 @@ xfs_agf_verify(
>  	if (!xfs_verify_magic(bp, agf->agf_magicnum))
>  		return __this_address;
>  
> -	if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
> -	      be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
> -	      be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) &&
> -	      be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) &&
> -	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
> +	if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)))
>  		return __this_address;
>  
> -	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks)
> +	/*
> +	 * Both agf_seqno and agf_length need to validated before anything else
> +	 * block number related in the AGF or AGFL can be checked.
> +	 *
> +	 * During growfs operations, the perag is not fully initialised,
> +	 * so we can't use it for any useful checking. growfs ensures we can't
> +	 * use it by using uncached buffers that don't have the perag attached
> +	 * so we can detect and avoid this problem.
> +	 */
> +	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
> +		return __this_address;
> +
> +	/*
> +	 * Only the last AGF in the filesytsem is allowed to be shorter
> +	 * than the AG size recorded in the superblock.
> +	 */
> +	if (agf_length != mp->m_sb.sb_agblocks) {
> +		if (be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1)
> +			return __this_address;
> +		if (agf_length < XFS_MIN_AG_BLOCKS)
> +			return __this_address;
> +		if (agf_length > mp->m_sb.sb_agblocks)
> +			return __this_address;
> +	}
> +
> +	if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp))
> +		return __this_address;
> +	if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp))
> +		return __this_address;
> +	if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp))
>  		return __this_address;
>  
>  	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
> -	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
> +	    be32_to_cpu(agf->agf_freeblks) > agf_length)
>  		return __this_address;
>  
>  	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
> @@ -3007,38 +3033,28 @@ xfs_agf_verify(
>  						mp->m_alloc_maxlevels)
>  		return __this_address;
>  
> -	if (xfs_has_rmapbt(mp) &&
> -	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
> -	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
> -						mp->m_rmap_maxlevels))
> -		return __this_address;
> -
> -	if (xfs_has_rmapbt(mp) &&
> -	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length))
> -		return __this_address;
> -
> -	/*
> -	 * during growfs operations, the perag is not fully initialised,
> -	 * so we can't use it for any useful checking. growfs ensures we can't
> -	 * use it by using uncached buffers that don't have the perag attached
> -	 * so we can detect and avoid this problem.
> -	 */
> -	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
> -		return __this_address;
> -
>  	if (xfs_has_lazysbcount(mp) &&
> -	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
> +	    be32_to_cpu(agf->agf_btreeblks) > agf_length)
>  		return __this_address;
>  
> -	if (xfs_has_reflink(mp) &&
> -	    be32_to_cpu(agf->agf_refcount_blocks) >
> -	    be32_to_cpu(agf->agf_length))
> -		return __this_address;
> +	if (xfs_has_rmapbt(mp)) {
> +		if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length)
> +			return __this_address;
>  
> -	if (xfs_has_reflink(mp) &&
> -	    (be32_to_cpu(agf->agf_refcount_level) < 1 ||
> -	     be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels))
> -		return __this_address;
> +		if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
> +		    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
> +							mp->m_rmap_maxlevels)
> +			return __this_address;
> +	}
> +
> +	if (xfs_has_reflink(mp)) {
> +		if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length)
> +			return __this_address;
> +
> +		if (be32_to_cpu(agf->agf_refcount_level) < 1 ||
> +		    be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)
> +			return __this_address;
> +	}
>  
>  	return NULL;
>  }
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block()
  2023-06-27 22:44 ` [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Dave Chinner
  2023-06-28  6:09   ` Christoph Hellwig
@ 2023-06-28 17:52   ` Darrick J. Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-28 17:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:12AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Need to happen before we allocate and then leak the xefi. Found by
> coverity via an xfsprogs libxfs scan.
> 
> Fixes: 7dfee17b13e5 ("xfs: validate block number being freed before adding to xefi")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

LGTM,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 7c86a69354fb..9919fdfe1d7e 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2470,25 +2470,26 @@ static int
>  xfs_defer_agfl_block(
>  	struct xfs_trans		*tp,
>  	xfs_agnumber_t			agno,
> -	xfs_fsblock_t			agbno,
> +	xfs_agblock_t			agbno,
>  	struct xfs_owner_info		*oinfo)
>  {
>  	struct xfs_mount		*mp = tp->t_mountp;
>  	struct xfs_extent_free_item	*xefi;
> +	xfs_fsblock_t			fsbno = XFS_AGB_TO_FSB(mp, agno, agbno);
>  
>  	ASSERT(xfs_extfree_item_cache != NULL);
>  	ASSERT(oinfo != NULL);
>  
> +	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, fsbno)))
> +		return -EFSCORRUPTED;
> +
>  	xefi = kmem_cache_zalloc(xfs_extfree_item_cache,
>  			       GFP_KERNEL | __GFP_NOFAIL);
> -	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
> +	xefi->xefi_startblock = fsbno;
>  	xefi->xefi_blockcount = 1;
>  	xefi->xefi_owner = oinfo->oi_owner;
>  	xefi->xefi_type = XFS_AG_RESV_AGFL;
>  
> -	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
> -		return -EFSCORRUPTED;
> -
>  	trace_xfs_agfl_free_defer(mp, agno, 0, agbno, 1);
>  
>  	xfs_extent_free_get_group(mp, xefi);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/8] xfs: use deferred frees for btree block freeing
  2023-06-28 17:46   ` Darrick J. Wong
@ 2023-06-28 22:55     ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-28 22:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 10:46:25AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 28, 2023 at 08:44:06AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Btrees that aren't freespace management trees use the normal extent
> > allocation and freeing routines for their blocks. Hence when a btree
> > block is freed, a direct call to xfs_free_extent() is made and the
> > extent is immediately freed. This puts the entire free space
> > management btrees under this path, so we are stacking btrees on
> > btrees in the call stack. The inobt, finobt and refcount btrees
> > all do this.
> > 
> > However, the bmap btree does not do this - it calls
> > xfs_free_extent_later() to defer the extent free operation via an
> > XEFI and hence it gets processed in deferred operation processing
> > during the commit of the primary transaction (i.e. via intent
> > chaining).
> > 
> > We need to change xfs_free_extent() to behave in a non-blocking
> > manner so that we can avoid deadlocks with busy extents near ENOSPC
> > in transactions that free multiple extents. Inserting or removing a
> > record from a btree can cause a multi-level tree merge operation and
> > that will free multiple blocks from the btree in a single
> > transaction. i.e. we can call xfs_free_extent() multiple times, and
> > hence the btree manipulation transaction is vulnerable to this busy
> > extent deadlock vector.
> > 
> > To fix this, convert all the remaining callers of xfs_free_extent()
> > to use xfs_free_extent_later() to queue XEFIs and hence defer
> > processing of the extent frees to a context that can be safely
> > restarted if a deadlock condition is detected.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ag.c             | 2 +-
> >  fs/xfs/libxfs/xfs_alloc.c          | 4 ++++
> >  fs/xfs/libxfs/xfs_alloc.h          | 8 +++++---
> >  fs/xfs/libxfs/xfs_bmap.c           | 8 +++++---
> >  fs/xfs/libxfs/xfs_bmap_btree.c     | 3 ++-
> >  fs/xfs/libxfs/xfs_ialloc.c         | 8 ++++----
> >  fs/xfs/libxfs/xfs_ialloc_btree.c   | 3 +--
> >  fs/xfs/libxfs/xfs_refcount.c       | 9 ++++++---
> >  fs/xfs/libxfs/xfs_refcount_btree.c | 8 +-------
> >  fs/xfs/xfs_extfree_item.c          | 3 ++-
> >  fs/xfs/xfs_reflink.c               | 3 ++-
> >  11 files changed, 33 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > index ee84835ebc66..e9cc481b4ddf 100644
> > --- a/fs/xfs/libxfs/xfs_ag.c
> > +++ b/fs/xfs/libxfs/xfs_ag.c
> > @@ -985,7 +985,7 @@ xfs_ag_shrink_space(
> >  			goto resv_err;
> >  
> >  		err2 = __xfs_free_extent_later(*tpp, args.fsbno, delta, NULL,
> > -				true);
> > +				XFS_AG_RESV_NONE, true);
> >  		if (err2)
> >  			goto resv_err;
> >  
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index c20fe99405d8..cc3f7b905ea1 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2449,6 +2449,7 @@ xfs_defer_agfl_block(
> >  	xefi->xefi_startblock = XFS_AGB_TO_FSB(mp, agno, agbno);
> >  	xefi->xefi_blockcount = 1;
> >  	xefi->xefi_owner = oinfo->oi_owner;
> > +	xefi->xefi_type = XFS_AG_RESV_AGFL;
> >  	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbno(mp, xefi->xefi_startblock)))
> >  		return -EFSCORRUPTED;
> > @@ -2470,6 +2471,7 @@ __xfs_free_extent_later(
> >  	xfs_fsblock_t			bno,
> >  	xfs_filblks_t			len,
> >  	const struct xfs_owner_info	*oinfo,
> > +	enum xfs_ag_resv_type		type,
> >  	bool				skip_discard)
> >  {
> >  	struct xfs_extent_free_item	*xefi;
> > @@ -2490,6 +2492,7 @@ __xfs_free_extent_later(
> >  	ASSERT(agbno + len <= mp->m_sb.sb_agblocks);
> >  #endif
> >  	ASSERT(xfs_extfree_item_cache != NULL);
> > +	ASSERT(type != XFS_AG_RESV_AGFL);
> >  
> >  	if (XFS_IS_CORRUPT(mp, !xfs_verify_fsbext(mp, bno, len)))
> >  		return -EFSCORRUPTED;
> > @@ -2498,6 +2501,7 @@ __xfs_free_extent_later(
> >  			       GFP_KERNEL | __GFP_NOFAIL);
> >  	xefi->xefi_startblock = bno;
> >  	xefi->xefi_blockcount = (xfs_extlen_t)len;
> > +	xefi->xefi_type = type;
> >  	if (skip_discard)
> >  		xefi->xefi_flags |= XFS_EFI_SKIP_DISCARD;
> >  	if (oinfo) {
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 85ac470be0da..121faf1e11ad 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -232,7 +232,7 @@ xfs_buf_to_agfl_bno(
> >  
> >  int __xfs_free_extent_later(struct xfs_trans *tp, xfs_fsblock_t bno,
> >  		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
> > -		bool skip_discard);
> > +		enum xfs_ag_resv_type type, bool skip_discard);
> >  
> >  /*
> >   * List of extents to be free "later".
> > @@ -245,6 +245,7 @@ struct xfs_extent_free_item {
> >  	xfs_extlen_t		xefi_blockcount;/* number of blocks in extent */
> >  	struct xfs_perag	*xefi_pag;
> >  	unsigned int		xefi_flags;
> 
> /me is barely back from vacation, starting to process the ~1100 emails
> by taking care of the obvious bugfixes first...
> 
> > +	enum xfs_ag_resv_type	xefi_type;
> 
> I got confused by 'xefi_type' until I remembered that
> XFS_DEFER_OPS_TYPE_AGFL_FREE / XFS_DEFER_OPS_TYPE_FREE are stuffed in
> the xfs_defer_pending structure, not the xefi itself.
> 
> Could this field be named xefi_agresv instead?

Sure.

> The rest of the logic in this patch looks correct and makes things
> easier for the rt modernization patches, so I'll say
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> and change the name on commit, if that's ok?

That's fine.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] xfs: allow extent free intents to be retried
  2023-06-28 17:48   ` Darrick J. Wong
@ 2023-06-28 22:57     ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-28 22:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 10:48:36AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 28, 2023 at 08:44:08AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > @@ -653,9 +700,29 @@ xfs_efi_item_recover(
> >  		fake.xefi_startblock = extp->ext_start;
> >  		fake.xefi_blockcount = extp->ext_len;
> >  
> > -		xfs_extent_free_get_group(mp, &fake);
> > -		error = xfs_trans_free_extent(tp, efdp, &fake);
> > -		xfs_extent_free_put_group(&fake);
> > +		if (!requeue_only) {
> > +			xfs_extent_free_get_group(mp, &fake);
> > +			error = xfs_trans_free_extent(tp, efdp, &fake);
> > +			xfs_extent_free_put_group(&fake);
> > +		}
> > +
> > +		/*
> > +		 * If we can't free the extent without potentially deadlocking,
> > +		 * requeue the rest of the extents to a new so that they get
> > +		 * run again later with a new transaction context.
> > +		 */
> > +		if (error == -EAGAIN || requeue_only) {
> > +			error = xfs_free_extent_later(tp, fake.xefi_startblock,
> > +					fake.xefi_blockcount,
> > +					&XFS_RMAP_OINFO_ANY_OWNER,
> > +					fake.xefi_type);
> > +			if (!error) {
> > +				requeue_only = true;
> > +				error = 0;
> 
> @error is already zero so this is redundant, right?

Yes. Forgot to remove it when I changed that code to continue and
have errors fall through to the common handling...

> If yes then I'll remove the line on commit,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 7/8 V2] xfs: AGF length has never been bounds checked
  2023-06-28 17:52   ` Darrick J. Wong
@ 2023-06-29  2:09     ` Dave Chinner
  2023-06-29 16:35       ` Darrick J. Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2023-06-29  2:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AGF verifier does not check that the AGF length field is within
known good bounds. This has never been checked by runtime kernel
code (i.e. the lack of verification goes back to 1993) yet we assume
in many places that it is correct and verify other metdata against
it.

Add length verification to the AGF verifier. The length of the AGF
must be equal to the size of the AG specified in the superblock,
unless it is the last AG in the filesystem. In that case, it must be
less than or equal to sb->sb_agblocks and greater than
XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
allow to exist.

This requires a bit of rework of the verifier function. We want to
verify metadata before we use it to verify other metadata. Hence
we need to verify the AGF sequence numbers before using them to
verify the length of the AGF. Then we can verify the AGF length
before we verify AGFL fields. Then we can verifier other fields that
are bounds limited by the AGF length.

And, finally, by calculating agf_length only once into a local
variable, we can collapse repeated "if (xfs_has_foo() &&"
conditionaly checks into single checks. This makes the code much
easier to follow as all the checks for a given feature are obviously
in the same place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---

Version 2:
- growfs will write the new AGFs before the superblock has been
  updated, so we have to skip the new runt AGF seqno check otherwise
  it will fail.

 fs/xfs/libxfs/xfs_alloc.c | 92 +++++++++++++++++++++++++++++------------------
 1 file changed, 57 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 1e72b91daff6..fe7d5ea47b90 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2974,6 +2974,7 @@ xfs_agf_verify(
 {
 	struct xfs_mount	*mp = bp->b_mount;
 	struct xfs_agf		*agf = bp->b_addr;
+	uint32_t		agf_length = be32_to_cpu(agf->agf_length);
 
 	if (xfs_has_crc(mp)) {
 		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
@@ -2985,18 +2986,49 @@ xfs_agf_verify(
 	if (!xfs_verify_magic(bp, agf->agf_magicnum))
 		return __this_address;
 
-	if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
-	      be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
-	      be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) &&
-	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
+	if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)))
 		return __this_address;
 
-	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks)
+	/*
+	 * Both agf_seqno and agf_length need to validated before anything else
+	 * block number related in the AGF or AGFL can be checked.
+	 *
+	 * During growfs operations, the perag is not fully initialised,
+	 * so we can't use it for any useful checking. growfs ensures we can't
+	 * use it by using uncached buffers that don't have the perag attached
+	 * so we can detect and avoid this problem.
+	 */
+	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
+		return __this_address;
+
+	/*
+	 * Only the last AGF in the filesytsem is allowed to be shorter
+	 * than the AG size recorded in the superblock.
+	 */
+	if (agf_length != mp->m_sb.sb_agblocks) {
+		/*
+		 * During growfs, the new last AGF can get here before we
+		 * have updated the superblock. Give it a pass on the seqno
+		 * check.
+		 */
+		if (bp->b_pag &&
+		    be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1)
+			return __this_address;
+		if (agf_length < XFS_MIN_AG_BLOCKS)
+			return __this_address;
+		if (agf_length > mp->m_sb.sb_agblocks)
+			return __this_address;
+	}
+
+	if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp))
+		return __this_address;
+	if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp))
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
-	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_freeblks) > agf_length)
 		return __this_address;
 
 	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
@@ -3007,38 +3039,28 @@ xfs_agf_verify(
 						mp->m_alloc_maxlevels)
 		return __this_address;
 
-	if (xfs_has_rmapbt(mp) &&
-	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
-	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
-						mp->m_rmap_maxlevels))
-		return __this_address;
-
-	if (xfs_has_rmapbt(mp) &&
-	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length))
-		return __this_address;
-
-	/*
-	 * during growfs operations, the perag is not fully initialised,
-	 * so we can't use it for any useful checking. growfs ensures we can't
-	 * use it by using uncached buffers that don't have the perag attached
-	 * so we can detect and avoid this problem.
-	 */
-	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
-		return __this_address;
-
 	if (xfs_has_lazysbcount(mp) &&
-	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
+	    be32_to_cpu(agf->agf_btreeblks) > agf_length)
 		return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    be32_to_cpu(agf->agf_refcount_blocks) >
-	    be32_to_cpu(agf->agf_length))
-		return __this_address;
+	if (xfs_has_rmapbt(mp)) {
+		if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length)
+			return __this_address;
 
-	if (xfs_has_reflink(mp) &&
-	    (be32_to_cpu(agf->agf_refcount_level) < 1 ||
-	     be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels))
-		return __this_address;
+		if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
+		    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
+							mp->m_rmap_maxlevels)
+			return __this_address;
+	}
+
+	if (xfs_has_reflink(mp)) {
+		if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length)
+			return __this_address;
+
+		if (be32_to_cpu(agf->agf_refcount_level) < 1 ||
+		    be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)
+			return __this_address;
+	}
 
 	return NULL;
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/8] xfs: use deferred frees for btree block freeing
  2023-06-27 22:44 ` [PATCH 2/8] xfs: use deferred frees for btree block freeing Dave Chinner
  2023-06-28 17:46   ` Darrick J. Wong
@ 2023-06-29  7:52   ` Chandan Babu R
  1 sibling, 0 replies; 30+ messages in thread
From: Chandan Babu R @ 2023-06-29  7:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:06 AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Btrees that aren't freespace management trees use the normal extent
> allocation and freeing routines for their blocks. Hence when a btree
> block is freed, a direct call to xfs_free_extent() is made and the
> extent is immediately freed. This puts the entire free space
> management btrees under this path, so we are stacking btrees on
> btrees in the call stack. The inobt, finobt and refcount btrees
> all do this.
>
> However, the bmap btree does not do this - it calls
> xfs_free_extent_later() to defer the extent free operation via an
> XEFI and hence it gets processed in deferred operation processing
> during the commit of the primary transaction (i.e. via intent
> chaining).
>
> We need to change xfs_free_extent() to behave in a non-blocking
> manner so that we can avoid deadlocks with busy extents near ENOSPC
> in transactions that free multiple extents. Inserting or removing a
> record from a btree can cause a multi-level tree merge operation and
> that will free multiple blocks from the btree in a single
> transaction. i.e. we can call xfs_free_extent() multiple times, and
> hence the btree manipulation transaction is vulnerable to this busy
> extent deadlock vector.
>
> To fix this, convert all the remaining callers of xfs_free_extent()
> to use xfs_free_extent_later() to queue XEFIs and hence defer
> processing of the extent frees to a context that can be safely
> restarted if a deadlock condition is detected.
>

The changes look good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

-- 
chandan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush()
  2023-06-27 22:44 ` [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
@ 2023-06-29  9:44   ` Chandan Babu R
  0 siblings, 0 replies; 30+ messages in thread
From: Chandan Babu R @ 2023-06-29  9:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:07 AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> To avoid blocking in xfs_extent_busy_flush() when freeing extents
> and the only busy extents are held by the current transaction, we
> need to pass the XFS_ALLOC_FLAG_FREEING flag context all the way
> into xfs_extent_busy_flush().
>

Looks good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 96 +++++++++++++++++++++------------------
>  fs/xfs/libxfs/xfs_alloc.h |  2 +-
>  fs/xfs/xfs_extent_busy.c  |  3 +-
>  fs/xfs/xfs_extent_busy.h  |  2 +-
>  4 files changed, 56 insertions(+), 47 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index cc3f7b905ea1..ba929f6b5c9b 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1536,7 +1536,8 @@ xfs_alloc_ag_vextent_lastblock(
>   */
>  STATIC int
>  xfs_alloc_ag_vextent_near(
> -	struct xfs_alloc_arg	*args)
> +	struct xfs_alloc_arg	*args,
> +	uint32_t		alloc_flags)
>  {
>  	struct xfs_alloc_cur	acur = {};
>  	int			error;		/* error code */
> @@ -1612,7 +1613,7 @@ xfs_alloc_ag_vextent_near(
>  		if (acur.busy) {
>  			trace_xfs_alloc_near_busy(args);
>  			xfs_extent_busy_flush(args->mp, args->pag,
> -					      acur.busy_gen);
> +					      acur.busy_gen, alloc_flags);
>  			goto restart;
>  		}
>  		trace_xfs_alloc_size_neither(args);
> @@ -1635,21 +1636,22 @@ xfs_alloc_ag_vextent_near(
>   * and of the form k * prod + mod unless there's nothing that large.
>   * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
>   */
> -STATIC int				/* error */
> +static int
>  xfs_alloc_ag_vextent_size(
> -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> +	struct xfs_alloc_arg	*args,
> +	uint32_t		alloc_flags)
>  {
> -	struct xfs_agf	*agf = args->agbp->b_addr;
> -	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
> -	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
> -	int		error;		/* error result */
> -	xfs_agblock_t	fbno;		/* start of found freespace */
> -	xfs_extlen_t	flen;		/* length of found freespace */
> -	int		i;		/* temp status variable */
> -	xfs_agblock_t	rbno;		/* returned block number */
> -	xfs_extlen_t	rlen;		/* length of returned extent */
> -	bool		busy;
> -	unsigned	busy_gen;
> +	struct xfs_agf		*agf = args->agbp->b_addr;
> +	struct xfs_btree_cur	*bno_cur;
> +	struct xfs_btree_cur	*cnt_cur;
> +	xfs_agblock_t		fbno;		/* start of found freespace */
> +	xfs_extlen_t		flen;		/* length of found freespace */
> +	xfs_agblock_t		rbno;		/* returned block number */
> +	xfs_extlen_t		rlen;		/* length of returned extent */
> +	bool			busy;
> +	unsigned		busy_gen;
> +	int			error;
> +	int			i;
>  
>  restart:
>  	/*
> @@ -1717,8 +1719,8 @@ xfs_alloc_ag_vextent_size(
>  				xfs_btree_del_cursor(cnt_cur,
>  						     XFS_BTREE_NOERROR);
>  				trace_xfs_alloc_size_busy(args);
> -				xfs_extent_busy_flush(args->mp,
> -							args->pag, busy_gen);
> +				xfs_extent_busy_flush(args->mp, args->pag,
> +						busy_gen, alloc_flags);
>  				goto restart;
>  			}
>  		}
> @@ -1802,7 +1804,8 @@ xfs_alloc_ag_vextent_size(
>  		if (busy) {
>  			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
>  			trace_xfs_alloc_size_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> +			xfs_extent_busy_flush(args->mp, args->pag, busy_gen,
> +					alloc_flags);
>  			goto restart;
>  		}
>  		goto out_nominleft;
> @@ -2572,7 +2575,7 @@ xfs_exact_minlen_extent_available(
>  int			/* error */
>  xfs_alloc_fix_freelist(
>  	struct xfs_alloc_arg	*args,	/* allocation argument structure */
> -	int			flags)	/* XFS_ALLOC_FLAG_... */
> +	uint32_t		alloc_flags)
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	struct xfs_perag	*pag = args->pag;
> @@ -2588,7 +2591,7 @@ xfs_alloc_fix_freelist(
>  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
>  
>  	if (!xfs_perag_initialised_agf(pag)) {
> -		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> +		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
>  		if (error) {
>  			/* Couldn't lock the AGF so skip this AG. */
>  			if (error == -EAGAIN)
> @@ -2604,13 +2607,13 @@ xfs_alloc_fix_freelist(
>  	 */
>  	if (xfs_perag_prefers_metadata(pag) &&
>  	    (args->datatype & XFS_ALLOC_USERDATA) &&
> -	    (flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> -		ASSERT(!(flags & XFS_ALLOC_FLAG_FREEING));
> +	    (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
> +		ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
>  		goto out_agbp_relse;
>  	}
>  
>  	need = xfs_alloc_min_freelist(mp, pag);
> -	if (!xfs_alloc_space_available(args, need, flags |
> +	if (!xfs_alloc_space_available(args, need, alloc_flags |
>  			XFS_ALLOC_FLAG_CHECK))
>  		goto out_agbp_relse;
>  
> @@ -2619,7 +2622,7 @@ xfs_alloc_fix_freelist(
>  	 * Can fail if we're not blocking on locks, and it's held.
>  	 */
>  	if (!agbp) {
> -		error = xfs_alloc_read_agf(pag, tp, flags, &agbp);
> +		error = xfs_alloc_read_agf(pag, tp, alloc_flags, &agbp);
>  		if (error) {
>  			/* Couldn't lock the AGF so skip this AG. */
>  			if (error == -EAGAIN)
> @@ -2634,7 +2637,7 @@ xfs_alloc_fix_freelist(
>  
>  	/* If there isn't enough total space or single-extent, reject it. */
>  	need = xfs_alloc_min_freelist(mp, pag);
> -	if (!xfs_alloc_space_available(args, need, flags))
> +	if (!xfs_alloc_space_available(args, need, alloc_flags))
>  		goto out_agbp_relse;
>  
>  #ifdef DEBUG
> @@ -2672,11 +2675,12 @@ xfs_alloc_fix_freelist(
>  	 */
>  	memset(&targs, 0, sizeof(targs));
>  	/* struct copy below */
> -	if (flags & XFS_ALLOC_FLAG_NORMAP)
> +	if (alloc_flags & XFS_ALLOC_FLAG_NORMAP)
>  		targs.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
>  	else
>  		targs.oinfo = XFS_RMAP_OINFO_AG;
> -	while (!(flags & XFS_ALLOC_FLAG_NOSHRINK) && pag->pagf_flcount > need) {
> +	while (!(alloc_flags & XFS_ALLOC_FLAG_NOSHRINK) &&
> +			pag->pagf_flcount > need) {
>  		error = xfs_alloc_get_freelist(pag, tp, agbp, &bno, 0);
>  		if (error)
>  			goto out_agbp_relse;
> @@ -2704,7 +2708,7 @@ xfs_alloc_fix_freelist(
>  		targs.resv = XFS_AG_RESV_AGFL;
>  
>  		/* Allocate as many blocks as possible at once. */
> -		error = xfs_alloc_ag_vextent_size(&targs);
> +		error = xfs_alloc_ag_vextent_size(&targs, alloc_flags);
>  		if (error)
>  			goto out_agflbp_relse;
>  
> @@ -2714,7 +2718,7 @@ xfs_alloc_fix_freelist(
>  		 * on a completely full ag.
>  		 */
>  		if (targs.agbno == NULLAGBLOCK) {
> -			if (flags & XFS_ALLOC_FLAG_FREEING)
> +			if (alloc_flags & XFS_ALLOC_FLAG_FREEING)
>  				break;
>  			goto out_agflbp_relse;
>  		}
> @@ -3230,7 +3234,7 @@ xfs_alloc_vextent_check_args(
>  static int
>  xfs_alloc_vextent_prepare_ag(
>  	struct xfs_alloc_arg	*args,
> -	uint32_t		flags)
> +	uint32_t		alloc_flags)
>  {
>  	bool			need_pag = !args->pag;
>  	int			error;
> @@ -3239,7 +3243,7 @@ xfs_alloc_vextent_prepare_ag(
>  		args->pag = xfs_perag_get(args->mp, args->agno);
>  
>  	args->agbp = NULL;
> -	error = xfs_alloc_fix_freelist(args, flags);
> +	error = xfs_alloc_fix_freelist(args, alloc_flags);
>  	if (error) {
>  		trace_xfs_alloc_vextent_nofix(args);
>  		if (need_pag)
> @@ -3361,6 +3365,7 @@ xfs_alloc_vextent_this_ag(
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
> +	uint32_t		alloc_flags = 0;
>  	int			error;
>  
>  	ASSERT(args->pag != NULL);
> @@ -3379,9 +3384,9 @@ xfs_alloc_vextent_this_ag(
>  		return error;
>  	}
>  
> -	error = xfs_alloc_vextent_prepare_ag(args, 0);
> +	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  	if (!error && args->agbp)
> -		error = xfs_alloc_ag_vextent_size(args);
> +		error = xfs_alloc_ag_vextent_size(args, alloc_flags);
>  
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, false);
>  }
> @@ -3410,20 +3415,20 @@ xfs_alloc_vextent_iterate_ags(
>  	xfs_agnumber_t		minimum_agno,
>  	xfs_agnumber_t		start_agno,
>  	xfs_agblock_t		target_agbno,
> -	uint32_t		flags)
> +	uint32_t		alloc_flags)
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		restart_agno = minimum_agno;
>  	xfs_agnumber_t		agno;
>  	int			error = 0;
>  
> -	if (flags & XFS_ALLOC_FLAG_TRYLOCK)
> +	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)
>  		restart_agno = 0;
>  restart:
>  	for_each_perag_wrap_range(mp, start_agno, restart_agno,
>  			mp->m_sb.sb_agcount, agno, args->pag) {
>  		args->agno = agno;
> -		error = xfs_alloc_vextent_prepare_ag(args, flags);
> +		error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  		if (error)
>  			break;
>  		if (!args->agbp) {
> @@ -3437,10 +3442,10 @@ xfs_alloc_vextent_iterate_ags(
>  		 */
>  		if (args->agno == start_agno && target_agbno) {
>  			args->agbno = target_agbno;
> -			error = xfs_alloc_ag_vextent_near(args);
> +			error = xfs_alloc_ag_vextent_near(args, alloc_flags);
>  		} else {
>  			args->agbno = 0;
> -			error = xfs_alloc_ag_vextent_size(args);
> +			error = xfs_alloc_ag_vextent_size(args, alloc_flags);
>  		}
>  		break;
>  	}
> @@ -3457,8 +3462,8 @@ xfs_alloc_vextent_iterate_ags(
>  	 * constraining flags by the caller, drop them and retry the allocation
>  	 * without any constraints being set.
>  	 */
> -	if (flags) {
> -		flags = 0;
> +	if (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK) {
> +		alloc_flags &= ~XFS_ALLOC_FLAG_TRYLOCK;
>  		restart_agno = minimum_agno;
>  		goto restart;
>  	}
> @@ -3486,6 +3491,7 @@ xfs_alloc_vextent_start_ag(
>  	xfs_agnumber_t		start_agno;
>  	xfs_agnumber_t		rotorstep = xfs_rotorstep;
>  	bool			bump_rotor = false;
> +	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
>  	int			error;
>  
>  	ASSERT(args->pag == NULL);
> @@ -3512,7 +3518,7 @@ xfs_alloc_vextent_start_ag(
>  
>  	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
>  	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> -			XFS_FSB_TO_AGBNO(mp, target), XFS_ALLOC_FLAG_TRYLOCK);
> +			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
>  
>  	if (bump_rotor) {
>  		if (args->agno == start_agno)
> @@ -3539,6 +3545,7 @@ xfs_alloc_vextent_first_ag(
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
>  	xfs_agnumber_t		start_agno;
> +	uint32_t		alloc_flags = XFS_ALLOC_FLAG_TRYLOCK;
>  	int			error;
>  
>  	ASSERT(args->pag == NULL);
> @@ -3557,7 +3564,7 @@ xfs_alloc_vextent_first_ag(
>  
>  	start_agno = max(minimum_agno, XFS_FSB_TO_AGNO(mp, target));
>  	error = xfs_alloc_vextent_iterate_ags(args, minimum_agno, start_agno,
> -			XFS_FSB_TO_AGBNO(mp, target), 0);
> +			XFS_FSB_TO_AGBNO(mp, target), alloc_flags);
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, true);
>  }
>  
> @@ -3610,6 +3617,7 @@ xfs_alloc_vextent_near_bno(
>  	struct xfs_mount	*mp = args->mp;
>  	xfs_agnumber_t		minimum_agno;
>  	bool			needs_perag = args->pag == NULL;
> +	uint32_t		alloc_flags = 0;
>  	int			error;
>  
>  	if (!needs_perag)
> @@ -3630,9 +3638,9 @@ xfs_alloc_vextent_near_bno(
>  	if (needs_perag)
>  		args->pag = xfs_perag_grab(mp, args->agno);
>  
> -	error = xfs_alloc_vextent_prepare_ag(args, 0);
> +	error = xfs_alloc_vextent_prepare_ag(args, alloc_flags);
>  	if (!error && args->agbp)
> -		error = xfs_alloc_ag_vextent_near(args);
> +		error = xfs_alloc_ag_vextent_near(args, alloc_flags);
>  
>  	return xfs_alloc_vextent_finish(args, minimum_agno, error, needs_perag);
>  }
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 121faf1e11ad..50119ebaede9 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -195,7 +195,7 @@ int xfs_alloc_read_agfl(struct xfs_perag *pag, struct xfs_trans *tp,
>  		struct xfs_buf **bpp);
>  int xfs_free_agfl_block(struct xfs_trans *, xfs_agnumber_t, xfs_agblock_t,
>  			struct xfs_buf *, struct xfs_owner_info *);
> -int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, int flags);
> +int xfs_alloc_fix_freelist(struct xfs_alloc_arg *args, uint32_t alloc_flags);
>  int xfs_free_extent_fix_freelist(struct xfs_trans *tp, struct xfs_perag *pag,
>  		struct xfs_buf **agbp);
>  
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index f3d328e4a440..5f44936eae1c 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -571,7 +571,8 @@ void
>  xfs_extent_busy_flush(
>  	struct xfs_mount	*mp,
>  	struct xfs_perag	*pag,
> -	unsigned		busy_gen)
> +	unsigned		busy_gen,
> +	uint32_t		alloc_flags)
>  {
>  	DEFINE_WAIT		(wait);
>  	int			error;
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 4a118131059f..7a82c689bfa4 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -53,7 +53,7 @@ xfs_extent_busy_trim(struct xfs_alloc_arg *args, xfs_agblock_t *bno,
>  
>  void
>  xfs_extent_busy_flush(struct xfs_mount *mp, struct xfs_perag *pag,
> -	unsigned busy_gen);
> +	unsigned busy_gen, uint32_t alloc_flags);
>  
>  void
>  xfs_extent_busy_wait_all(struct xfs_mount *mp);


-- 
chandan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/8] xfs: allow extent free intents to be retried
  2023-06-27 22:44 ` [PATCH 4/8] xfs: allow extent free intents to be retried Dave Chinner
  2023-06-28 17:48   ` Darrick J. Wong
@ 2023-06-29  9:50   ` Chandan Babu R
  1 sibling, 0 replies; 30+ messages in thread
From: Chandan Babu R @ 2023-06-29  9:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 28, 2023 at 08:44:08 AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Extent freeing neeeds to be able to avoid a busy extent deadlock
> when the transaction itself holds the only busy extents in the
> allocation group. This may occur if we have an EFI that contains
> multiple extents to be freed, and the freeing the second intent
> requires the space the first extent free released to expand the
> AGFL. If we block on the busy extent at this point, we deadlock.
>
> We hold a dirty transaction that contains a entire atomic extent
> free operations within it, so if we can abort the extent free
> operation and commit the progress that we've made, the busy extent
> can be resolved by a log force. Hence we can restart the aborted
> extent free with a new transaction and continue to make
> progress without risking deadlocks.
>
> To enable this, we need the EFI processing code to be able to handle
> an -EAGAIN error to tell it to commit the current transaction and
> retry again. This mechanism is already built into the defer ops
> processing (used bythe refcount btree modification intents), so
> there's relatively little handling we need to add to the EFI code to
> enable this.
>

Looks good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_extfree_item.c | 73 +++++++++++++++++++++++++++++++++++++--
>  1 file changed, 70 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 79e65bb6b0a2..098420cbd4a0 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -336,6 +336,34 @@ xfs_trans_get_efd(
>  	return efdp;
>  }
>  
> +/*
> + * Fill the EFD with all extents from the EFI when we need to roll the
> + * transaction and continue with a new EFI.
> + *
> + * This simply copies all the extents in the EFI to the EFD rather than make
> + * assumptions about which extents in the EFI have already been processed. We
> + * currently keep the xefi list in the same order as the EFI extent list, but
> + * that may not always be the case. Copying everything avoids leaving a landmine
> + * were we fail to cancel all the extents in an EFI if the xefi list is
> + * processed in a different order to the extents in the EFI.
> + */
> +static void
> +xfs_efd_from_efi(
> +	struct xfs_efd_log_item	*efdp)
> +{
> +	struct xfs_efi_log_item *efip = efdp->efd_efip;
> +	uint                    i;
> +
> +	ASSERT(efip->efi_format.efi_nextents > 0);
> +	ASSERT(efdp->efd_next_extent < efip->efi_format.efi_nextents);
> +
> +	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
> +	       efdp->efd_format.efd_extents[i] =
> +		       efip->efi_format.efi_extents[i];
> +	}
> +	efdp->efd_next_extent = efip->efi_format.efi_nextents;
> +}
> +
>  /*
>   * Free an extent and log it to the EFD. Note that the transaction is marked
>   * dirty regardless of whether the extent free succeeds or fails to support the
> @@ -378,6 +406,17 @@ xfs_trans_free_extent(
>  	tp->t_flags |= XFS_TRANS_DIRTY | XFS_TRANS_HAS_INTENT_DONE;
>  	set_bit(XFS_LI_DIRTY, &efdp->efd_item.li_flags);
>  
> +	/*
> +	 * If we need a new transaction to make progress, the caller will log a
> +	 * new EFI with the current contents. It will also log an EFD to cancel
> +	 * the existing EFI, and so we need to copy all the unprocessed extents
> +	 * in this EFI to the EFD so this works correctly.
> +	 */
> +	if (error == -EAGAIN) {
> +		xfs_efd_from_efi(efdp);
> +		return error;
> +	}
> +
>  	next_extent = efdp->efd_next_extent;
>  	ASSERT(next_extent < efdp->efd_format.efd_nextents);
>  	extp = &(efdp->efd_format.efd_extents[next_extent]);
> @@ -495,6 +534,13 @@ xfs_extent_free_finish_item(
>  
>  	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
>  
> +	/*
> +	 * Don't free the XEFI if we need a new transaction to complete
> +	 * processing of it.
> +	 */
> +	if (error == -EAGAIN)
> +		return error;
> +
>  	xfs_extent_free_put_group(xefi);
>  	kmem_cache_free(xfs_extfree_item_cache, xefi);
>  	return error;
> @@ -620,6 +666,7 @@ xfs_efi_item_recover(
>  	struct xfs_trans		*tp;
>  	int				i;
>  	int				error = 0;
> +	bool				requeue_only = false;
>  
>  	/*
>  	 * First check the validity of the extents described by the
> @@ -653,9 +700,29 @@ xfs_efi_item_recover(
>  		fake.xefi_startblock = extp->ext_start;
>  		fake.xefi_blockcount = extp->ext_len;
>  
> -		xfs_extent_free_get_group(mp, &fake);
> -		error = xfs_trans_free_extent(tp, efdp, &fake);
> -		xfs_extent_free_put_group(&fake);
> +		if (!requeue_only) {
> +			xfs_extent_free_get_group(mp, &fake);
> +			error = xfs_trans_free_extent(tp, efdp, &fake);
> +			xfs_extent_free_put_group(&fake);
> +		}
> +
> +		/*
> +		 * If we can't free the extent without potentially deadlocking,
> +		 * requeue the rest of the extents to a new so that they get
> +		 * run again later with a new transaction context.
> +		 */
> +		if (error == -EAGAIN || requeue_only) {
> +			error = xfs_free_extent_later(tp, fake.xefi_startblock,
> +					fake.xefi_blockcount,
> +					&XFS_RMAP_OINFO_ANY_OWNER,
> +					fake.xefi_type);
> +			if (!error) {
> +				requeue_only = true;
> +				error = 0;
> +				continue;
> +			}
> +		};
> +
>  		if (error == -EFSCORRUPTED)
>  			XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
>  					extp, sizeof(*extp));


-- 
chandan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/8 V2] xfs: AGF length has never been bounds checked
  2023-06-29  2:09     ` [PATCH 7/8 V2] " Dave Chinner
@ 2023-06-29 16:35       ` Darrick J. Wong
  2023-06-29 22:33         ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-29 16:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 29, 2023 at 12:09:25PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AGF verifier does not check that the AGF length field is within
> known good bounds. This has never been checked by runtime kernel
> code (i.e. the lack of verification goes back to 1993) yet we assume
> in many places that it is correct and verify other metdata against
> it.
> 
> Add length verification to the AGF verifier. The length of the AGF
> must be equal to the size of the AG specified in the superblock,
> unless it is the last AG in the filesystem. In that case, it must be
> less than or equal to sb->sb_agblocks and greater than
> XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
> allow to exist.
> 
> This requires a bit of rework of the verifier function. We want to
> verify metadata before we use it to verify other metadata. Hence
> we need to verify the AGF sequence numbers before using them to
> verify the length of the AGF. Then we can verify the AGF length
> before we verify AGFL fields. Then we can verifier other fields that
> are bounds limited by the AGF length.
> 
> And, finally, by calculating agf_length only once into a local
> variable, we can collapse repeated "if (xfs_has_foo() &&"
> conditionaly checks into single checks. This makes the code much
> easier to follow as all the checks for a given feature are obviously
> in the same place.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Still looks good to me.  New question: Do we need to validate agi_length
in the AGI verifier too?

--D

> ---
> 
> Version 2:
> - growfs will write the new AGFs before the superblock has been
>   updated, so we have to skip the new runt AGF seqno check otherwise
>   it will fail.
> 
>  fs/xfs/libxfs/xfs_alloc.c | 92 +++++++++++++++++++++++++++++------------------
>  1 file changed, 57 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 1e72b91daff6..fe7d5ea47b90 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2974,6 +2974,7 @@ xfs_agf_verify(
>  {
>  	struct xfs_mount	*mp = bp->b_mount;
>  	struct xfs_agf		*agf = bp->b_addr;
> +	uint32_t		agf_length = be32_to_cpu(agf->agf_length);
>  
>  	if (xfs_has_crc(mp)) {
>  		if (!uuid_equal(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid))
> @@ -2985,18 +2986,49 @@ xfs_agf_verify(
>  	if (!xfs_verify_magic(bp, agf->agf_magicnum))
>  		return __this_address;
>  
> -	if (!(XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)) &&
> -	      be32_to_cpu(agf->agf_freeblks) <= be32_to_cpu(agf->agf_length) &&
> -	      be32_to_cpu(agf->agf_flfirst) < xfs_agfl_size(mp) &&
> -	      be32_to_cpu(agf->agf_fllast) < xfs_agfl_size(mp) &&
> -	      be32_to_cpu(agf->agf_flcount) <= xfs_agfl_size(mp)))
> +	if (!XFS_AGF_GOOD_VERSION(be32_to_cpu(agf->agf_versionnum)))
>  		return __this_address;
>  
> -	if (be32_to_cpu(agf->agf_length) > mp->m_sb.sb_dblocks)
> +	/*
> +	 * Both agf_seqno and agf_length need to validated before anything else
> +	 * block number related in the AGF or AGFL can be checked.
> +	 *
> +	 * During growfs operations, the perag is not fully initialised,
> +	 * so we can't use it for any useful checking. growfs ensures we can't
> +	 * use it by using uncached buffers that don't have the perag attached
> +	 * so we can detect and avoid this problem.
> +	 */
> +	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
> +		return __this_address;
> +
> +	/*
> +	 * Only the last AGF in the filesytsem is allowed to be shorter
> +	 * than the AG size recorded in the superblock.
> +	 */
> +	if (agf_length != mp->m_sb.sb_agblocks) {
> +		/*
> +		 * During growfs, the new last AGF can get here before we
> +		 * have updated the superblock. Give it a pass on the seqno
> +		 * check.
> +		 */
> +		if (bp->b_pag &&
> +		    be32_to_cpu(agf->agf_seqno) != mp->m_sb.sb_agcount - 1)
> +			return __this_address;
> +		if (agf_length < XFS_MIN_AG_BLOCKS)
> +			return __this_address;
> +		if (agf_length > mp->m_sb.sb_agblocks)
> +			return __this_address;
> +	}
> +
> +	if (be32_to_cpu(agf->agf_flfirst) >= xfs_agfl_size(mp))
> +		return __this_address;
> +	if (be32_to_cpu(agf->agf_fllast) >= xfs_agfl_size(mp))
> +		return __this_address;
> +	if (be32_to_cpu(agf->agf_flcount) > xfs_agfl_size(mp))
>  		return __this_address;
>  
>  	if (be32_to_cpu(agf->agf_freeblks) < be32_to_cpu(agf->agf_longest) ||
> -	    be32_to_cpu(agf->agf_freeblks) > be32_to_cpu(agf->agf_length))
> +	    be32_to_cpu(agf->agf_freeblks) > agf_length)
>  		return __this_address;
>  
>  	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
> @@ -3007,38 +3039,28 @@ xfs_agf_verify(
>  						mp->m_alloc_maxlevels)
>  		return __this_address;
>  
> -	if (xfs_has_rmapbt(mp) &&
> -	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
> -	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
> -						mp->m_rmap_maxlevels))
> -		return __this_address;
> -
> -	if (xfs_has_rmapbt(mp) &&
> -	    be32_to_cpu(agf->agf_rmap_blocks) > be32_to_cpu(agf->agf_length))
> -		return __this_address;
> -
> -	/*
> -	 * during growfs operations, the perag is not fully initialised,
> -	 * so we can't use it for any useful checking. growfs ensures we can't
> -	 * use it by using uncached buffers that don't have the perag attached
> -	 * so we can detect and avoid this problem.
> -	 */
> -	if (bp->b_pag && be32_to_cpu(agf->agf_seqno) != bp->b_pag->pag_agno)
> -		return __this_address;
> -
>  	if (xfs_has_lazysbcount(mp) &&
> -	    be32_to_cpu(agf->agf_btreeblks) > be32_to_cpu(agf->agf_length))
> +	    be32_to_cpu(agf->agf_btreeblks) > agf_length)
>  		return __this_address;
>  
> -	if (xfs_has_reflink(mp) &&
> -	    be32_to_cpu(agf->agf_refcount_blocks) >
> -	    be32_to_cpu(agf->agf_length))
> -		return __this_address;
> +	if (xfs_has_rmapbt(mp)) {
> +		if (be32_to_cpu(agf->agf_rmap_blocks) > agf_length)
> +			return __this_address;
>  
> -	if (xfs_has_reflink(mp) &&
> -	    (be32_to_cpu(agf->agf_refcount_level) < 1 ||
> -	     be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels))
> -		return __this_address;
> +		if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
> +		    be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
> +							mp->m_rmap_maxlevels)
> +			return __this_address;
> +	}
> +
> +	if (xfs_has_reflink(mp)) {
> +		if (be32_to_cpu(agf->agf_refcount_blocks) > agf_length)
> +			return __this_address;
> +
> +		if (be32_to_cpu(agf->agf_refcount_level) < 1 ||
> +		    be32_to_cpu(agf->agf_refcount_level) > mp->m_refc_maxlevels)
> +			return __this_address;
> +	}
>  
>  	return NULL;
>  }

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC PATCH 9/8] xfs: AGI length should be bounds checked
  2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
                   ` (7 preceding siblings ...)
  2023-06-27 22:44 ` [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Dave Chinner
@ 2023-06-29 19:42 ` Darrick J. Wong
  2023-06-29 22:35   ` Dave Chinner
  8 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2023-06-29 19:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Similar to the recent patch strengthening the AGF agf_length
verification, the AGI verifier does not check that the AGI length field
is within known good bounds.  This isn't currently checked by runtime
kernel code, yet we assume in many places that it is correct and verify
other metadata against it.

Add length verification to the AGI verifier.  Just like the AGF length
checking, the length of the AGI must be equal to the size of the AG
specified in the superblock, unless it is the last AG in the filesystem.
In that case, it must be less than or equal to sb->sb_agblocks and
greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs
operation will allow to exist.

There's only one place in the filesystem that actually uses agi_length,
but let's not leave it vulnerable to the same weird nonsense that
generates syzbot bugs, eh?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
NOTE: Untested, but this patch builds...
---
 fs/xfs/libxfs/xfs_ialloc.c |   49 ++++++++++++++++++++++++++++++++------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 1e5fafbc0cdb..fec6713e1fa9 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2486,11 +2486,12 @@ xfs_ialloc_log_agi(
 
 static xfs_failaddr_t
 xfs_agi_verify(
-	struct xfs_buf	*bp)
+	struct xfs_buf		*bp)
 {
-	struct xfs_mount *mp = bp->b_mount;
-	struct xfs_agi	*agi = bp->b_addr;
-	int		i;
+	struct xfs_mount	*mp = bp->b_mount;
+	struct xfs_agi		*agi = bp->b_addr;
+	uint32_t		agi_length = be32_to_cpu(agi->agi_length);
+	int			i;
 
 	if (xfs_has_crc(mp)) {
 		if (!uuid_equal(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid))
@@ -2507,6 +2508,37 @@ xfs_agi_verify(
 	if (!XFS_AGI_GOOD_VERSION(be32_to_cpu(agi->agi_versionnum)))
 		return __this_address;
 
+	/*
+	 * Both agi_seqno and agi_length need to validated before anything else
+	 * block number related in the AGI can be checked.
+	 *
+	 * During growfs operations, the perag is not fully initialised,
+	 * so we can't use it for any useful checking. growfs ensures we can't
+	 * use it by using uncached buffers that don't have the perag attached
+	 * so we can detect and avoid this problem.
+	 */
+	if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno)
+		return __this_address;
+
+	/*
+	 * Only the last AGI in the filesytsem is allowed to be shorter
+	 * than the AG size recorded in the superblock.
+	 */
+	if (agi_length != mp->m_sb.sb_agblocks) {
+		/*
+		 * During growfs, the new last AGI can get here before we
+		 * have updated the superblock. Give it a pass on the seqno
+		 * check.
+		 */
+		if (bp->b_pag &&
+		    be32_to_cpu(agi->agi_seqno) != mp->m_sb.sb_agcount - 1)
+			return __this_address;
+		if (agi_length < XFS_MIN_AG_BLOCKS)
+			return __this_address;
+		if (agi_length > mp->m_sb.sb_agblocks)
+			return __this_address;
+	}
+
 	if (be32_to_cpu(agi->agi_level) < 1 ||
 	    be32_to_cpu(agi->agi_level) > M_IGEO(mp)->inobt_maxlevels)
 		return __this_address;
@@ -2516,15 +2548,6 @@ xfs_agi_verify(
 	     be32_to_cpu(agi->agi_free_level) > M_IGEO(mp)->inobt_maxlevels))
 		return __this_address;
 
-	/*
-	 * during growfs operations, the perag is not fully initialised,
-	 * so we can't use it for any useful checking. growfs ensures we can't
-	 * use it by using uncached buffers that don't have the perag attached
-	 * so we can detect and avoid this problem.
-	 */
-	if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno)
-		return __this_address;
-
 	for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) {
 		if (agi->agi_unlinked[i] == cpu_to_be32(NULLAGINO))
 			continue;

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/8 V2] xfs: AGF length has never been bounds checked
  2023-06-29 16:35       ` Darrick J. Wong
@ 2023-06-29 22:33         ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-29 22:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jun 29, 2023 at 09:35:35AM -0700, Darrick J. Wong wrote:
> On Thu, Jun 29, 2023 at 12:09:25PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AGF verifier does not check that the AGF length field is within
> > known good bounds. This has never been checked by runtime kernel
> > code (i.e. the lack of verification goes back to 1993) yet we assume
> > in many places that it is correct and verify other metdata against
> > it.
> > 
> > Add length verification to the AGF verifier. The length of the AGF
> > must be equal to the size of the AG specified in the superblock,
> > unless it is the last AG in the filesystem. In that case, it must be
> > less than or equal to sb->sb_agblocks and greater than
> > XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will
> > allow to exist.
> > 
> > This requires a bit of rework of the verifier function. We want to
> > verify metadata before we use it to verify other metadata. Hence
> > we need to verify the AGF sequence numbers before using them to
> > verify the length of the AGF. Then we can verify the AGF length
> > before we verify AGFL fields. Then we can verifier other fields that
> > are bounds limited by the AGF length.
> > 
> > And, finally, by calculating agf_length only once into a local
> > variable, we can collapse repeated "if (xfs_has_foo() &&"
> > conditionaly checks into single checks. This makes the code much
> > easier to follow as all the checks for a given feature are obviously
> > in the same place.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Still looks good to me.  New question: Do we need to validate agi_length
> in the AGI verifier too?

I'm on the fence about that after the audit I did. It's only used
for bounds checking in one place in xfs_ialloc_ag_alloc() when
trying to do exact inode chunk allocation (for sequential inode
chunk layout). If it's not correct it doesn't matter (too small will
skip exact allocation, too large means exact allocation at EOAG will
fail) as we fall back to an "anywhere near target" allocation that
doesn't care about agi_length.

Hence the agi_length being wrong isn't going to cause fatal errors
at the moment, so it wasn't anywhere near as urgent to "fix" because
the code isn't actually broken right now...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 9/8] xfs: AGI length should be bounds checked
  2023-06-29 19:42 ` [RFC PATCH 9/8] xfs: AGI length should be bounds checked Darrick J. Wong
@ 2023-06-29 22:35   ` Dave Chinner
  0 siblings, 0 replies; 30+ messages in thread
From: Dave Chinner @ 2023-06-29 22:35 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jun 29, 2023 at 12:42:30PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Similar to the recent patch strengthening the AGF agf_length
> verification, the AGI verifier does not check that the AGI length field
> is within known good bounds.  This isn't currently checked by runtime
> kernel code, yet we assume in many places that it is correct and verify
> other metadata against it.
> 
> Add length verification to the AGI verifier.  Just like the AGF length
> checking, the length of the AGI must be equal to the size of the AG
> specified in the superblock, unless it is the last AG in the filesystem.
> In that case, it must be less than or equal to sb->sb_agblocks and
> greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs
> operation will allow to exist.
> 
> There's only one place in the filesystem that actually uses agi_length,
> but let's not leave it vulnerable to the same weird nonsense that
> generates syzbot bugs, eh?
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> NOTE: Untested, but this patch builds...
> ---
>  fs/xfs/libxfs/xfs_ialloc.c |   49 ++++++++++++++++++++++++++++++++------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 1e5fafbc0cdb..fec6713e1fa9 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -2486,11 +2486,12 @@ xfs_ialloc_log_agi(
>  
>  static xfs_failaddr_t
>  xfs_agi_verify(
> -	struct xfs_buf	*bp)
> +	struct xfs_buf		*bp)
>  {
> -	struct xfs_mount *mp = bp->b_mount;
> -	struct xfs_agi	*agi = bp->b_addr;
> -	int		i;
> +	struct xfs_mount	*mp = bp->b_mount;
> +	struct xfs_agi		*agi = bp->b_addr;
> +	uint32_t		agi_length = be32_to_cpu(agi->agi_length);
> +	int			i;
>  
>  	if (xfs_has_crc(mp)) {
>  		if (!uuid_equal(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid))
> @@ -2507,6 +2508,37 @@ xfs_agi_verify(
>  	if (!XFS_AGI_GOOD_VERSION(be32_to_cpu(agi->agi_versionnum)))
>  		return __this_address;
>  
> +	/*
> +	 * Both agi_seqno and agi_length need to validated before anything else
> +	 * block number related in the AGI can be checked.
> +	 *
> +	 * During growfs operations, the perag is not fully initialised,
> +	 * so we can't use it for any useful checking. growfs ensures we can't
> +	 * use it by using uncached buffers that don't have the perag attached
> +	 * so we can detect and avoid this problem.
> +	 */
> +	if (bp->b_pag && be32_to_cpu(agi->agi_seqno) != bp->b_pag->pag_agno)
> +		return __this_address;
> +
> +	/*
> +	 * Only the last AGI in the filesytsem is allowed to be shorter
> +	 * than the AG size recorded in the superblock.
> +	 */
> +	if (agi_length != mp->m_sb.sb_agblocks) {
> +		/*
> +		 * During growfs, the new last AGI can get here before we
> +		 * have updated the superblock. Give it a pass on the seqno
> +		 * check.
> +		 */
> +		if (bp->b_pag &&
> +		    be32_to_cpu(agi->agi_seqno) != mp->m_sb.sb_agcount - 1)
> +			return __this_address;
> +		if (agi_length < XFS_MIN_AG_BLOCKS)
> +			return __this_address;
> +		if (agi_length > mp->m_sb.sb_agblocks)
> +			return __this_address;
> +	}

I'd pull this into a helper function that both the AGF and AGI
verifiers call. It's the same checks, with the same caveats and
growfs landmines, so I think would be better as a helper...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-06-29 22:36 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-27 22:44 [PATCH 0/8 v3] xfs: various fixes for 6.5 Dave Chinner
2023-06-27 22:44 ` [PATCH 1/8] xfs: don't reverse order of items in bulk AIL insertion Dave Chinner
2023-06-28  6:03   ` Christoph Hellwig
2023-06-28  9:55   ` Chandan Babu R
2023-06-28 17:46   ` Darrick J. Wong
2023-06-27 22:44 ` [PATCH 2/8] xfs: use deferred frees for btree block freeing Dave Chinner
2023-06-28 17:46   ` Darrick J. Wong
2023-06-28 22:55     ` Dave Chinner
2023-06-29  7:52   ` Chandan Babu R
2023-06-27 22:44 ` [PATCH 3/8] xfs: pass alloc flags through to xfs_extent_busy_flush() Dave Chinner
2023-06-29  9:44   ` Chandan Babu R
2023-06-27 22:44 ` [PATCH 4/8] xfs: allow extent free intents to be retried Dave Chinner
2023-06-28 17:48   ` Darrick J. Wong
2023-06-28 22:57     ` Dave Chinner
2023-06-29  9:50   ` Chandan Babu R
2023-06-27 22:44 ` [PATCH 5/8] xfs: don't block in busy flushing when freeing extents Dave Chinner
2023-06-27 22:44 ` [PATCH 6/8] xfs: journal geometry is not properly bounds checked Dave Chinner
2023-06-28  6:08   ` Christoph Hellwig
2023-06-28  6:38     ` Dave Chinner
2023-06-28 17:50   ` Darrick J. Wong
2023-06-27 22:44 ` [PATCH 7/8] xfs: AGF length has never been " Dave Chinner
2023-06-28 17:52   ` Darrick J. Wong
2023-06-29  2:09     ` [PATCH 7/8 V2] " Dave Chinner
2023-06-29 16:35       ` Darrick J. Wong
2023-06-29 22:33         ` Dave Chinner
2023-06-27 22:44 ` [PATCH 8/8] xfs: fix bounds check in xfs_defer_agfl_block() Dave Chinner
2023-06-28  6:09   ` Christoph Hellwig
2023-06-28 17:52   ` Darrick J. Wong
2023-06-29 19:42 ` [RFC PATCH 9/8] xfs: AGI length should be bounds checked Darrick J. Wong
2023-06-29 22:35   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox