public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* xfs: bug fixes for 6.4-rcX
@ 2023-05-17  0:04 Dave Chinner
  2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  0:04 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

The following patches are fixes for recently discovered problems.
I'd like to consider them all for a 6.4-rc merge, though really only
patch 2 is a necessary recent regression fix.

The first patch addresses a long standing buffer UAF during shutdown
situations, where shutdown log item completions can race with CIL
abort and iclog log item completions.

The second patch addresses a small performance regression from the
6.3 allocator rewrite which is also a potential AGF-AGF deadlock
vector as it allows out-of-order locking.

The third patch is a change needed by patch 4, which I split out for
clarity. By itself it does nothing.

The fourth patch is a fix for an AGI->AGF->inode cluster buffer lock
ordering deadlock. This was introduced when we started pinning inode
cluster buffers in xfs_trans_log_inode() in 5.8 to fix long-standing
inode reclaim blocking issues, but I've only seen it in the wild
once on a system making heavy use of OVL and hence O_TMPFILE based
operations.

This has all passed through fstests without any issues being
detected, but wider testing would be appreciated along with review.

-Dave.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] xfs: buffer pins need to hold a buffer reference
  2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
@ 2023-05-17  0:04 ` Dave Chinner
  2023-05-17  1:26   ` Darrick J. Wong
  2023-05-17 12:58   ` Christoph Hellwig
  2023-05-17  0:04 ` [PATCH 2/4] xfs: restore allocation trylock iteration Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  0:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count.  If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.

This results in a use after free that can only be triggered
in active filesystem shutdown situations.

To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.
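The reference-per-pin pattern can be illustrated with a minimal userspace model. This is purely illustrative, not XFS code; all of the names (mini_buf, mini_buf_item_pin, and so on) are made up, and the real kernel code uses atomics and has far more state:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the buffer/BLI relationship; names are illustrative. */
struct mini_buf {
	int	b_hold;		/* buffer reference count */
	int	b_pin_count;	/* buffer pin count */
	int	bli_refcount;	/* buffer log item reference count */
	bool	freed;		/* set when the last buffer ref is dropped */
};

static void mini_buf_rele(struct mini_buf *bp)
{
	if (--bp->b_hold == 0)
		bp->freed = true;
}

/* Pin: take a BLI reference *and* a buffer reference per pin. */
static void mini_buf_item_pin(struct mini_buf *bp)
{
	bp->b_hold++;
	bp->bli_refcount++;
	bp->b_pin_count++;
}

/*
 * Unpin: returns true if the buffer was still safe to access after the
 * BLI reference was dropped. The per-pin buffer reference guarantees
 * this even if a racing unpin has already freed the BLI.
 */
static bool mini_buf_item_unpin(struct mini_buf *bp)
{
	bool safe;

	bp->bli_refcount--;
	bp->b_pin_count--;
	safe = !bp->freed;	/* would be a UAF without the per-pin hold */
	mini_buf_rele(bp);	/* drop this pin's buffer reference */
	return safe;
}
```

With two pins outstanding, either unpin can run last and still find the buffer alive, because each pin brought its own buffer reference.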

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 88 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index df7322ed73fa..b2d211730fd2 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -452,10 +452,18 @@ xfs_buf_item_format(
  * This is called to pin the buffer associated with the buf log item in memory
  * so it cannot be written out.
  *
- * We also always take a reference to the buffer log item here so that the bli
- * is held while the item is pinned in memory. This means that we can
- * unconditionally drop the reference count a transaction holds when the
- * transaction is completed.
+ * We take a reference to the buffer log item here so that the BLI life cycle
+ * extends at least until the buffer is unpinned via xfs_buf_item_unpin() and
+ * inserted into the AIL.
+ *
+ * We also need to take a reference to the buffer itself as the BLI unpin
+ * processing requires accessing the buffer after the BLI has dropped the final
+ * BLI reference. See xfs_buf_item_unpin() for an explanation.
+ * If unpins race to drop the final BLI reference and only the BLI owns a
+ * reference to the buffer, then the loser of the race can have the buffer
+ * freed from under it (e.g. on shutdown). Taking a buffer reference per pin
+ * count ensures the life cycle of the buffer extends for as long as we hold
+ * the buffer pin reference in xfs_buf_item_unpin().
  */
 STATIC void
 xfs_buf_item_pin(
@@ -470,13 +478,30 @@ xfs_buf_item_pin(
 
 	trace_xfs_buf_item_pin(bip);
 
+	xfs_buf_hold(bip->bli_buf);
 	atomic_inc(&bip->bli_refcount);
 	atomic_inc(&bip->bli_buf->b_pin_count);
 }
 
 /*
- * This is called to unpin the buffer associated with the buf log item which
- * was previously pinned with a call to xfs_buf_item_pin().
+ * This is called to unpin the buffer associated with the buf log item which was
+ * previously pinned with a call to xfs_buf_item_pin().  We enter this function
+ * with a buffer pin count, a buffer reference and a BLI reference.
+ *
+ * We must drop the BLI reference before we unpin the buffer because the AIL
+ * doesn't acquire a BLI reference whenever it accesses it. Therefore if the
+ * refcount drops to zero, the bli could still be AIL resident and the buffer
+ * submitted for I/O at any point before we return. This can result in IO
+ * completion freeing the buffer while we are still trying to access it here.
+ * This race condition can also occur in shutdown situations where we abort and
+ * unpin buffers from contexts other than journal IO completion.
+ *
+ * Hence we have to hold a buffer reference per pin count to ensure that the
+ * buffer cannot be freed until we have finished processing the unpin operation.
+ * The reference is taken in xfs_buf_item_pin(), and we must hold it until we
+ * are done processing the buffer state. In the case of an abort (remove =
+ * true) then we re-use the current pin reference as the IO reference we hand
+ * off to IO failure handling.
  */
 STATIC void
 xfs_buf_item_unpin(
@@ -493,24 +518,18 @@ xfs_buf_item_unpin(
 
 	trace_xfs_buf_item_unpin(bip);
 
-	/*
-	 * Drop the bli ref associated with the pin and grab the hold required
-	 * for the I/O simulation failure in the abort case. We have to do this
-	 * before the pin count drops because the AIL doesn't acquire a bli
-	 * reference. Therefore if the refcount drops to zero, the bli could
-	 * still be AIL resident and the buffer submitted for I/O (and freed on
-	 * completion) at any point before we return. This can be removed once
-	 * the AIL properly holds a reference on the bli.
-	 */
 	freed = atomic_dec_and_test(&bip->bli_refcount);
-	if (freed && !stale && remove)
-		xfs_buf_hold(bp);
 	if (atomic_dec_and_test(&bp->b_pin_count))
 		wake_up_all(&bp->b_waiters);
 
-	 /* nothing to do but drop the pin count if the bli is active */
-	if (!freed)
+	 /*
+	  * Nothing to do but drop the buffer pin reference if the BLI is
+	  * still active
+	  */
+	if (!freed) {
+		xfs_buf_rele(bp);
 		return;
+	}
 
 	if (stale) {
 		ASSERT(bip->bli_flags & XFS_BLI_STALE);
@@ -522,6 +541,15 @@ xfs_buf_item_unpin(
 
 		trace_xfs_buf_item_unpin_stale(bip);
 
+		/*
+		 * The buffer has been locked and referenced since it was marked
+		 * stale so we own both lock and reference exclusively here. We
+		 * do not need the pin reference any more, so drop it now so
+		 * that we only have one reference to drop once item completion
+		 * processing is complete.
+		 */
+		xfs_buf_rele(bp);
+
 		/*
 		 * If we get called here because of an IO error, we may or may
 		 * not have the item on the AIL. xfs_trans_ail_delete() will
@@ -538,16 +566,30 @@ xfs_buf_item_unpin(
 			ASSERT(bp->b_log_item == NULL);
 		}
 		xfs_buf_relse(bp);
-	} else if (remove) {
+		return;
+	}
+
+	if (remove) {
 		/*
-		 * The buffer must be locked and held by the caller to simulate
-		 * an async I/O failure. We acquired the hold for this case
-		 * before the buffer was unpinned.
+		 * We need to simulate an async IO failure here to ensure that
+		 * the correct error completion is run on this buffer. This
+		 * requires a reference to the buffer and for the buffer to be
+		 * locked. We can safely pass ownership of the pin reference to
+		 * the IO to ensure that nothing can free the buffer while we
+		 * wait for the lock and then run the IO failure completion.
 		 */
 		xfs_buf_lock(bp);
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_ioend_fail(bp);
+		return;
 	}
+
+	/*
+	 * BLI has no more active references - it will be moved to the AIL to
+	 * manage the remaining BLI/buffer life cycle. There is nothing left for
+	 * us to do here so drop the pin reference to the buffer.
+	 */
+	xfs_buf_rele(bp);
 }
 
 STATIC uint
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/4] xfs: restore allocation trylock iteration
  2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
  2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
@ 2023-05-17  0:04 ` Dave Chinner
  2023-05-17  1:11   ` Darrick J. Wong
  2023-05-17 12:59   ` Christoph Hellwig
  2023-05-17  0:04 ` [PATCH 3/4] xfs: deferred work could create precommits Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  0:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The trylock iteration was accidentally dropped when refactoring the
allocation code, resulting in the AG iteration always using blocking
AGF locking. This causes a small performance regression for a specific
fsmark test that runs more user data writer threads than there are
AGs.
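The restored behaviour can be sketched with a userspace model: a first pass over the AGs with a trylock flag set, skipping busy AGs instead of blocking on them out of order, and only a second pass that blocks. This is an illustrative sketch only; the AG array, `TRYLOCK` flag and `pick_ag()` helper are made-up stand-ins for the real per-AG AGF locking:

```c
#include <assert.h>
#include <pthread.h>

#define NUM_AGS	4
#define TRYLOCK	(1 << 0)	/* stand-in for XFS_ALLOC_FLAG_TRYLOCK */

/* One mutex per AG, standing in for the AGF buffer lock. */
static pthread_mutex_t ag_locks[NUM_AGS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/*
 * First pass iterates with TRYLOCK set so busy AGs are skipped rather
 * than blocked on out of order; only the second pass blocks. Returns
 * the locked AG number; the caller unlocks it.
 */
static int pick_ag(int start_agno)
{
	for (int pass = 0; pass < 2; pass++) {
		int flags = (pass == 0) ? TRYLOCK : 0;

		for (int i = 0; i < NUM_AGS; i++) {
			int agno = (start_agno + i) % NUM_AGS;

			if (flags & TRYLOCK) {
				if (pthread_mutex_trylock(&ag_locks[agno]))
					continue;	/* busy: skip it */
			} else {
				pthread_mutex_lock(&ag_locks[agno]);
			}
			return agno;
		}
	}
	return -1;
}
```

If another thread holds AG 0, the trylock pass moves straight on to AG 1 instead of sleeping on AG 0, which is both the performance win and what keeps blocking lock acquisition in order.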

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index fdfa08cbf4db..61eb65be17f3 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3187,7 +3187,8 @@ xfs_alloc_vextent_check_args(
  */
 static int
 xfs_alloc_vextent_prepare_ag(
-	struct xfs_alloc_arg	*args)
+	struct xfs_alloc_arg	*args,
+	uint32_t		flags)
 {
 	bool			need_pag = !args->pag;
 	int			error;
@@ -3196,7 +3197,7 @@ xfs_alloc_vextent_prepare_ag(
 		args->pag = xfs_perag_get(args->mp, args->agno);
 
 	args->agbp = NULL;
-	error = xfs_alloc_fix_freelist(args, 0);
+	error = xfs_alloc_fix_freelist(args, flags);
 	if (error) {
 		trace_xfs_alloc_vextent_nofix(args);
 		if (need_pag)
@@ -3336,7 +3337,7 @@ xfs_alloc_vextent_this_ag(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args);
+	error = xfs_alloc_vextent_prepare_ag(args, 0);
 	if (!error && args->agbp)
 		error = xfs_alloc_ag_vextent_size(args);
 
@@ -3380,7 +3381,7 @@ xfs_alloc_vextent_iterate_ags(
 	for_each_perag_wrap_range(mp, start_agno, restart_agno,
 			mp->m_sb.sb_agcount, agno, args->pag) {
 		args->agno = agno;
-		error = xfs_alloc_vextent_prepare_ag(args);
+		error = xfs_alloc_vextent_prepare_ag(args, flags);
 		if (error)
 			break;
 		if (!args->agbp) {
@@ -3546,7 +3547,7 @@ xfs_alloc_vextent_exact_bno(
 		return error;
 	}
 
-	error = xfs_alloc_vextent_prepare_ag(args);
+	error = xfs_alloc_vextent_prepare_ag(args, 0);
 	if (!error && args->agbp)
 		error = xfs_alloc_ag_vextent_exact(args);
 
@@ -3587,7 +3588,7 @@ xfs_alloc_vextent_near_bno(
 	if (needs_perag)
 		args->pag = xfs_perag_grab(mp, args->agno);
 
-	error = xfs_alloc_vextent_prepare_ag(args);
+	error = xfs_alloc_vextent_prepare_ag(args, 0);
 	if (!error && args->agbp)
 		error = xfs_alloc_ag_vextent_near(args);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/4] xfs: deferred work could create precommits
  2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
  2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
  2023-05-17  0:04 ` [PATCH 2/4] xfs: restore allocation trylock iteration Dave Chinner
@ 2023-05-17  0:04 ` Dave Chinner
  2023-05-17  1:20   ` Darrick J. Wong
  2023-06-01 15:01   ` Christoph Hellwig
  2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
  2023-05-17  7:07 ` xfs: bug fixes for 6.4-rcX Amir Goldstein
  4 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  0:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To fix an AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.
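The failure mode can be shown with a toy model (illustrative names only, not the kernel API): if the final transaction handed back by deferred-op processing never has its precommits run, the dirty state accumulated in it never reaches the log.

```c
#include <assert.h>

/* Toy model; names are illustrative, not kernel code. */
struct toy_tx {
	unsigned int	dirty_flags;	/* per-tx dirty state */
	unsigned int	logged;		/* what actually reaches the log */
};

static void toy_run_precommits(struct toy_tx *tp)
{
	tp->logged |= tp->dirty_flags;
	tp->dirty_flags = 0;
}

/* Deferred op processing: the transaction rolls (precommits run),
 * then the op dirties the inode again in the final transaction. */
static void toy_defer_finish(struct toy_tx *tp)
{
	toy_run_precommits(tp);
	tp->dirty_flags |= 0x4;	/* last change in the attr state machine */
}

static unsigned int toy_commit(struct toy_tx *tp, int run_final_precommit)
{
	toy_defer_finish(tp);
	if (run_final_precommit)	/* the call this patch adds */
		toy_run_precommits(tp);
	return tp->logged;
}
```

Without the final precommit run, the 0x4 change "disappears"; with it, the final transaction's dirty state is captured before commit.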

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 8afc0c080861..664084509af5 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -970,6 +970,11 @@ __xfs_trans_commit(
 		error = xfs_defer_finish_noroll(&tp);
 		if (error)
 			goto out_unreserve;
+
+		/* Run precommits from final tx in defer chain */
+		error = xfs_trans_run_precommits(tp);
+		if (error)
+			goto out_unreserve;
 	}
 
 	/*
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
                   ` (2 preceding siblings ...)
  2023-05-17  0:04 ` [PATCH 3/4] xfs: deferred work could create precommits Dave Chinner
@ 2023-05-17  0:04 ` Dave Chinner
  2023-05-17  1:26   ` Darrick J. Wong
                     ` (2 more replies)
  2023-05-17  7:07 ` xfs: bug fixes for 6.4-rcX Amir Goldstein
  4 siblings, 3 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  0:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.

For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.

Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.

This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.

To fix this we need to move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.

However, we do now have a mechanism that allows us to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.

This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.
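A deterministic precommit lock order also depends on the ->iop_sort key. A minimal sketch of the idea, under assumed names (toy_item, toy_sort_items — not the kernel API): sort the dirty log items by a 64-bit key, the inode number for inode items, so every transaction walks and locks inode cluster buffers in the same order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch of precommit sort-key ordering. */
struct toy_item {
	uint64_t	sort_key;	/* inode number for inode items */
};

static int toy_item_cmp(const void *a, const void *b)
{
	uint64_t ka = ((const struct toy_item *)a)->sort_key;
	uint64_t kb = ((const struct toy_item *)b)->sort_key;

	/* compare rather than subtract: 64-bit differences can
	 * overflow the int return value */
	return (ka > kb) - (ka < kb);
}

/* Sort items before running ->iop_precommit on each, so cluster
 * buffers are always locked in ascending inode-number order. */
static void toy_sort_items(struct toy_item *items, size_t nr)
{
	qsort(items, nr, sizeof(*items), toy_item_cmp);
}
```

With every transaction locking cluster buffers in ascending key order, two transactions touching inodes in the same cluster can no longer lock their buffers in opposite orders.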

Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h  |   9 +-
 fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
 fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode_item.h         |   1 +
 4 files changed, 171 insertions(+), 106 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index f13e0809dc63..269573c82808 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
 #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
 #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
 
-
 /*
  * The timestamps are dirty, but not necessarily anything else in the inode
  * core.  Unlike the other fields above this one must never make it to disk
@@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
  */
 #define XFS_ILOG_TIMESTAMP	0x4000
 
+/*
+ * The version field has been changed, but not necessarily anything else of
+ * interest. This must never make it to disk - it is used purely to ensure that
+ * the inode item ->precommit operation can update the fsync flag triggers
+ * in the inode item correctly.
+ */
+#define XFS_ILOG_IVERSION	0x8000
+
 #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
 				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
 				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 8b5547073379..2d164d0588b1 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -40,9 +40,8 @@ xfs_trans_ijoin(
 	iip->ili_lock_flags = lock_flags;
 	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
 
-	/*
-	 * Get a log_item_desc to point at the new item.
-	 */
+	/* Reset the per-tx dirty context and add the item to the tx. */
+	iip->ili_dirty_flags = 0;
 	xfs_trans_add_item(tp, &iip->ili_item);
 }
 
@@ -76,17 +75,10 @@ xfs_trans_ichgtime(
 /*
  * This is called to mark the fields indicated in fieldmask as needing to be
  * logged when the transaction is committed.  The inode must already be
- * associated with the given transaction.
- *
- * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
- * of the core inode if any of it has changed, and we always log all of the
- * inline data/extents/b-tree root if any of them has changed.
- *
- * Grab and pin the cluster buffer associated with this inode to avoid RMW
- * cycles at inode writeback time. Avoid the need to add error handling to every
- * xfs_trans_log_inode() call by shutting down on read error.  This will cause
- * transactions to fail and everything to error out, just like if we return a
- * read error in a dirty transaction and cancel it.
+ * associated with the given transaction. All we do here is record where the
+ * inode was dirtied and mark the transaction and inode log item dirty;
+ * everything else is done in the ->precommit log item operation after the
+ * changes in the transaction have been completed.
  */
 void
 xfs_trans_log_inode(
@@ -96,7 +88,6 @@ xfs_trans_log_inode(
 {
 	struct xfs_inode_log_item *iip = ip->i_itemp;
 	struct inode		*inode = VFS_I(ip);
-	uint			iversion_flags = 0;
 
 	ASSERT(iip);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
@@ -104,18 +95,6 @@ xfs_trans_log_inode(
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
 
-	/*
-	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
-	 * don't matter - we either will need an extra transaction in 24 hours
-	 * to log the timestamps, or will clear already cleared fields in the
-	 * worst case.
-	 */
-	if (inode->i_state & I_DIRTY_TIME) {
-		spin_lock(&inode->i_lock);
-		inode->i_state &= ~I_DIRTY_TIME;
-		spin_unlock(&inode->i_lock);
-	}
-
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
 	 * counter if it is configured for this to occur. While we have the
@@ -128,86 +107,12 @@ xfs_trans_log_inode(
 	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
 		if (IS_I_VERSION(inode) &&
 		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
-			iversion_flags = XFS_ILOG_CORE;
+			flags |= XFS_ILOG_IVERSION;
 	}
 
-	/*
-	 * If we're updating the inode core or the timestamps and it's possible
-	 * to upgrade this inode to bigtime format, do so now.
-	 */
-	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
-	    xfs_has_bigtime(ip->i_mount) &&
-	    !xfs_inode_has_bigtime(ip)) {
-		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
-		flags |= XFS_ILOG_CORE;
-	}
-
-	/*
-	 * Inode verifiers do not check that the extent size hint is an integer
-	 * multiple of the rt extent size on a directory with both rtinherit
-	 * and extszinherit flags set.  If we're logging a directory that is
-	 * misconfigured in this way, clear the hint.
-	 */
-	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
-	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
-	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
-		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
-				   XFS_DIFLAG_EXTSZINHERIT);
-		ip->i_extsize = 0;
-		flags |= XFS_ILOG_CORE;
-	}
-
-	/*
-	 * Record the specific change for fdatasync optimisation. This allows
-	 * fdatasync to skip log forces for inodes that are only timestamp
-	 * dirty.
-	 */
-	spin_lock(&iip->ili_lock);
-	iip->ili_fsync_fields |= flags;
-
-	if (!iip->ili_item.li_buf) {
-		struct xfs_buf	*bp;
-		int		error;
-
-		/*
-		 * We hold the ILOCK here, so this inode is not going to be
-		 * flushed while we are here. Further, because there is no
-		 * buffer attached to the item, we know that there is no IO in
-		 * progress, so nothing will clear the ili_fields while we read
-		 * in the buffer. Hence we can safely drop the spin lock and
-		 * read the buffer knowing that the state will not change from
-		 * here.
-		 */
-		spin_unlock(&iip->ili_lock);
-		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
-		if (error) {
-			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
-			return;
-		}
-
-		/*
-		 * We need an explicit buffer reference for the log item but
-		 * don't want the buffer to remain attached to the transaction.
-		 * Hold the buffer but release the transaction reference once
-		 * we've attached the inode log item to the buffer log item
-		 * list.
-		 */
-		xfs_buf_hold(bp);
-		spin_lock(&iip->ili_lock);
-		iip->ili_item.li_buf = bp;
-		bp->b_flags |= _XBF_INODES;
-		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
-		xfs_trans_brelse(tp, bp);
-	}
-
-	/*
-	 * Always OR in the bits from the ili_last_fields field.  This is to
-	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
-	 * in the eventual clearing of the ili_fields bits.  See the big comment
-	 * in xfs_iflush() for an explanation of this coordination mechanism.
-	 */
-	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
-	spin_unlock(&iip->ili_lock);
+	iip->ili_dirty_flags |= flags;
+	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
+		ip->i_ino, flags, iip->ili_dirty_flags);
 }
 
 int
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ca2941ab6cbc..586af11b7cd1 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+static uint64_t
+xfs_inode_item_sort(
+	struct xfs_log_item	*lip)
+{
+	return INODE_ITEM(lip)->ili_inode->i_ino;
+}
+
+/*
+ * Prior to finally logging the inode, we have to ensure that all the
+ * per-modification inode state changes are applied. This includes VFS inode
+ * state updates, format conversions, verifier state synchronisation and
+ * ensuring the inode buffer remains in memory whilst the inode is dirty.
+ *
+ * We have to be careful when we grab the inode cluster buffer due to lock
+ * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
+ * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
+ * not locked until ->precommit, so it happens after everything else has been
+ * modified.
+ *
+ * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
+ * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
+ * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
+ * it can be called on an inode (e.g. via bumplink/droplink) before we take the
+ * AGF lock modifying directory blocks.
+ *
+ * Rather than force a complete rework of all the transactions to call
+ * xfs_trans_log_inode() once and once only at the end of every transaction, we
+ * move the pinning of the inode cluster buffer to a ->precommit operation. This
+ * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
+ * ensures that the inode cluster buffer locking is always done last in a
+ * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
+ * cluster buffer.
+ *
+ * If we return the inode number as the precommit sort key then we'll also
+ * guarantee that the order of inode cluster buffer locking is the same for
+ * all the inodes and unlink items in the transaction.
+ */
+static int
+xfs_inode_item_precommit(
+	struct xfs_trans	*tp,
+	struct xfs_log_item	*lip)
+{
+	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_inode	*ip = iip->ili_inode;
+	struct inode		*inode = VFS_I(ip);
+	unsigned int		flags = iip->ili_dirty_flags;
+
+	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
+		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
+	/*
+	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
+	 * don't matter - we either will need an extra transaction in 24 hours
+	 * to log the timestamps, or will clear already cleared fields in the
+	 * worst case.
+	 */
+	if (inode->i_state & I_DIRTY_TIME) {
+		spin_lock(&inode->i_lock);
+		inode->i_state &= ~I_DIRTY_TIME;
+		spin_unlock(&inode->i_lock);
+	}
+
+
+	/*
+	 * If we're updating the inode core or the timestamps and it's possible
+	 * to upgrade this inode to bigtime format, do so now.
+	 */
+	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
+	    xfs_has_bigtime(ip->i_mount) &&
+	    !xfs_inode_has_bigtime(ip)) {
+		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
+		flags |= XFS_ILOG_CORE;
+	}
+
+	/*
+	 * Inode verifiers do not check that the extent size hint is an integer
+	 * multiple of the rt extent size on a directory with both rtinherit
+	 * and extszinherit flags set.  If we're logging a directory that is
+	 * misconfigured in this way, clear the hint.
+	 */
+	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
+	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
+	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
+		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
+				   XFS_DIFLAG_EXTSZINHERIT);
+		ip->i_extsize = 0;
+		flags |= XFS_ILOG_CORE;
+	}
+
+	/*
+	 * Record the specific change for fdatasync optimisation. This allows
+	 * fdatasync to skip log forces for inodes that are only timestamp
+	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
+	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
+	 * (ili_fields) correctly tracks that the version has changed.
+	 */
+	spin_lock(&iip->ili_lock);
+	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
+	if (flags & XFS_ILOG_IVERSION)
+		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
+
+	if (!iip->ili_item.li_buf) {
+		struct xfs_buf	*bp;
+		int		error;
+
+		/*
+		 * We hold the ILOCK here, so this inode is not going to be
+		 * flushed while we are here. Further, because there is no
+		 * buffer attached to the item, we know that there is no IO in
+		 * progress, so nothing will clear the ili_fields while we read
+		 * in the buffer. Hence we can safely drop the spin lock and
+		 * read the buffer knowing that the state will not change from
+		 * here.
+		 */
+		spin_unlock(&iip->ili_lock);
+		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
+		if (error)
+			return error;
+
+		/*
+		 * We need an explicit buffer reference for the log item but
+		 * don't want the buffer to remain attached to the transaction.
+		 * Hold the buffer but release the transaction reference once
+		 * we've attached the inode log item to the buffer log item
+		 * list.
+		 */
+		xfs_buf_hold(bp);
+		spin_lock(&iip->ili_lock);
+		iip->ili_item.li_buf = bp;
+		bp->b_flags |= _XBF_INODES;
+		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
+		xfs_trans_brelse(tp, bp);
+	}
+
+	/*
+	 * Always OR in the bits from the ili_last_fields field.  This is to
+	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
+	 * in the eventual clearing of the ili_fields bits.  See the big comment
+	 * in xfs_iflush() for an explanation of this coordination mechanism.
+	 */
+	iip->ili_fields |= (flags | iip->ili_last_fields);
+	spin_unlock(&iip->ili_lock);
+
+	/*
+	 * We are done with the log item transaction dirty state, so clear it so
+	 * that it doesn't pollute future transactions.
+	 */
+	iip->ili_dirty_flags = 0;
+	return 0;
+}
+
 /*
  * The logged size of an inode fork is always the current size of the inode
  * fork. This means that when an inode fork is relogged, the size of the logged
@@ -662,6 +812,8 @@ xfs_inode_item_committing(
 }
 
 static const struct xfs_item_ops xfs_inode_item_ops = {
+	.iop_sort	= xfs_inode_item_sort,
+	.iop_precommit	= xfs_inode_item_precommit,
 	.iop_size	= xfs_inode_item_size,
 	.iop_format	= xfs_inode_item_format,
 	.iop_pin	= xfs_inode_item_pin,
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index bbd836a44ff0..377e06007804 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -17,6 +17,7 @@ struct xfs_inode_log_item {
 	struct xfs_log_item	ili_item;	   /* common portion */
 	struct xfs_inode	*ili_inode;	   /* inode ptr */
 	unsigned short		ili_lock_flags;	   /* inode lock flags */
+	unsigned int		ili_dirty_flags;   /* dirty in current tx */
 	/*
 	 * The ili_lock protects the interactions between the dirty state and
 	 * the flush state of the inode log item. This allows us to do atomic
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: restore allocation trylock iteration
  2023-05-17  0:04 ` [PATCH 2/4] xfs: restore allocation trylock iteration Dave Chinner
@ 2023-05-17  1:11   ` Darrick J. Wong
  2023-05-17 12:59   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-05-17  1:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:47AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The trylock iteration was accidentally dropped when refactoring the
> allocation code, resulting in the AG iteration always using blocking
> AGF locking. This causes a small performance regression for a specific
> fsmark test that runs more user data writer threads than there are
> AGs.
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Heh, this is exactly what I was thinking from that thread...

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index fdfa08cbf4db..61eb65be17f3 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -3187,7 +3187,8 @@ xfs_alloc_vextent_check_args(
>   */
>  static int
>  xfs_alloc_vextent_prepare_ag(
> -	struct xfs_alloc_arg	*args)
> +	struct xfs_alloc_arg	*args,
> +	uint32_t		flags)
>  {
>  	bool			need_pag = !args->pag;
>  	int			error;
> @@ -3196,7 +3197,7 @@ xfs_alloc_vextent_prepare_ag(
>  		args->pag = xfs_perag_get(args->mp, args->agno);
>  
>  	args->agbp = NULL;
> -	error = xfs_alloc_fix_freelist(args, 0);
> +	error = xfs_alloc_fix_freelist(args, flags);
>  	if (error) {
>  		trace_xfs_alloc_vextent_nofix(args);
>  		if (need_pag)
> @@ -3336,7 +3337,7 @@ xfs_alloc_vextent_this_ag(
>  		return error;
>  	}
>  
> -	error = xfs_alloc_vextent_prepare_ag(args);
> +	error = xfs_alloc_vextent_prepare_ag(args, 0);
>  	if (!error && args->agbp)
>  		error = xfs_alloc_ag_vextent_size(args);
>  
> @@ -3380,7 +3381,7 @@ xfs_alloc_vextent_iterate_ags(
>  	for_each_perag_wrap_range(mp, start_agno, restart_agno,
>  			mp->m_sb.sb_agcount, agno, args->pag) {
>  		args->agno = agno;
> -		error = xfs_alloc_vextent_prepare_ag(args);
> +		error = xfs_alloc_vextent_prepare_ag(args, flags);
>  		if (error)
>  			break;
>  		if (!args->agbp) {
> @@ -3546,7 +3547,7 @@ xfs_alloc_vextent_exact_bno(
>  		return error;
>  	}
>  
> -	error = xfs_alloc_vextent_prepare_ag(args);
> +	error = xfs_alloc_vextent_prepare_ag(args, 0);
>  	if (!error && args->agbp)
>  		error = xfs_alloc_ag_vextent_exact(args);
>  
> @@ -3587,7 +3588,7 @@ xfs_alloc_vextent_near_bno(
>  	if (needs_perag)
>  		args->pag = xfs_perag_grab(mp, args->agno);
>  
> -	error = xfs_alloc_vextent_prepare_ag(args);
> +	error = xfs_alloc_vextent_prepare_ag(args, 0);
>  	if (!error && args->agbp)
>  		error = xfs_alloc_ag_vextent_near(args);
>  
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: defered work could create precommits
  2023-05-17  0:04 ` [PATCH 3/4] xfs: defered work could create precommits Dave Chinner
@ 2023-05-17  1:20   ` Darrick J. Wong
  2023-05-17  1:44     ` Dave Chinner
  2023-06-01 15:01   ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Darrick J. Wong @ 2023-05-17  1:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:48AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
> inode cluster buffer oeprations to the ->iop_precommit() method.

                       operations

> However, this means that deferred operations can require precommits
> to be run on the final transaction that the deferred ops pass back
> to xfs_trans_commit() context. This will be exposed by attribute
> handling, in that the last changes to the inode in the attr set
> state machine "disappear" because the precommit operation is not run.

Wait, what?

OH, this is because the LARP state machine can log the inode in the
final transaction in its chain.  __xfs_trans_commit calls the noroll
variant of xfs_defer_finish, which means that when we get to...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_trans.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 8afc0c080861..664084509af5 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -970,6 +970,11 @@ __xfs_trans_commit(
>  		error = xfs_defer_finish_noroll(&tp);
>  		if (error)
>  			goto out_unreserve;

...here, tp might actually be a dirty transaction.  Prior to
xlog_cil_committing the dirty transaction, we need to run the precommit
hooks or else we don't actually (re)attach the inode cluster buffer to
the transaction, and everything goes batty from there.  Right?

This isn't specific to LARP; any defer item that logs an item with a
precommit hook is subject to this.  Right?

> +
> +		/* Run precomits from final tx in defer chain */

                       precommits

If the answers to the last 2 questions are 'yes' and the spelling errors
get fixed, I think I'm ok enough with this one to throw it at
fstestscloud.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +		error = xfs_trans_run_precommits(tp);
> +		if (error)
> +			goto out_unreserve;
>  	}
>  
>  	/*
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
@ 2023-05-17  1:26   ` Darrick J. Wong
  2023-05-17  1:47     ` Dave Chinner
  2023-06-01  1:51     ` Dave Chinner
  2023-06-01 15:12   ` Christoph Hellwig
  2023-06-25  2:58   ` Matthew Wilcox
  2 siblings, 2 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-05-17  1:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lock order in XFS is AGI -> AGF, hence for operations involving
> inode unlinked list operations we always lock the AGI first. Inode
> unlinked list operations operate on the inode cluster buffer,
> so the lock order there is AGI -> inode cluster buffer.
> 
> For O_TMPFILE operations, this now means the lock order set down in
> xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> unlinked ops are done before the directory modifications that may
> allocate space and lock the AGF.
> 
> Unfortunately, we also now lock the inode cluster buffer when
> logging an inode so that we can attach the inode to the cluster
> buffer and pin it in memory. This creates a lock order of AGF ->
> inode cluster buffer in directory operations as we have to log the
> inode after we've allocated new space for it.
> 
> This creates a lock inversion between the AGF and the inode cluster
> buffer. Because the inode cluster buffer is shared across multiple
> inodes, the inversion is not specific to individual inodes but can
> occur when inodes in the same cluster buffer are accessed in
> different orders.
> 
> To fix this we need to move all the inode log item cluster buffer
> interactions to the end of the current transaction. Unfortunately,
> xfs_trans_log_inode() calls are littered throughout the transactions
> with no thought to ordering against other items or locking. This
> makes it difficult to do anything that involves changing the call
> sites of xfs_trans_log_inode() to change locking orders.
> 
> However, we do now have a mechanism that allows us to postpone dirty
> item processing to just before we commit the transaction: the
> ->iop_precommit method. This will be called after all the
> modifications are done and high level objects like AGI and AGF
> buffers have been locked and modified, thereby providing a mechanism
> that guarantees we don't lock the inode cluster buffer before those
> high level objects are locked.
> 
> This change is largely moving the guts of xfs_trans_log_inode() to
> xfs_inode_item_precommit() and providing an extra flag context in
> the inode log item to track the dirty state of the inode in the
> current transaction. This also means we do a lot less repeated work
> in xfs_trans_log_inode() by only doing it once per transaction when
> all the work is done.

Aha, and that's why you moved all the "opportunistically tweak inode
metadata while we're already logging it" bits to the precommit hook.

> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
>  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode_item.h         |   1 +
>  4 files changed, 171 insertions(+), 106 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index f13e0809dc63..269573c82808 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
>  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
>  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
>  
> -
>  /*
>   * The timestamps are dirty, but not necessarily anything else in the inode
>   * core.  Unlike the other fields above this one must never make it to disk
> @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
>   */
>  #define XFS_ILOG_TIMESTAMP	0x4000
>  
> +/*
> + * The version field has been changed, but not necessarily anything else of
> + * interest. This must never make it to disk - it is used purely to ensure that
> + * the inode item ->precommit operation can update the fsync flag triggers
> + * in the inode item correctly.
> + */
> +#define XFS_ILOG_IVERSION	0x8000
> +
>  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
>  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
>  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..2d164d0588b1 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -40,9 +40,8 @@ xfs_trans_ijoin(
>  	iip->ili_lock_flags = lock_flags;
>  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
> -	/*
> -	 * Get a log_item_desc to point at the new item.
> -	 */
> +	/* Reset the per-tx dirty context and add the item to the tx. */
> +	iip->ili_dirty_flags = 0;
>  	xfs_trans_add_item(tp, &iip->ili_item);
>  }
>  
> @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
>  /*
>   * This is called to mark the fields indicated in fieldmask as needing to be
>   * logged when the transaction is committed.  The inode must already be
> - * associated with the given transaction.
> - *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> - * of the core inode if any of it has changed, and we always log all of the
> - * inline data/extents/b-tree root if any of them has changed.
> - *
> - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> - * cycles at inode writeback time. Avoid the need to add error handling to every
> - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> - * transactions to fail and everything to error out, just like if we return a
> - * read error in a dirty transaction and cancel it.
> + * associated with the given transaction. All we do here is record where the
> + * inode was dirtied and mark the transaction and inode log item dirty;
> + * everything else is done in the ->precommit log item operation after the
> + * changes in the transaction have been completed.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -96,7 +88,6 @@ xfs_trans_log_inode(
>  {
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct inode		*inode = VFS_I(ip);
> -	uint			iversion_flags = 0;
>  
>  	ASSERT(iip);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> @@ -104,18 +95,6 @@ xfs_trans_log_inode(
>  
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  
> -	/*
> -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> -	 * don't matter - we either will need an extra transaction in 24 hours
> -	 * to log the timestamps, or will clear already cleared fields in the
> -	 * worst case.
> -	 */
> -	if (inode->i_state & I_DIRTY_TIME) {
> -		spin_lock(&inode->i_lock);
> -		inode->i_state &= ~I_DIRTY_TIME;
> -		spin_unlock(&inode->i_lock);
> -	}
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -128,86 +107,12 @@ xfs_trans_log_inode(
>  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
>  		if (IS_I_VERSION(inode) &&
>  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			iversion_flags = XFS_ILOG_CORE;
> +			flags |= XFS_ILOG_IVERSION;
>  	}
>  
> -	/*
> -	 * If we're updating the inode core or the timestamps and it's possible
> -	 * to upgrade this inode to bigtime format, do so now.
> -	 */
> -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> -	    xfs_has_bigtime(ip->i_mount) &&
> -	    !xfs_inode_has_bigtime(ip)) {
> -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Inode verifiers do not check that the extent size hint is an integer
> -	 * multiple of the rt extent size on a directory with both rtinherit
> -	 * and extszinherit flags set.  If we're logging a directory that is
> -	 * misconfigured in this way, clear the hint.
> -	 */
> -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> -				   XFS_DIFLAG_EXTSZINHERIT);
> -		ip->i_extsize = 0;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This allows
> -	 * fdatasync to skip log forces for inodes that are only timestamp
> -	 * dirty.
> -	 */
> -	spin_lock(&iip->ili_lock);
> -	iip->ili_fsync_fields |= flags;
> -
> -	if (!iip->ili_item.li_buf) {
> -		struct xfs_buf	*bp;
> -		int		error;
> -
> -		/*
> -		 * We hold the ILOCK here, so this inode is not going to be
> -		 * flushed while we are here. Further, because there is no
> -		 * buffer attached to the item, we know that there is no IO in
> -		 * progress, so nothing will clear the ili_fields while we read
> -		 * in the buffer. Hence we can safely drop the spin lock and
> -		 * read the buffer knowing that the state will not change from
> -		 * here.
> -		 */
> -		spin_unlock(&iip->ili_lock);
> -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> -		if (error) {
> -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> -			return;
> -		}
> -
> -		/*
> -		 * We need an explicit buffer reference for the log item but
> -		 * don't want the buffer to remain attached to the transaction.
> -		 * Hold the buffer but release the transaction reference once
> -		 * we've attached the inode log item to the buffer log item
> -		 * list.
> -		 */
> -		xfs_buf_hold(bp);
> -		spin_lock(&iip->ili_lock);
> -		iip->ili_item.li_buf = bp;
> -		bp->b_flags |= _XBF_INODES;
> -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> -		xfs_trans_brelse(tp, bp);
> -	}
> -
> -	/*
> -	 * Always OR in the bits from the ili_last_fields field.  This is to
> -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> -	 */
> -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> -	spin_unlock(&iip->ili_lock);
> +	iip->ili_dirty_flags |= flags;
> +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> +		ip->i_ino, flags, iip->ili_dirty_flags);

Urk, leftover debugging info?

--D
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ca2941ab6cbc..586af11b7cd1 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_inode_log_item, ili_item);
>  }
>  
> +static uint64_t
> +xfs_inode_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return INODE_ITEM(lip)->ili_inode->i_ino;
> +}
> +
> +/*
> + * Prior to finally logging the inode, we have to ensure that all the
> + * per-modification inode state changes are applied. This includes VFS inode
> + * state updates, format conversions, verifier state synchronisation and
> + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> + *
> + * We have to be careful when we grab the inode cluster buffer due to lock
> + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> + * not locked until ->precommit, so it happens after everything else has been
> + * modified.
> + *
> + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> + * AGF lock modifying directory blocks.
> + *
> + * Rather than force a complete rework of all the transactions to call
> + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> + * ensures that the inode cluster buffer locking is always done last in a
> + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> + * cluster buffer.
> + *
> + * If we return the inode number as the precommit sort key then we'll also
> + * guarantee that the order of all inode cluster buffer locking is the same
> + * for all the inodes and unlink items in the transaction.
> + */
> +static int
> +xfs_inode_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_inode	*ip = iip->ili_inode;
> +	struct inode		*inode = VFS_I(ip);
> +	unsigned int		flags = iip->ili_dirty_flags;
> +
> +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> +	/*
> +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> +	 * don't matter - we either will need an extra transaction in 24 hours
> +	 * to log the timestamps, or will clear already cleared fields in the
> +	 * worst case.
> +	 */
> +	if (inode->i_state & I_DIRTY_TIME) {
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_DIRTY_TIME;
> +		spin_unlock(&inode->i_lock);
> +	}
> +
> +
> +	/*
> +	 * If we're updating the inode core or the timestamps and it's possible
> +	 * to upgrade this inode to bigtime format, do so now.
> +	 */
> +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> +	    xfs_has_bigtime(ip->i_mount) &&
> +	    !xfs_inode_has_bigtime(ip)) {
> +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Inode verifiers do not check that the extent size hint is an integer
> +	 * multiple of the rt extent size on a directory with both rtinherit
> +	 * and extszinherit flags set.  If we're logging a directory that is
> +	 * misconfigured in this way, clear the hint.
> +	 */
> +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> +				   XFS_DIFLAG_EXTSZINHERIT);
> +		ip->i_extsize = 0;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> +	 * (ili_fields) correctly tracks that the version has changed.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> +	if (flags & XFS_ILOG_IVERSION)
> +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> +		if (error)
> +			return error;
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference once
> +		 * we've attached the inode log item to the buffer log item
> +		 * list.
> +		 */
> +		xfs_buf_hold(bp);
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +		bp->b_flags |= _XBF_INODES;
> +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +		xfs_trans_brelse(tp, bp);
> +	}
> +
> +	/*
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> +	 */
> +	iip->ili_fields |= (flags | iip->ili_last_fields);
> +	spin_unlock(&iip->ili_lock);
> +
> +	/*
> +	 * We are done with the log item transaction dirty state, so clear it so
> +	 * that it doesn't pollute future transactions.
> +	 */
> +	iip->ili_dirty_flags = 0;
> +	return 0;
> +}
> +
>  /*
>   * The logged size of an inode fork is always the current size of the inode
>   * fork. This means that when an inode fork is relogged, the size of the logged
> @@ -662,6 +812,8 @@ xfs_inode_item_committing(
>  }
>  
>  static const struct xfs_item_ops xfs_inode_item_ops = {
> +	.iop_sort	= xfs_inode_item_sort,
> +	.iop_precommit	= xfs_inode_item_precommit,
>  	.iop_size	= xfs_inode_item_size,
>  	.iop_format	= xfs_inode_item_format,
>  	.iop_pin	= xfs_inode_item_pin,
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index bbd836a44ff0..377e06007804 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
>  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
>  	/*
>  	 * The ili_lock protects the interactions between the dirty state and
>  	 * the flush state of the inode log item. This allows us to do atomic
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] xfs: buffer pins need to hold a buffer reference
  2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
@ 2023-05-17  1:26   ` Darrick J. Wong
  2023-05-17 12:58   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-05-17  1:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
> the buffer after we've dropped the buffer log item reference count.
> This opens a window where we can have two racing unpins for the
> buffer item (e.g. shutdown checkpoint context callback processing
> racing with journal IO iclog completion processing) and both attempt
> to access the buffer after dropping the BLI reference count.  If we
> are unlucky, the "BLI freed" context wins the race and frees the
> buffer before the "BLI still active" case checks the buffer pin
> count.
> 
> This results in a use after free that can only be triggered
> in active filesystem shutdown situations.
> 
> To fix this, we need to ensure that buffer existence extends beyond
> the BLI reference count checks and until the unpin processing is
> complete. This implies that a buffer pin operation must also take a
> buffer reference to ensure that the buffer cannot be freed until the
> buffer unpin processing is complete.
> 
> Reported-by: yangerkun <yangerkun@huawei.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

LGTM
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 88 ++++++++++++++++++++++++++++++++-----------
>  1 file changed, 65 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index df7322ed73fa..b2d211730fd2 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -452,10 +452,18 @@ xfs_buf_item_format(
>   * This is called to pin the buffer associated with the buf log item in memory
>   * so it cannot be written out.
>   *
> - * We also always take a reference to the buffer log item here so that the bli
> - * is held while the item is pinned in memory. This means that we can
> - * unconditionally drop the reference count a transaction holds when the
> - * transaction is completed.
> + * We take a reference to the buffer log item here so that the BLI life cycle
> + * extends at least until the buffer is unpinned via xfs_buf_item_unpin() and
> + * inserted into the AIL.
> + *
> + * We also need to take a reference to the buffer itself as the BLI unpin
> + * processing requires accessing the buffer after the BLI has dropped the final
> + * BLI reference. See xfs_buf_item_unpin() for an explanation.
> + * If unpins race to drop the final BLI reference and only the BLI owns a
> + * reference to the buffer, then the loser of the race can have the buffer
> + * freed from under it (e.g. on shutdown). Taking a buffer reference per pin
> + * count ensures the life cycle of the buffer extends for as long as we
> + * hold the buffer pin reference in xfs_buf_item_unpin().
>   */
>  STATIC void
>  xfs_buf_item_pin(
> @@ -470,13 +478,30 @@ xfs_buf_item_pin(
>  
>  	trace_xfs_buf_item_pin(bip);
>  
> +	xfs_buf_hold(bip->bli_buf);
>  	atomic_inc(&bip->bli_refcount);
>  	atomic_inc(&bip->bli_buf->b_pin_count);
>  }
>  
>  /*
> - * This is called to unpin the buffer associated with the buf log item which
> - * was previously pinned with a call to xfs_buf_item_pin().
> + * This is called to unpin the buffer associated with the buf log item which was
> + * previously pinned with a call to xfs_buf_item_pin().  We enter this function
> + * with a buffer pin count, a buffer reference and a BLI reference.
> + *
> + * We must drop the BLI reference before we unpin the buffer because the AIL
> + * doesn't acquire a BLI reference whenever it accesses it. Therefore if the
> + * refcount drops to zero, the bli could still be AIL resident and the buffer
> + * submitted for I/O at any point before we return. This can result in IO
> + * completion freeing the buffer while we are still trying to access it here.
> + * This race condition can also occur in shutdown situations where we abort and
> + * unpin buffers from contexts other than journal IO completion.
> + *
> + * Hence we have to hold a buffer reference per pin count to ensure that the
> + * buffer cannot be freed until we have finished processing the unpin operation.
> + * The reference is taken in xfs_buf_item_pin(), and we must hold it until we
> + * are done processing the buffer state. In the case of an abort (remove =
> + * true) then we re-use the current pin reference as the IO reference we hand
> + * off to IO failure handling.
>   */
>  STATIC void
>  xfs_buf_item_unpin(
> @@ -493,24 +518,18 @@ xfs_buf_item_unpin(
>  
>  	trace_xfs_buf_item_unpin(bip);
>  
> -	/*
> -	 * Drop the bli ref associated with the pin and grab the hold required
> -	 * for the I/O simulation failure in the abort case. We have to do this
> -	 * before the pin count drops because the AIL doesn't acquire a bli
> -	 * reference. Therefore if the refcount drops to zero, the bli could
> -	 * still be AIL resident and the buffer submitted for I/O (and freed on
> -	 * completion) at any point before we return. This can be removed once
> -	 * the AIL properly holds a reference on the bli.
> -	 */
>  	freed = atomic_dec_and_test(&bip->bli_refcount);
> -	if (freed && !stale && remove)
> -		xfs_buf_hold(bp);
>  	if (atomic_dec_and_test(&bp->b_pin_count))
>  		wake_up_all(&bp->b_waiters);
>  
> -	 /* nothing to do but drop the pin count if the bli is active */
> -	if (!freed)
> +	 /*
> +	  * Nothing to do but drop the buffer pin reference if the BLI is
> +	  * still active
> +	  */
> +	if (!freed) {
> +		xfs_buf_rele(bp);
>  		return;
> +	}
>  
>  	if (stale) {
>  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> @@ -522,6 +541,15 @@ xfs_buf_item_unpin(
>  
>  		trace_xfs_buf_item_unpin_stale(bip);
>  
> +		/*
> +		 * The buffer has been locked and referenced since it was marked
> +		 * stale so we own both lock and reference exclusively here. We
> +		 * do not need the pin reference any more, so drop it now so
> +		 * that we only have one reference to drop once item completion
> +		 * processing is complete.
> +		 */
> +		xfs_buf_rele(bp);
> +
>  		/*
>  		 * If we get called here because of an IO error, we may or may
>  		 * not have the item on the AIL. xfs_trans_ail_delete() will
> @@ -538,16 +566,30 @@ xfs_buf_item_unpin(
>  			ASSERT(bp->b_log_item == NULL);
>  		}
>  		xfs_buf_relse(bp);
> -	} else if (remove) {
> +		return;
> +	}
> +
> +	if (remove) {
>  		/*
> -		 * The buffer must be locked and held by the caller to simulate
> -		 * an async I/O failure. We acquired the hold for this case
> -		 * before the buffer was unpinned.
> +		 * We need to simulate an async IO failure here to ensure that
> +		 * the correct error completion is run on this buffer. This
> +		 * requires a reference to the buffer and for the buffer to be
> +		 * locked. We can safely pass ownership of the pin reference to
> +		 * the IO to ensure that nothing can free the buffer while we
> +		 * wait for the lock and then run the IO failure completion.
>  		 */
>  		xfs_buf_lock(bp);
>  		bp->b_flags |= XBF_ASYNC;
>  		xfs_buf_ioend_fail(bp);
> +		return;
>  	}
> +
> +	/*
> +	 * BLI has no more active references - it will be moved to the AIL to
> +	 * manage the remaining BLI/buffer life cycle. There is nothing left for
> +	 * us to do here so drop the pin reference to the buffer.
> +	 */
> +	xfs_buf_rele(bp);
>  }
>  
>  STATIC uint
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: defered work could create precommits
  2023-05-17  1:20   ` Darrick J. Wong
@ 2023-05-17  1:44     ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  1:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 16, 2023 at 06:20:59PM -0700, Darrick J. Wong wrote:
> On Wed, May 17, 2023 at 10:04:48AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
> > inode cluster buffer oeprations to the ->iop_precommit() method.
> 
>                        operations
> 
> > However, this means that deferred operations can require precommits
> > to be run on the final transaction that the deferred ops pass back
> > to xfs_trans_commit() context. This will be exposed by attribute
> > handling, in that the last changes to the inode in the attr set
> > state machine "disappear" because the precommit operation is not run.
> 
> Wait, what?

That was exactly the reaction I had when attribute tests failed
unexpectedly. Then a quick bit of traceprintk debugging (which I
probably forgot to remove!) pointed me at shortform attr updates
where no precommit was running...

> OH, this is because the LARP state machine can log the inode in the
> final transaction in its chain.  __xfs_trans_commit calls the noroll
> variant of xfs_defer_finish, which means that when we get to...

Yes.

> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_trans.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 8afc0c080861..664084509af5 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -970,6 +970,11 @@ __xfs_trans_commit(
> >  		error = xfs_defer_finish_noroll(&tp);
> >  		if (error)
> >  			goto out_unreserve;
> 
> ...here, tp might actually be a dirty transaction.  Prior to
> xlog_cil_committing the dirty transaction, we need to run the precommit
> hooks or else we don't actually (re)attach the inode cluster buffer to
> the transaction, and everything goes batty from there.  Right?

Yes.

> This isn't specific to LARP; any defer item that logs an item with a
> precommit hook is subject to this.  Right?

Yes.

But right now, the only precommit hook is the unlinked inode
processing, which doesn't run from defer-ops context...
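The failure mode here can be sketched with a toy model (all names are illustrative, not the kernel's): a per-transaction dirty context only reaches the logged fields when the precommit step runs, so if __xfs_trans_commit() skips precommits on the final transaction handed back by the defer chain, the last changes "disappear" exactly as described above. The sketch below includes the fix, running the precommit after the deferred work:

```c
#include <assert.h>

/* Toy model of patch 3: the final transaction in a defer chain can be
 * dirtied again by deferred work, so precommits must run before the
 * (modelled) CIL commit. Illustrative names only, not kernel code. */
struct toy_item {
	unsigned dirty_flags;	/* per-transaction dirty context */
	unsigned fields;	/* what actually reaches the log */
};

/* ->iop_precommit analogue: fold the dirty context into the fields. */
static void toy_precommit(struct toy_item *ip)
{
	ip->fields |= ip->dirty_flags;
	ip->dirty_flags = 0;
}

/* Deferred work dirties the item again in the final transaction. */
static void toy_defer_finish(struct toy_item *ip)
{
	ip->dirty_flags |= 0x2;
}

static unsigned toy_commit(struct toy_item *ip)
{
	toy_defer_finish(ip);
	toy_precommit(ip);	/* the fix: precommits after defer ops */
	return ip->fields;	/* what the CIL would see */
}
```

Without the toy_precommit() call the returned fields would be 0, i.e. every change would be lost, not just the deferred one.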

> > +
> > +		/* Run precomits from final tx in defer chain */
> 
>                        precommits
> 
> If the answers to the last 2 questions are 'yes' and the spelling errors
> get fixed, I think I'm ok enough with this one to throw it at
> fstestscloud.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  1:26   ` Darrick J. Wong
@ 2023-05-17  1:47     ` Dave Chinner
  2023-06-01  1:51     ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17  1:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> > 
> > For O_TMPFILE operations, this now means the lock order set down in
> > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > unlinked ops are done before the directory modifications that may
> > allocate space and lock the AGF.
> > 
> > Unfortunately, we also now lock the inode cluster buffer when
> > logging an inode so that we can attach the inode to the cluster
> > buffer and pin it in memory. This creates a lock order of AGF ->
> > inode cluster buffer in directory operations as we have to log the
> > inode after we've allocated new space for it.
> > 
> > This creates a lock inversion between the AGF and the inode cluster
> > buffer. Because the inode cluster buffer is shared across multiple
> > inodes, the inversion is not specific to individual inodes but can
> > occur when inodes in the same cluster buffer are accessed in
> > different orders.
> > 
> > To fix this we need to move all the inode log item cluster buffer
> > interactions to the end of the current transaction. Unfortunately,
> > xfs_trans_log_inode() calls are littered throughout the transactions
> > with no thought to ordering against other items or locking. This
> > makes it difficult to do anything that involves changing the call
> > sites of xfs_trans_log_inode() to change locking orders.
> > 
> > However, we do now have a mechanism that allows us to postpone dirty
> > item processing to just before we commit the transaction: the
> > ->iop_precommit method. This will be called after all the
> > modifications are done and high level objects like AGI and AGF
> > buffers have been locked and modified, thereby providing a mechanism
> > that guarantees we don't lock the inode cluster buffer before those
> > high level objects are locked.
> > 
> > This change is largely moving the guts of xfs_trans_log_inode() to
> > xfs_inode_item_precommit() and providing an extra flag context in
> > the inode log item to track the dirty state of the inode in the
> > current transaction. This also means we do a lot less repeated work
> > in xfs_trans_log_inode() by only doing it once per transaction when
> > all the work is done.
> 
> Aha, and that's why you moved all the "opportunistically tweak inode
> metadata while we're already logging it" bits to the precommit hook.

Yes. It didn't make sense to move just some of the "only need to do
once per transaction while the inode is locked" stuff from one place
to the other. I figured it's better to have it all in one place and
do it all just once...
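The lock-order discipline the patch establishes (AGI -> AGF -> inode cluster buffer, with the cluster buffer always taken last from the precommit) can be sketched as a simple rank checker, roughly what lockdep enforces. All names and ranks here are illustrative:

```c
#include <assert.h>

/* Toy model of the lock order patch 4 establishes. Each lock class
 * gets a rank; acquiring a lock with a rank <= the last one taken
 * would be an inversion. Illustrative only, not kernel code. */
enum toy_rank { TOY_AGI = 1, TOY_AGF = 2, TOY_ICLUSTER = 3 };

struct toy_ctx { int last_rank; };

/* Returns 1 if the acquisition respects the order, 0 otherwise. */
static int toy_lock(struct toy_ctx *ctx, enum toy_rank rank)
{
	if (rank <= ctx->last_rank)
		return 0;	/* would invert the lock order */
	ctx->last_rank = rank;
	return 1;
}
```

Moving the cluster buffer locking into ->iop_precommit guarantees the TOY_ICLUSTER acquisition always comes last, so the AGF -> cluster-buffer path from xfs_trans_log_inode() can no longer invert against the AGI -> cluster-buffer path of the unlinked list code.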

....
> > -	/*
> > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > -	 */
> > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > -	spin_unlock(&iip->ili_lock);
> > +	iip->ili_dirty_flags |= flags;
> > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > +		ip->i_ino, flags, iip->ili_dirty_flags);
> 
> Urk, leftover debugging info?

Yes, I just realised I'd left it there when writing the last email.
Ignoring those two trace-printk() statements, the rest of the code
should be ok to review...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xfs: bug fixes for 6.4-rcX
  2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
                   ` (3 preceding siblings ...)
  2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
@ 2023-05-17  7:07 ` Amir Goldstein
  2023-05-17 11:34   ` Dave Chinner
  4 siblings, 1 reply; 23+ messages in thread
From: Amir Goldstein @ 2023-05-17  7:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Chandan Babu R, Leah Rumancik

On Wed, May 17, 2023 at 3:08 AM Dave Chinner <david@fromorbit.com> wrote:
>
> Hi folks,
>

Hi Dave,

> The following patches are fixes for recently discovered problems.
> I'd like to consider them all for a 6.4-rc merge, though really only
> patch 2 is a necessary recent regression fix.
>
> The first patch addresses a long standing buffer UAF during shutdown
> situations, where shutdown log item completions can race with CIL
> abort and iclog log item completions.
>

Can you tell roughly how far back this UAF bug goes
or is it long standing enough to be treated as "forever"?

> The second patch addresses a small performance regression from the
> 6.3 allocator rewrite which is also a potential AGF-AGF deadlock
> vector as it allows out-of-order locking.
>
> The third patch is a change needed by patch 4, which I split out for
> clarity. By itself it does nothing.
>
> The fourth patch is a fix for a AGI->AGF->inode cluster buffer lock
> ordering deadlock. This was introduced when we started pinning inode
> cluster buffers in xfs_trans_log_inode() in 5.8 to fix long-standing
> inode reclaim blocking issues, but I've only seen it in the wild
> once on a system making heavy use of OVL and hence O_TMPFILE based
> operations.
>

Thank you for providing Fixes and this summary with backporting hints :)

Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xfs: bug fixes for 6.4-rcX
  2023-05-17  7:07 ` xfs: bug fixes for 6.4-rcX Amir Goldstein
@ 2023-05-17 11:34   ` Dave Chinner
  2023-05-17 12:48     ` Amir Goldstein
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2023-05-17 11:34 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, Chandan Babu R, Leah Rumancik

On Wed, May 17, 2023 at 10:07:55AM +0300, Amir Goldstein wrote:
> On Wed, May 17, 2023 at 3:08 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Hi folks,
> >
> 
> Hi Dave,
> 
> > The following patches are fixes for recently discovered problems.
> > I'd like to consider them all for a 6.4-rc merge, though really only
> > patch 2 is a necessary recent regression fix.
> >
> > The first patch addresses a long standing buffer UAF during shutdown
> > situations, where shutdown log item completions can race with CIL
> > abort and iclog log item completions.
> >
> 
> Can you tell roughly how far back this UAF bug goes
> or is it long standing enough to be treated as "forever"?

Longer than I cared to trace the history back.

> > The second patch addresses a small performance regression from the
> > 6.3 allocator rewrite which is also a potential AGF-AGF deadlock
> > vector as it allows out-of-order locking.
> >
> > The third patch is a change needed by patch 4, which I split out for
> > clarity. By itself it does nothing.
> >
> > The fourth patch is a fix for a AGI->AGF->inode cluster buffer lock
> > ordering deadlock. This was introduced when we started pinning inode
> > cluster buffers in xfs_trans_log_inode() in 5.8 to fix long-standing
> > inode reclaim blocking issues, but I've only seen it in the wild
> > once on a system making heavy use of OVL and hence O_TMPFILE based
> > operations.
> 
> Thank you for providing Fixes and this summary with backporing hints :)

I don't think you're going to be able to backport the inode cluster
buffer lock fix. Nothing prior to 5.19 has the necessary
infrastructure or the iunlink log item code this latest fix builds
on top of. That was done to fix a whole class of relatively easy to
hit O_TMPFILE related AGI-AGF-inode cluster buffer deadlocks.  This
fix avoids an entirely different class of inode cluster buffer
deadlocks using the infrastructure introduced in the 5.19 deadlock
fixes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: xfs: bug fixes for 6.4-rcX
  2023-05-17 11:34   ` Dave Chinner
@ 2023-05-17 12:48     ` Amir Goldstein
  0 siblings, 0 replies; 23+ messages in thread
From: Amir Goldstein @ 2023-05-17 12:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Chandan Babu R, Leah Rumancik

On Wed, May 17, 2023 at 2:34 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, May 17, 2023 at 10:07:55AM +0300, Amir Goldstein wrote:
> > On Wed, May 17, 2023 at 3:08 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Hi folks,
> > >
> >
> > Hi Dave,
> >
> > > The following patches are fixes for recently discovered problems.
> > > I'd like to consider them all for a 6.4-rc merge, though really only
> > > patch 2 is a necessary recent regression fix.
> > >
> > > The first patch addresses a long standing buffer UAF during shutdown
> > > situations, where shutdown log item completions can race with CIL
> > > abort and iclog log item completions.
> > >
> >
> > Can you tell roughly how far back this UAF bug goes
> > or is it long standing enough to be treated as "forever"?
>
> Longer than I cared to trace the history back.
>
> > > The second patch addresses a small performance regression from the
> > > 6.3 allocator rewrite which is also a potential AGF-AGF deadlock
> > > vector as it allows out-of-order locking.
> > >
> > > The third patch is a change needed by patch 4, which I split out for
> > > clarity. By itself it does nothing.
> > >
> > > The fourth patch is a fix for a AGI->AGF->inode cluster buffer lock
> > > ordering deadlock. This was introduced when we started pinning inode
> > > cluster buffers in xfs_trans_log_inode() in 5.8 to fix long-standing
> > > inode reclaim blocking issues, but I've only seen it in the wild
> > > once on a system making heavy use of OVL and hence O_TMPFILE based
> > > operations.
> >
> > Thank you for providing Fixes and this summary with backporting hints :)
>
> I don't think you're going to be able to backport the inode cluster
> buffer lock fix. Nothing prior to 5.19 has the necessary
> infrastructure or the iunlink log item code this latest fix builds
> on top of. That was done to fix a whole class of relatively easy to
> hit O_TMPFILE related AGI-AGF-inode cluster buffer deadlocks.  This
> fix avoids an entirely different class of inode cluster buffer
> deadlocks using the infrastructure introduced in the 5.19 deadlock
> fixes.
>

Even better. Less work for us ;)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] xfs: buffer pins need to hold a buffer reference
  2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
  2023-05-17  1:26   ` Darrick J. Wong
@ 2023-05-17 12:58   ` Christoph Hellwig
  2023-05-17 22:24     ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-17 12:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> To fix this, we need to ensure that buffer existence extends beyond
> the BLI reference count checks and until the unpin processing is
> complete. This implies that a buffer pin operation must also take a
> buffer reference to ensure that the buffer cannot be freed until the
> buffer unpin processing is complete.

Yeah.  I wonder why we haven't done this from the very beginning..

> +	 /*
> +	  * Nothing to do but drop the buffer pin reference if the BLI is
> +	  * still active
> +	  */

Nit: this block comment is indented by an extra space.

> +	if (!freed) {
> +		xfs_buf_rele(bp);
>  		return;
> +	}
>  
>  	if (stale) {

Nit: this is the only use of the stale variable now, so we might
as well just drop it.

>  		ASSERT(bip->bli_flags & XFS_BLI_STALE);

.. which then also clearly shows this ASSERT is pointless now.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: restore allocation trylock iteration
  2023-05-17  0:04 ` [PATCH 2/4] xfs: restore allocation trylock iteration Dave Chinner
  2023-05-17  1:11   ` Darrick J. Wong
@ 2023-05-17 12:59   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-05-17 12:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] xfs: buffer pins need to hold a buffer reference
  2023-05-17 12:58   ` Christoph Hellwig
@ 2023-05-17 22:24     ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2023-05-17 22:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, May 17, 2023 at 05:58:55AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 10:04:46AM +1000, Dave Chinner wrote:
> > To fix this, we need to ensure that buffer existence extends beyond
> > the BLI reference count checks and until the unpin processing is
> > complete. This implies that a buffer pin operation must also take a
> > buffer reference to ensure that the buffer cannot be freed until the
> > buffer unpin processing is complete.
> 
> Yeah.  I wonder why we haven't done this from the very beginning..

Likely because the whole BLI lifecycle is completely screwed up and
nobody has had the time to understand it fully and fix it properly.

> > +	 /*
> > +	  * Nothing to do but drop the buffer pin reference if the BLI is
> > +	  * still active
> > +	  */
> 
> Nit: this block comment is indented by an extra space.
> 
> > +	if (!freed) {
> > +		xfs_buf_rele(bp);
> >  		return;
> > +	}
> >  
> >  	if (stale) {
> 
> Nit: this is the only use of the stale variable now, so we might
> as well just drop it.

Actually, after we've dropped the bli reference, it isn't safe to
reference the bli unless we know the buffer is stale. In this case,
we know the bli still exists because the buffer has been locked
since it was marked stale. However, for the other cases the BLI
could be freed from under us as its reference count is zero and so
the next call to xfs_buf_item_relse() will free it no matter where
it comes from.

The reference counting around BLIs is a total mess - this patch just
gets rid of one landmine, but there are still plenty more in this code
that need to be untangled.
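The pin/unpin lifetime rule this patch introduces can be sketched as a toy refcount model (all names are illustrative, not the kernel's): pinning takes its own buffer reference, so even if the BLI drops its reference during a shutdown abort, the buffer survives until unpin processing finishes:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of patch 1: a buffer pin holds a buffer reference, so the
 * buffer cannot be freed out from under unpin processing. */
struct toy_buf {
	int  refcount;
	int  pincount;
	bool freed;
};

static void toy_buf_hold(struct toy_buf *bp) { bp->refcount++; }

static void toy_buf_rele(struct toy_buf *bp)
{
	if (--bp->refcount == 0)
		bp->freed = true;	/* last reference frees the buffer */
}

/* Pin analogue: take a buffer reference alongside the pin count. */
static void toy_item_pin(struct toy_buf *bp)
{
	toy_buf_hold(bp);
	bp->pincount++;
}

/* Unpin analogue: the pin reference is only dropped once unpin
 * processing is complete, so bp stays valid throughout. */
static void toy_item_unpin(struct toy_buf *bp)
{
	bp->pincount--;
	assert(!bp->freed);	/* safe: pin reference still held */
	toy_buf_rele(bp);
}
```

In the pre-patch model the pin would not take toy_buf_hold(), and the intermediate toy_buf_rele() (the CIL abort dropping the BLI reference) would free the buffer while it was still pinned - the UAF being fixed.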

> >  		ASSERT(bip->bli_flags & XFS_BLI_STALE);
> 
> .. which then also clearly shows this ASSERT is pointless now.

*nod*

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  1:26   ` Darrick J. Wong
  2023-05-17  1:47     ` Dave Chinner
@ 2023-06-01  1:51     ` Dave Chinner
  2023-06-01 14:38       ` Darrick J. Wong
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2023-06-01  1:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Friendly Ping.

Apart from the stray trace_printk()s I forgot to remove, are
there any other problems with this patch I need to fix?

-Dave.

On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> > 
> > For O_TMPFILE operations, this now means the lock order set down in
> > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > unlinked ops are done before the directory modifications that may
> > allocate space and lock the AGF.
> > 
> > Unfortunately, we also now lock the inode cluster buffer when
> > logging an inode so that we can attach the inode to the cluster
> > buffer and pin it in memory. This creates a lock order of AGF ->
> > inode cluster buffer in directory operations as we have to log the
> > inode after we've allocated new space for it.
> > 
> > This creates a lock inversion between the AGF and the inode cluster
> > buffer. Because the inode cluster buffer is shared across multiple
> > inodes, the inversion is not specific to individual inodes but can
> > occur when inodes in the same cluster buffer are accessed in
> > different orders.
> > 
> > To fix this we need to move all the inode log item cluster buffer
> > interactions to the end of the current transaction. Unfortunately,
> > xfs_trans_log_inode() calls are littered throughout the transactions
> > with no thought to ordering against other items or locking. This
> > makes it difficult to do anything that involves changing the call
> > sites of xfs_trans_log_inode() to change locking orders.
> > 
> > However, we do now have a mechanism that allows us to postpone dirty
> > item processing to just before we commit the transaction: the
> > ->iop_precommit method. This will be called after all the
> > modifications are done and high level objects like AGI and AGF
> > buffers have been locked and modified, thereby providing a mechanism
> > that guarantees we don't lock the inode cluster buffer before those
> > high level objects are locked.
> > 
> > This change is largely moving the guts of xfs_trans_log_inode() to
> > xfs_inode_item_precommit() and providing an extra flag context in
> > the inode log item to track the dirty state of the inode in the
> > current transaction. This also means we do a lot less repeated work
> > in xfs_trans_log_inode() by only doing it once per transaction when
> > all the work is done.
> 
> Aha, and that's why you moved all the "opportunistically tweak inode
> metadata while we're already logging it" bits to the precommit hook.
> 
> > Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
> >  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
> >  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_inode_item.h         |   1 +
> >  4 files changed, 171 insertions(+), 106 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index f13e0809dc63..269573c82808 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
> >  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
> >  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
> >  
> > -
> >  /*
> >   * The timestamps are dirty, but not necessarily anything else in the inode
> >   * core.  Unlike the other fields above this one must never make it to disk
> > @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
> >   */
> >  #define XFS_ILOG_TIMESTAMP	0x4000
> >  
> > +/*
> > + * The version field has been changed, but not necessarily anything else of
> > + * interest. This must never make it to disk - it is used purely to ensure that
> > + * the inode item ->precommit operation can update the fsync flag triggers
> > + * in the inode item correctly.
> > + */
> > +#define XFS_ILOG_IVERSION	0x8000
> > +
> >  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
> >  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
> >  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 8b5547073379..2d164d0588b1 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -40,9 +40,8 @@ xfs_trans_ijoin(
> >  	iip->ili_lock_flags = lock_flags;
> >  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
> >  
> > -	/*
> > -	 * Get a log_item_desc to point at the new item.
> > -	 */
> > +	/* Reset the per-tx dirty context and add the item to the tx. */
> > +	iip->ili_dirty_flags = 0;
> >  	xfs_trans_add_item(tp, &iip->ili_item);
> >  }
> >  
> > @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
> >  /*
> >   * This is called to mark the fields indicated in fieldmask as needing to be
> >   * logged when the transaction is committed.  The inode must already be
> > - * associated with the given transaction.
> > - *
> > - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> > - * of the core inode if any of it has changed, and we always log all of the
> > - * inline data/extents/b-tree root if any of them has changed.
> > - *
> > - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> > - * cycles at inode writeback time. Avoid the need to add error handling to every
> > - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> > - * transactions to fail and everything to error out, just like if we return a
> > - * read error in a dirty transaction and cancel it.
> > + * associated with the given transaction. All we do here is record where the
> > + * inode was dirtied and mark the transaction and inode log item dirty;
> > + * everything else is done in the ->precommit log item operation after the
> > + * changes in the transaction have been completed.
> >   */
> >  void
> >  xfs_trans_log_inode(
> > @@ -96,7 +88,6 @@ xfs_trans_log_inode(
> >  {
> >  	struct xfs_inode_log_item *iip = ip->i_itemp;
> >  	struct inode		*inode = VFS_I(ip);
> > -	uint			iversion_flags = 0;
> >  
> >  	ASSERT(iip);
> >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > @@ -104,18 +95,6 @@ xfs_trans_log_inode(
> >  
> >  	tp->t_flags |= XFS_TRANS_DIRTY;
> >  
> > -	/*
> > -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > -	 * don't matter - we either will need an extra transaction in 24 hours
> > -	 * to log the timestamps, or will clear already cleared fields in the
> > -	 * worst case.
> > -	 */
> > -	if (inode->i_state & I_DIRTY_TIME) {
> > -		spin_lock(&inode->i_lock);
> > -		inode->i_state &= ~I_DIRTY_TIME;
> > -		spin_unlock(&inode->i_lock);
> > -	}
> > -
> >  	/*
> >  	 * First time we log the inode in a transaction, bump the inode change
> >  	 * counter if it is configured for this to occur. While we have the
> > @@ -128,86 +107,12 @@ xfs_trans_log_inode(
> >  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> >  		if (IS_I_VERSION(inode) &&
> >  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > -			iversion_flags = XFS_ILOG_CORE;
> > +			flags |= XFS_ILOG_IVERSION;
> >  	}
> >  
> > -	/*
> > -	 * If we're updating the inode core or the timestamps and it's possible
> > -	 * to upgrade this inode to bigtime format, do so now.
> > -	 */
> > -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > -	    xfs_has_bigtime(ip->i_mount) &&
> > -	    !xfs_inode_has_bigtime(ip)) {
> > -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > -		flags |= XFS_ILOG_CORE;
> > -	}
> > -
> > -	/*
> > -	 * Inode verifiers do not check that the extent size hint is an integer
> > -	 * multiple of the rt extent size on a directory with both rtinherit
> > -	 * and extszinherit flags set.  If we're logging a directory that is
> > -	 * misconfigured in this way, clear the hint.
> > -	 */
> > -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > -				   XFS_DIFLAG_EXTSZINHERIT);
> > -		ip->i_extsize = 0;
> > -		flags |= XFS_ILOG_CORE;
> > -	}
> > -
> > -	/*
> > -	 * Record the specific change for fdatasync optimisation. This allows
> > -	 * fdatasync to skip log forces for inodes that are only timestamp
> > -	 * dirty.
> > -	 */
> > -	spin_lock(&iip->ili_lock);
> > -	iip->ili_fsync_fields |= flags;
> > -
> > -	if (!iip->ili_item.li_buf) {
> > -		struct xfs_buf	*bp;
> > -		int		error;
> > -
> > -		/*
> > -		 * We hold the ILOCK here, so this inode is not going to be
> > -		 * flushed while we are here. Further, because there is no
> > -		 * buffer attached to the item, we know that there is no IO in
> > -		 * progress, so nothing will clear the ili_fields while we read
> > -		 * in the buffer. Hence we can safely drop the spin lock and
> > -		 * read the buffer knowing that the state will not change from
> > -		 * here.
> > -		 */
> > -		spin_unlock(&iip->ili_lock);
> > -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > -		if (error) {
> > -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > -			return;
> > -		}
> > -
> > -		/*
> > -		 * We need an explicit buffer reference for the log item but
> > -		 * don't want the buffer to remain attached to the transaction.
> > -		 * Hold the buffer but release the transaction reference once
> > -		 * we've attached the inode log item to the buffer log item
> > -		 * list.
> > -		 */
> > -		xfs_buf_hold(bp);
> > -		spin_lock(&iip->ili_lock);
> > -		iip->ili_item.li_buf = bp;
> > -		bp->b_flags |= _XBF_INODES;
> > -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > -		xfs_trans_brelse(tp, bp);
> > -	}
> > -
> > -	/*
> > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > -	 */
> > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > -	spin_unlock(&iip->ili_lock);
> > +	iip->ili_dirty_flags |= flags;
> > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > +		ip->i_ino, flags, iip->ili_dirty_flags);
> 
> Urk, leftover debugging info?
> 
> --D
> >  }
> >  
> >  int
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index ca2941ab6cbc..586af11b7cd1 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> >  	return container_of(lip, struct xfs_inode_log_item, ili_item);
> >  }
> >  
> > +static uint64_t
> > +xfs_inode_item_sort(
> > +	struct xfs_log_item	*lip)
> > +{
> > +	return INODE_ITEM(lip)->ili_inode->i_ino;
> > +}
> > +
> > +/*
> > + * Prior to finally logging the inode, we have to ensure that all the
> > + * per-modification inode state changes are applied. This includes VFS inode
> > + * state updates, format conversions, verifier state synchronisation and
> > + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> > + *
> > + * We have to be careful when we grab the inode cluster buffer due to lock
> > + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> > + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> > + * not locked until ->precommit, so it happens after everything else has been
> > + * modified.
> > + *
> > + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> > + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> > + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> > + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> > + * AGF lock modifying directory blocks.
> > + *
> > + * Rather than force a complete rework of all the transactions to call
> > + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> > + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> > + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> > + * ensures that the inode cluster buffer locking is always done last in a
> > + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> > + * cluster buffer.
> > + *
> > + * If we return the inode number as the precommit sort key then we'll also
> > + * guarantee that the order of all inode cluster buffer locking is the same
> > + * across all the inodes and unlink items in the transaction.
> > + */
> > +static int
> > +xfs_inode_item_precommit(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_log_item	*lip)
> > +{
> > +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > +	struct xfs_inode	*ip = iip->ili_inode;
> > +	struct inode		*inode = VFS_I(ip);
> > +	unsigned int		flags = iip->ili_dirty_flags;
> > +
> > +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> > +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> > +	/*
> > +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > +	 * don't matter - we either will need an extra transaction in 24 hours
> > +	 * to log the timestamps, or will clear already cleared fields in the
> > +	 * worst case.
> > +	 */
> > +	if (inode->i_state & I_DIRTY_TIME) {
> > +		spin_lock(&inode->i_lock);
> > +		inode->i_state &= ~I_DIRTY_TIME;
> > +		spin_unlock(&inode->i_lock);
> > +	}
> > +
> > +
> > +	/*
> > +	 * If we're updating the inode core or the timestamps and it's possible
> > +	 * to upgrade this inode to bigtime format, do so now.
> > +	 */
> > +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > +	    xfs_has_bigtime(ip->i_mount) &&
> > +	    !xfs_inode_has_bigtime(ip)) {
> > +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > +		flags |= XFS_ILOG_CORE;
> > +	}
> > +
> > +	/*
> > +	 * Inode verifiers do not check that the extent size hint is an integer
> > +	 * multiple of the rt extent size on a directory with both rtinherit
> > +	 * and extszinherit flags set.  If we're logging a directory that is
> > +	 * misconfigured in this way, clear the hint.
> > +	 */
> > +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > +				   XFS_DIFLAG_EXTSZINHERIT);
> > +		ip->i_extsize = 0;
> > +		flags |= XFS_ILOG_CORE;
> > +	}
> > +
> > +	/*
> > +	 * Record the specific change for fdatasync optimisation. This allows
> > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> > +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> > +	 * (ili_fields) correctly tracks that the version has changed.
> > +	 */
> > +	spin_lock(&iip->ili_lock);
> > +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> > +	if (flags & XFS_ILOG_IVERSION)
> > +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> > +
> > +	if (!iip->ili_item.li_buf) {
> > +		struct xfs_buf	*bp;
> > +		int		error;
> > +
> > +		/*
> > +		 * We hold the ILOCK here, so this inode is not going to be
> > +		 * flushed while we are here. Further, because there is no
> > +		 * buffer attached to the item, we know that there is no IO in
> > +		 * progress, so nothing will clear the ili_fields while we read
> > +		 * in the buffer. Hence we can safely drop the spin lock and
> > +		 * read the buffer knowing that the state will not change from
> > +		 * here.
> > +		 */
> > +		spin_unlock(&iip->ili_lock);
> > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > +		if (error)
> > +			return error;
> > +
> > +		/*
> > +		 * We need an explicit buffer reference for the log item but
> > +		 * don't want the buffer to remain attached to the transaction.
> > +		 * Hold the buffer but release the transaction reference once
> > +		 * we've attached the inode log item to the buffer log item
> > +		 * list.
> > +		 */
> > +		xfs_buf_hold(bp);
> > +		spin_lock(&iip->ili_lock);
> > +		iip->ili_item.li_buf = bp;
> > +		bp->b_flags |= _XBF_INODES;
> > +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > +		xfs_trans_brelse(tp, bp);
> > +	}
> > +
> > +	/*
> > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > +	 */
> > +	iip->ili_fields |= (flags | iip->ili_last_fields);
> > +	spin_unlock(&iip->ili_lock);
> > +
> > +	/*
> > +	 * We are done with the log item transaction dirty state, so clear it so
> > +	 * that it doesn't pollute future transactions.
> > +	 */
> > +	iip->ili_dirty_flags = 0;
> > +	return 0;
> > +}
> > +
> >  /*
> >   * The logged size of an inode fork is always the current size of the inode
> >   * fork. This means that when an inode fork is relogged, the size of the logged
> > @@ -662,6 +812,8 @@ xfs_inode_item_committing(
> >  }
> >  
> >  static const struct xfs_item_ops xfs_inode_item_ops = {
> > +	.iop_sort	= xfs_inode_item_sort,
> > +	.iop_precommit	= xfs_inode_item_precommit,
> >  	.iop_size	= xfs_inode_item_size,
> >  	.iop_format	= xfs_inode_item_format,
> >  	.iop_pin	= xfs_inode_item_pin,
> > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> > index bbd836a44ff0..377e06007804 100644
> > --- a/fs/xfs/xfs_inode_item.h
> > +++ b/fs/xfs/xfs_inode_item.h
> > @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
> >  	struct xfs_log_item	ili_item;	   /* common portion */
> >  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> >  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> > +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
> >  	/*
> >  	 * The ili_lock protects the interactions between the dirty state and
> >  	 * the flush state of the inode log item. This allows us to do atomic
> > -- 
> > 2.40.1
> > 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-06-01  1:51     ` Dave Chinner
@ 2023-06-01 14:38       ` Darrick J. Wong
  0 siblings, 0 replies; 23+ messages in thread
From: Darrick J. Wong @ 2023-06-01 14:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 01, 2023 at 11:51:49AM +1000, Dave Chinner wrote:
> Friendly Ping.
> 
> Apart from the stray trace_printk()s I forgot to remove, are
> there any other problems with this patch I need to fix?

Nothing obvious that I could see.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> -Dave.
> 
> On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> > On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Lock order in XFS is AGI -> AGF, hence for operations involving
> > > inode unlinked list operations we always lock the AGI first. Inode
> > > unlinked list operations operate on the inode cluster buffer,
> > > so the lock order there is AGI -> inode cluster buffer.
> > > 
> > > For O_TMPFILE operations, this now means the lock order set down in
> > > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > > unlinked ops are done before the directory modifications that may
> > > allocate space and lock the AGF.
> > > 
> > > Unfortunately, we also now lock the inode cluster buffer when
> > > logging an inode so that we can attach the inode to the cluster
> > > buffer and pin it in memory. This creates a lock order of AGF ->
> > > inode cluster buffer in directory operations as we have to log the
> > > inode after we've allocated new space for it.
> > > 
> > > This creates a lock inversion between the AGF and the inode cluster
> > > buffer. Because the inode cluster buffer is shared across multiple
> > > inodes, the inversion is not specific to individual inodes but can
> > > occur when inodes in the same cluster buffer are accessed in
> > > different orders.
> > > 
> > > To fix this we need to move all the inode log item cluster buffer
> > > interactions to the end of the current transaction. Unfortunately,
> > > xfs_trans_log_inode() calls are littered throughout the transactions
> > > with no thought to ordering against other items or locking. This
> > > makes it difficult to do anything that involves changing the call
> > > sites of xfs_trans_log_inode() to change locking orders.
> > > 
> > > However, we do now have a mechanism that allows us to postpone dirty
> > > item processing to just before we commit the transaction: the
> > > ->iop_precommit method. This will be called after all the
> > > modifications are done and high level objects like AGI and AGF
> > > buffers have been locked and modified, thereby providing a mechanism
> > > that guarantees we don't lock the inode cluster buffer before those
> > > high level objects are locked.
> > > 
> > > This change is largely moving the guts of xfs_trans_log_inode() to
> > > xfs_inode_item_precommit() and providing an extra flag context in
> > > the inode log item to track the dirty state of the inode in the
> > > current transaction. This also means we do a lot less repeated work
> > > in xfs_trans_log_inode() by only doing it once per transaction when
> > > all the work is done.
> > 
> > Aha, and that's why you moved all the "opportunistically tweak inode
> > metadata while we're already logging it" bits to the precommit hook.
> > 
> > > Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
> > >  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_inode_item.h         |   1 +
> > >  4 files changed, 171 insertions(+), 106 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > > index f13e0809dc63..269573c82808 100644
> > > --- a/fs/xfs/libxfs/xfs_log_format.h
> > > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > > @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
> > >  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
> > >  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
> > >  
> > > -
> > >  /*
> > >   * The timestamps are dirty, but not necessarily anything else in the inode
> > >   * core.  Unlike the other fields above this one must never make it to disk
> > > @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
> > >   */
> > >  #define XFS_ILOG_TIMESTAMP	0x4000
> > >  
> > > +/*
> > > + * The version field has been changed, but not necessarily anything else of
> > > + * interest. This must never make it to disk - it is used purely to ensure that
> > > + * the inode item ->precommit operation can update the fsync flag triggers
> > > + * in the inode item correctly.
> > > + */
> > > +#define XFS_ILOG_IVERSION	0x8000
> > > +
> > >  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
> > >  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
> > >  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..2d164d0588b1 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -40,9 +40,8 @@ xfs_trans_ijoin(
> > >  	iip->ili_lock_flags = lock_flags;
> > >  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
> > >  
> > > -	/*
> > > -	 * Get a log_item_desc to point at the new item.
> > > -	 */
> > > +	/* Reset the per-tx dirty context and add the item to the tx. */
> > > +	iip->ili_dirty_flags = 0;
> > >  	xfs_trans_add_item(tp, &iip->ili_item);
> > >  }
> > >  
> > > @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
> > >  /*
> > >   * This is called to mark the fields indicated in fieldmask as needing to be
> > >   * logged when the transaction is committed.  The inode must already be
> > > - * associated with the given transaction.
> > > - *
> > > - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> > > - * of the core inode if any of it has changed, and we always log all of the
> > > - * inline data/extents/b-tree root if any of them has changed.
> > > - *
> > > - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> > > - * cycles at inode writeback time. Avoid the need to add error handling to every
> > > - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> > > - * transactions to fail and everything to error out, just like if we return a
> > > - * read error in a dirty transaction and cancel it.
> > > + * associated with the given transaction. All we do here is record where the
> > > + * inode was dirtied and mark the transaction and inode log item dirty;
> > > + * everything else is done in the ->precommit log item operation after the
> > > + * changes in the transaction have been completed.
> > >   */
> > >  void
> > >  xfs_trans_log_inode(
> > > @@ -96,7 +88,6 @@ xfs_trans_log_inode(
> > >  {
> > >  	struct xfs_inode_log_item *iip = ip->i_itemp;
> > >  	struct inode		*inode = VFS_I(ip);
> > > -	uint			iversion_flags = 0;
> > >  
> > >  	ASSERT(iip);
> > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > @@ -104,18 +95,6 @@ xfs_trans_log_inode(
> > >  
> > >  	tp->t_flags |= XFS_TRANS_DIRTY;
> > >  
> > > -	/*
> > > -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > -	 * don't matter - we either will need an extra transaction in 24 hours
> > > -	 * to log the timestamps, or will clear already cleared fields in the
> > > -	 * worst case.
> > > -	 */
> > > -	if (inode->i_state & I_DIRTY_TIME) {
> > > -		spin_lock(&inode->i_lock);
> > > -		inode->i_state &= ~I_DIRTY_TIME;
> > > -		spin_unlock(&inode->i_lock);
> > > -	}
> > > -
> > >  	/*
> > >  	 * First time we log the inode in a transaction, bump the inode change
> > >  	 * counter if it is configured for this to occur. While we have the
> > > @@ -128,86 +107,12 @@ xfs_trans_log_inode(
> > >  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > >  		if (IS_I_VERSION(inode) &&
> > >  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > -			iversion_flags = XFS_ILOG_CORE;
> > > +			flags |= XFS_ILOG_IVERSION;
> > >  	}
> > >  
> > > -	/*
> > > -	 * If we're updating the inode core or the timestamps and it's possible
> > > -	 * to upgrade this inode to bigtime format, do so now.
> > > -	 */
> > > -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > -	    xfs_has_bigtime(ip->i_mount) &&
> > > -	    !xfs_inode_has_bigtime(ip)) {
> > > -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Inode verifiers do not check that the extent size hint is an integer
> > > -	 * multiple of the rt extent size on a directory with both rtinherit
> > > -	 * and extszinherit flags set.  If we're logging a directory that is
> > > -	 * misconfigured in this way, clear the hint.
> > > -	 */
> > > -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > -				   XFS_DIFLAG_EXTSZINHERIT);
> > > -		ip->i_extsize = 0;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Record the specific change for fdatasync optimisation. This allows
> > > -	 * fdatasync to skip log forces for inodes that are only timestamp
> > > -	 * dirty.
> > > -	 */
> > > -	spin_lock(&iip->ili_lock);
> > > -	iip->ili_fsync_fields |= flags;
> > > -
> > > -	if (!iip->ili_item.li_buf) {
> > > -		struct xfs_buf	*bp;
> > > -		int		error;
> > > -
> > > -		/*
> > > -		 * We hold the ILOCK here, so this inode is not going to be
> > > -		 * flushed while we are here. Further, because there is no
> > > -		 * buffer attached to the item, we know that there is no IO in
> > > -		 * progress, so nothing will clear the ili_fields while we read
> > > -		 * in the buffer. Hence we can safely drop the spin lock and
> > > -		 * read the buffer knowing that the state will not change from
> > > -		 * here.
> > > -		 */
> > > -		spin_unlock(&iip->ili_lock);
> > > -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > -		if (error) {
> > > -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > > -			return;
> > > -		}
> > > -
> > > -		/*
> > > -		 * We need an explicit buffer reference for the log item but
> > > -		 * don't want the buffer to remain attached to the transaction.
> > > -		 * Hold the buffer but release the transaction reference once
> > > -		 * we've attached the inode log item to the buffer log item
> > > -		 * list.
> > > -		 */
> > > -		xfs_buf_hold(bp);
> > > -		spin_lock(&iip->ili_lock);
> > > -		iip->ili_item.li_buf = bp;
> > > -		bp->b_flags |= _XBF_INODES;
> > > -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > -		xfs_trans_brelse(tp, bp);
> > > -	}
> > > -
> > > -	/*
> > > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > -	 */
> > > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > > -	spin_unlock(&iip->ili_lock);
> > > +	iip->ili_dirty_flags |= flags;
> > > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > > +		ip->i_ino, flags, iip->ili_dirty_flags);
> > 
> > Urk, leftover debugging info?
> > 
> > --D
> > >  }
> > >  
> > >  int
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index ca2941ab6cbc..586af11b7cd1 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> > >  	return container_of(lip, struct xfs_inode_log_item, ili_item);
> > >  }
> > >  
> > > +static uint64_t
> > > +xfs_inode_item_sort(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	return INODE_ITEM(lip)->ili_inode->i_ino;
> > > +}
> > > +
> > > +/*
> > > + * Prior to finally logging the inode, we have to ensure that all the
> > > + * per-modification inode state changes are applied. This includes VFS inode
> > > + * state updates, format conversions, verifier state synchronisation and
> > > + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> > > + *
> > > + * We have to be careful when we grab the inode cluster buffer due to lock
> > > + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> > > + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> > > + * not locked until ->precommit, so it happens after everything else has been
> > > + * modified.
> > > + *
> > > + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> > > + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> > > + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> > > + * it can be called on an inode (e.g. via bumplink/droplink) before we take the
> > > + * AGF lock modifying directory blocks.
> > > + *
> > > + * Rather than force a complete rework of all the transactions to call
> > > + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> > > + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> > > + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> > > + * ensures that the inode cluster buffer locking is always done last in a
> > > + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> > > + * cluster buffer.
> > > + *
> > > + * If we return the inode number as the precommit sort key then we'll also
> > > + * guarantee that the order of all inode cluster buffer locking is the same
> > > + * for all the inodes and unlink items in the transaction.
> > > + */
> > > +static int
> > > +xfs_inode_item_precommit(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > > +	struct xfs_inode	*ip = iip->ili_inode;
> > > +	struct inode		*inode = VFS_I(ip);
> > > +	unsigned int		flags = iip->ili_dirty_flags;
> > > +
> > > +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> > > +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> > > +	/*
> > > +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > +	 * don't matter - we either will need an extra transaction in 24 hours
> > > +	 * to log the timestamps, or will clear already cleared fields in the
> > > +	 * worst case.
> > > +	 */
> > > +	if (inode->i_state & I_DIRTY_TIME) {
> > > +		spin_lock(&inode->i_lock);
> > > +		inode->i_state &= ~I_DIRTY_TIME;
> > > +		spin_unlock(&inode->i_lock);
> > > +	}
> > > +
> > > +
> > > +	/*
> > > +	 * If we're updating the inode core or the timestamps and it's possible
> > > +	 * to upgrade this inode to bigtime format, do so now.
> > > +	 */
> > > +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > +	    xfs_has_bigtime(ip->i_mount) &&
> > > +	    !xfs_inode_has_bigtime(ip)) {
> > > +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Inode verifiers do not check that the extent size hint is an integer
> > > +	 * multiple of the rt extent size on a directory with both rtinherit
> > > +	 * and extszinherit flags set.  If we're logging a directory that is
> > > +	 * misconfigured in this way, clear the hint.
> > > +	 */
> > > +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > +				   XFS_DIFLAG_EXTSZINHERIT);
> > > +		ip->i_extsize = 0;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Record the specific change for fdatasync optimisation. This allows
> > > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > > +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> > > +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> > > +	 * (ili_fields) correctly tracks that the version has changed.
> > > +	 */
> > > +	spin_lock(&iip->ili_lock);
> > > +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> > > +	if (flags & XFS_ILOG_IVERSION)
> > > +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> > > +
> > > +	if (!iip->ili_item.li_buf) {
> > > +		struct xfs_buf	*bp;
> > > +		int		error;
> > > +
> > > +		/*
> > > +		 * We hold the ILOCK here, so this inode is not going to be
> > > +		 * flushed while we are here. Further, because there is no
> > > +		 * buffer attached to the item, we know that there is no IO in
> > > +		 * progress, so nothing will clear the ili_fields while we read
> > > +		 * in the buffer. Hence we can safely drop the spin lock and
> > > +		 * read the buffer knowing that the state will not change from
> > > +		 * here.
> > > +		 */
> > > +		spin_unlock(&iip->ili_lock);
> > > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > +		if (error)
> > > +			return error;
> > > +
> > > +		/*
> > > +		 * We need an explicit buffer reference for the log item but
> > > +		 * don't want the buffer to remain attached to the transaction.
> > > +		 * Hold the buffer but release the transaction reference once
> > > +		 * we've attached the inode log item to the buffer log item
> > > +		 * list.
> > > +		 */
> > > +		xfs_buf_hold(bp);
> > > +		spin_lock(&iip->ili_lock);
> > > +		iip->ili_item.li_buf = bp;
> > > +		bp->b_flags |= _XBF_INODES;
> > > +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > +		xfs_trans_brelse(tp, bp);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > +	 */
> > > +	iip->ili_fields |= (flags | iip->ili_last_fields);
> > > +	spin_unlock(&iip->ili_lock);
> > > +
> > > +	/*
> > > +	 * We are done with the log item transaction dirty state, so clear it so
> > > +	 * that it doesn't pollute future transactions.
> > > +	 */
> > > +	iip->ili_dirty_flags = 0;
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * The logged size of an inode fork is always the current size of the inode
> > >   * fork. This means that when an inode fork is relogged, the size of the logged
> > > @@ -662,6 +812,8 @@ xfs_inode_item_committing(
> > >  }
> > >  
> > >  static const struct xfs_item_ops xfs_inode_item_ops = {
> > > +	.iop_sort	= xfs_inode_item_sort,
> > > +	.iop_precommit	= xfs_inode_item_precommit,
> > >  	.iop_size	= xfs_inode_item_size,
> > >  	.iop_format	= xfs_inode_item_format,
> > >  	.iop_pin	= xfs_inode_item_pin,
> > > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> > > index bbd836a44ff0..377e06007804 100644
> > > --- a/fs/xfs/xfs_inode_item.h
> > > +++ b/fs/xfs/xfs_inode_item.h
> > > @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
> > >  	struct xfs_log_item	ili_item;	   /* common portion */
> > >  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> > >  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> > > +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
> > >  	/*
> > >  	 * The ili_lock protects the interactions between the dirty state and
> > >  	 * the flush state of the inode log item. This allows us to do atomic
> > > -- 
> > > 2.40.1
> > > 
> > 
> 
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 3/4] xfs: deferred work could create precommits
  2023-05-17  0:04 ` [PATCH 3/4] xfs: deferred work could create precommits Dave Chinner
  2023-05-17  1:20   ` Darrick J. Wong
@ 2023-06-01 15:01   ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01 15:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
  2023-05-17  1:26   ` Darrick J. Wong
@ 2023-06-01 15:12   ` Christoph Hellwig
  2023-06-25  2:58   ` Matthew Wilcox
  2 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2023-06-01 15:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

[apparently your gmail server decided my previous reply was spam, so
 I'm not sure this reaches you]

This looks good minus the left over trace_printk statements.

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
  2023-05-17  1:26   ` Darrick J. Wong
  2023-06-01 15:12   ` Christoph Hellwig
@ 2023-06-25  2:58   ` Matthew Wilcox
  2023-06-25 22:34     ` Dave Chinner
  2 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2023-06-25  2:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> Lock order in XFS is AGI -> AGF, hence for operations involving
> inode unlinked list operations we always lock the AGI first. Inode
> unlinked list operations operate on the inode cluster buffer,
> so the lock order there is AGI -> inode cluster buffer.

Hi Dave,

This commit reliably produces an assertion failure for me.  I haven't
tried to analyse why.  It's pretty clear though; I can run generic/426
in a loop for hundreds of seconds on the parent commit (cb042117488d),
but it'll die within 30 seconds on commit 82842fee6e59.

    export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

I suspect the size=1024 is the important thing, but I haven't tested
that hypothesis.  This is on an x86-64 virtual machine; full qemu
command line at the end [1]

00028 FSTYP         -- xfs (debug)
00028 PLATFORM      -- Linux/x86_64 pepe-kvm 6.4.0-rc5-00004-g82842fee6e59 #182 SMP PREEMPT_DYNAMIC Sat Jun 24 22:51:32 EDT 2023
00028 MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
00028 MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
00028
00028 XFS (sdc): Mounting V5 Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
00028 XFS (sdc): Ending clean mount
00028 XFS (sdc): Unmounting Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
00028 XFS (sdb): EXPERIMENTAL online scrub feature in use. Use at your own risk!
00028 XFS (sdb): Unmounting Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
00028 XFS (sdb): Mounting V5 Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
00028 XFS (sdb): Ending clean mount
00028 generic/426       run fstests generic/426 at 2023-06-25 02:52:07
00029 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
00029 ------------[ cut here ]------------
00029 kernel BUG at fs/xfs/xfs_message.c:102!
00029 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
00029 CPU: 1 PID: 62 Comm: kworker/1:1 Kdump: loaded Not tainted 6.4.0-rc5-00004-g82842fee6e59 #182
00029 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
00029 Workqueue: xfs-inodegc/sdb xfs_inodegc_worker
00029 RIP: 0010:assfail+0x30/0x40
00029 Code: c9 48 c7 c2 48 f8 ea 81 48 89 f1 48 89 e5 48 89 fe 48 c7 c7 b9 cc e5 81 e8 fd fd ff ff 80 3d f6 2f d3 00 00 75 04 0f 0b 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 63 f6 49 89
00029 RSP: 0018:ffff88800317bc78 EFLAGS: 00010202
00029 RAX: 00000000ffffffea RBX: ffff88800611e000 RCX: 000000007fffffff
00029 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffffffff81e5ccb9
00029 RBP: ffff88800317bc78 R08: 0000000000000000 R09: 000000000000000a
00029 R10: 000000000000000a R11: 0fffffffffffffff R12: ffff88800c780800
00029 R13: ffff88800317bce0 R14: 0000000000000001 R15: ffff88800c73d000
00029 FS:  0000000000000000(0000) GS:ffff88807d840000(0000) knlGS:0000000000000000
00029 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
00029 CR2: 00005623b1911068 CR3: 000000000ee28003 CR4: 0000000000770ea0
00029 PKRU: 55555554
00029 Call Trace:
00029  <TASK>
00029  ? show_regs+0x5c/0x70
00029  ? die+0x32/0x90
00029  ? do_trap+0xbb/0xe0
00029  ? do_error_trap+0x67/0x90
00029  ? assfail+0x30/0x40
00029  ? exc_invalid_op+0x52/0x70
00029  ? assfail+0x30/0x40
00029  ? asm_exc_invalid_op+0x1b/0x20
00029  ? assfail+0x30/0x40
00029  ? assfail+0x23/0x40
00029  xfs_trans_read_buf_map+0x2d9/0x480
00029  xfs_imap_to_bp+0x3d/0x40
00029  xfs_inode_item_precommit+0x176/0x200
00029  xfs_trans_run_precommits+0x65/0xc0
00029  __xfs_trans_commit+0x3d/0x360
00029  xfs_trans_commit+0xb/0x10
00029  xfs_inactive_ifree.isra.0+0xea/0x200
00029  xfs_inactive+0x132/0x230
00029  xfs_inodegc_worker+0xb6/0x1a0
00029  process_one_work+0x1a9/0x3a0
00029  worker_thread+0x4e/0x3a0
00029  ? process_one_work+0x3a0/0x3a0
00029  kthread+0xf9/0x130

In case things have moved around since that commit, the particular line
throwing the assertion is in this paragraph:

        if (bp) {
                ASSERT(xfs_buf_islocked(bp));
                ASSERT(bp->b_transp == tp);
                ASSERT(bp->b_log_item != NULL);
                ASSERT(!bp->b_error);
                ASSERT(bp->b_flags & XBF_DONE);

It's the last one that trips.  Sorry for not catching this earlier; my
test suite experienced a bit of a failure and I only just got around to
fixing it enough to run all the way through.

[1] qemu-system-x86_64 -nodefaults -nographic -cpu host -machine type=q35,accel=kvm,nvdimm=on -m 2G,slots=8,maxmem=256G -smp 8 -kernel /home/willy/kernel/linux-next/.build_test_kernel-x86_64/kpgk/vmlinuz -append mitigations=off console=hvc0 root=/dev/sda rw log_buf_len=8M ktest.dir=/home/willy/kernel/ktest ktest.env=/tmp/build-test-kernel-FzOfFCHDVD/env crashkernel=128M no_console_suspend page_owner=on -device virtio-serial -chardev stdio,id=console -device virtconsole,chardev=console -serial unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-kgdb,server,nowait -monitor unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-mon,server,nowait -gdb unix:/tmp/build-test-kernel-FzOfFCHDVD/vm-gdb,server,nowait -device virtio-rng-pci -virtfs local,path=/,mount_tag=host,security_model=none -device virtio-scsi-pci,id=hba -nic user,model=virtio,hostfwd=tcp:127.0.0.1:28201-:22 -drive if=none,format=raw,id=disk0,file=/var/lib/ktest/root.amd64,snapshot=on -device scsi-hd,bus=hba.0,drive=disk0 -drive if=none,format=raw,id=disk1,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-1,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk1 -drive if=none,format=raw,id=disk2,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-2,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk2 -drive if=none,format=raw,id=disk3,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-3,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk3 -drive if=none,format=raw,id=disk4,file=/tmp/build-test-kernel-FzOfFCHDVD/dev-4,cache=unsafe -device scsi-hd,bus=hba.0,drive=disk4


* Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock
  2023-06-25  2:58   ` Matthew Wilcox
@ 2023-06-25 22:34     ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2023-06-25 22:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-xfs

On Sun, Jun 25, 2023 at 03:58:15AM +0100, Matthew Wilcox wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> 
> Hi Dave,
> 
> This commit reliably produces an assertion failure for me.  I haven't
> tried to analyse why.  It's pretty clear though; I can run generic/426
> in a loop for hundreds of seconds on the parent commit (cb042117488d),
> but it'll die within 30 seconds on commit 82842fee6e59.
> 
>     export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

That's part of my regular regression test config (mkfs defaults w/
1kB block size), and I haven't seen this problem.

I've just kicked off an iteration of g/426 on a couple of machines,
on a current TOT, and they are already a couple of hundred
iterations in without a failure....

Can you grab a trace for me? i.e. run

# trace-cmd record -e xfs\* -e printk

in one shell and leave it running. Then in another shell run the
test. When the test fails, Ctrl-C the trace-cmd process, and send me
the output of

# trace-cmd report > report.txt

> I suspect the size=1024 is the important thing, but I haven't tested
> that hypothesis.  This is on an x86-64 virtual machine; full qemu
> command line at the end [1]

As it's an inode cluster buffer failure, I very much doubt
filesystem block size plays a part. Inode buffer size is defined by
inode size, not filesystem block size, so the buffer in question
will be a 16kB unmapped buffer because inode size is 512 bytes...

> 00028 FSTYP         -- xfs (debug)
> 00028 PLATFORM      -- Linux/x86_64 pepe-kvm 6.4.0-rc5-00004-g82842fee6e59 #182 SMP PREEMPT_DYNAMIC Sat Jun 24 22:51:32 EDT 2023
> 00028 MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> 00028 MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
> 00028
> 00028 XFS (sdc): Mounting V5 Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
> 00028 XFS (sdc): Ending clean mount
> 00028 XFS (sdc): Unmounting Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
> 00028 XFS (sdb): EXPERIMENTAL online scrub feature in use. Use at your own risk!
> 00028 XFS (sdb): Unmounting Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
> 00028 XFS (sdb): Mounting V5 Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
> 00028 XFS (sdb): Ending clean mount
> 00028 generic/426       run fstests generic/426 at 2023-06-25 02:52:07
> 00029 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
> 00029 ------------[ cut here ]------------
> 00029 kernel BUG at fs/xfs/xfs_message.c:102!
> 00029 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> 00029 CPU: 1 PID: 62 Comm: kworker/1:1 Kdump: loaded Not tainted 6.4.0-rc5-00004-g82842fee6e59 #182
> 00029 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> 00029 Workqueue: xfs-inodegc/sdb xfs_inodegc_worker
> 00029 RIP: 0010:assfail+0x30/0x40
> 00029 Code: c9 48 c7 c2 48 f8 ea 81 48 89 f1 48 89 e5 48 89 fe 48 c7 c7 b9 cc e5 81 e8 fd fd ff ff 80 3d f6 2f d3 00 00 75 04 0f 0b 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 63 f6 49 89
> 00029 RSP: 0018:ffff88800317bc78 EFLAGS: 00010202
> 00029 RAX: 00000000ffffffea RBX: ffff88800611e000 RCX: 000000007fffffff
> 00029 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffffffff81e5ccb9
> 00029 RBP: ffff88800317bc78 R08: 0000000000000000 R09: 000000000000000a
> 00029 R10: 000000000000000a R11: 0fffffffffffffff R12: ffff88800c780800
> 00029 R13: ffff88800317bce0 R14: 0000000000000001 R15: ffff88800c73d000
> 00029 FS:  0000000000000000(0000) GS:ffff88807d840000(0000) knlGS:0000000000000000
> 00029 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 00029 CR2: 00005623b1911068 CR3: 000000000ee28003 CR4: 0000000000770ea0
> 00029 PKRU: 55555554
> 00029 Call Trace:
> 00029  <TASK>
> 00029  ? show_regs+0x5c/0x70
> 00029  ? die+0x32/0x90
> 00029  ? do_trap+0xbb/0xe0
> 00029  ? do_error_trap+0x67/0x90
> 00029  ? assfail+0x30/0x40
> 00029  ? exc_invalid_op+0x52/0x70
> 00029  ? assfail+0x30/0x40
> 00029  ? asm_exc_invalid_op+0x1b/0x20
> 00029  ? assfail+0x30/0x40
> 00029  ? assfail+0x23/0x40
> 00029  xfs_trans_read_buf_map+0x2d9/0x480
> 00029  xfs_imap_to_bp+0x3d/0x40
> 00029  xfs_inode_item_precommit+0x176/0x200
> 00029  xfs_trans_run_precommits+0x65/0xc0
> 00029  __xfs_trans_commit+0x3d/0x360
> 00029  xfs_trans_commit+0xb/0x10
> 00029  xfs_inactive_ifree.isra.0+0xea/0x200
> 00029  xfs_inactive+0x132/0x230
> 00029  xfs_inodegc_worker+0xb6/0x1a0
> 00029  process_one_work+0x1a9/0x3a0
> 00029  worker_thread+0x4e/0x3a0
> 00029  ? process_one_work+0x3a0/0x3a0
> 00029  kthread+0xf9/0x130
> 
> In case things have moved around since that commit, the particular line
> throwing the assertion is in this paragraph:
> 
>         if (bp) {
>                 ASSERT(xfs_buf_islocked(bp));
>                 ASSERT(bp->b_transp == tp);
>                 ASSERT(bp->b_log_item != NULL);
>                 ASSERT(!bp->b_error);
>                 ASSERT(bp->b_flags & XBF_DONE);

Nothing immediately obvious stands out here. 

I suspect it may be an interaction with memory reclaim: the inode
cluster buffer is freed while clean after the inode has been
brought into memory, then xfs_ifree_cluster uses
xfs_trans_get_buf() to invalidate the inode cluster (bringing the
buffer back into memory without reading its contents), and then we
try to read it, find it already linked into the transaction, and
skip the buffer cache lookup that would have read the data
in....

The trace will tell me if this is roughly what is happening.
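
The suspected sequence above can be sketched as a toy model. This is
plain Python, not XFS code; the names trans_get_buf/trans_read_buf and
the done/transp fields only loosely mirror the real xfs_trans_get_buf()
/ xfs_trans_read_buf_map() behaviour and the XBF_DONE / b_transp state,
purely to illustrate how a get-then-read on the same transaction can
trip the assertion:

```python
# Toy model of the suspected buffer state machine -- NOT XFS code.

class Buf:
    def __init__(self):
        self.done = False       # models XBF_DONE: contents are valid
        self.transp = None      # models b_transp: owning transaction

def trans_get_buf(tp, cache, key):
    # get_buf-style lookup: creates the buffer and joins it to the
    # transaction WITHOUT reading it, so "done" stays unset on a
    # freshly created (e.g. reclaimed-and-recreated) buffer.
    buf = cache.setdefault(key, Buf())
    buf.transp = tp
    return buf

def trans_read_buf(tp, cache, key):
    buf = cache.get(key)
    if buf is not None and buf.transp is tp:
        # Already linked into this transaction: the cache lookup --
        # and hence the physical read -- is skipped, and this is
        # where ASSERT(bp->b_flags & XBF_DONE) would fire.
        assert buf.done, "buffer joined to transaction but never read"
        return buf
    buf = cache.setdefault(key, Buf())
    buf.done = True             # reading from disk populates the buffer
    buf.transp = tp
    return buf
```

In this model, a plain read marks the buffer done, but a get_buf
followed by a read on the same transaction hits the assertion, matching
the hypothesised reclaim + xfs_ifree_cluster interaction.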

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-06-25 22:34 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-17  0:04 xfs: bug fixes for 6.4-rcX Dave Chinner
2023-05-17  0:04 ` [PATCH 1/4] xfs: buffer pins need to hold a buffer reference Dave Chinner
2023-05-17  1:26   ` Darrick J. Wong
2023-05-17 12:58   ` Christoph Hellwig
2023-05-17 22:24     ` Dave Chinner
2023-05-17  0:04 ` [PATCH 2/4] xfs: restore allocation trylock iteration Dave Chinner
2023-05-17  1:11   ` Darrick J. Wong
2023-05-17 12:59   ` Christoph Hellwig
2023-05-17  0:04 ` [PATCH 3/4] xfs: defered work could create precommits Dave Chinner
2023-05-17  1:20   ` Darrick J. Wong
2023-05-17  1:44     ` Dave Chinner
2023-06-01 15:01   ` Christoph Hellwig
2023-05-17  0:04 ` [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock Dave Chinner
2023-05-17  1:26   ` Darrick J. Wong
2023-05-17  1:47     ` Dave Chinner
2023-06-01  1:51     ` Dave Chinner
2023-06-01 14:38       ` Darrick J. Wong
2023-06-01 15:12   ` Christoph Hellwig
2023-06-25  2:58   ` Matthew Wilcox
2023-06-25 22:34     ` Dave Chinner
2023-05-17  7:07 ` xfs: bug fixes for 6.4-rcX Amir Goldstein
2023-05-17 11:34   ` Dave Chinner
2023-05-17 12:48     ` Amir Goldstein
