* fix recovery of allocator ops after a growfs
@ 2024-09-30 16:41 Christoph Hellwig
2024-09-30 16:41 ` [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
` (6 more replies)
0 siblings, 7 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Hi all,
auditing the perag code for the generic groups feature found an issue
where recovery of an extfree intent without a logged done entry will
fail when the log also contains the transaction that added the AG
that the extent is freed to, because the file system geometry in the
superblock is only updated and the perag structures are only
created after log recovery has finished.
This version now also ensures the transactions using the new AGs
are not in the same CIL checkpoint as the growfs transaction.
Diffstat:
libxfs/xfs_ag.c | 69 +++-----------
libxfs/xfs_ag.h | 10 +-
libxfs/xfs_ag_resv.c | 18 +--
libxfs/xfs_ialloc.c | 14 +-
libxfs/xfs_log_recover.h | 2
libxfs/xfs_rtbitmap.c | 3
libxfs/xfs_sb.c | 97 +++++++++++++++----
libxfs/xfs_sb.h | 3
libxfs/xfs_shared.h | 18 ---
scrub/rtbitmap_repair.c | 26 ++---
xfs_buf_item_recover.c | 27 +++++
xfs_fsops.c | 102 ++++++++++++--------
xfs_log_recover.c | 30 ++++--
xfs_mount.c | 9 -
xfs_rtalloc.c | 98 ++++++++++---------
xfs_trans.c | 231 ++++++++++++-----------------------------------
xfs_trans.h | 15 +--
xfs_trans_dquot.c | 2
18 files changed, 368 insertions(+), 406 deletions(-)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-10-10 14:02 ` Brian Foster
2024-09-30 16:41 ` [PATCH 2/7] xfs: merge the perag freeing helpers Christoph Hellwig
` (5 subsequent siblings)
6 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling. Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined. That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ag.c | 29 ++++++++---------------------
fs/xfs/libxfs/xfs_ag.h | 5 +++--
fs/xfs/xfs_fsops.c | 18 ++++++++----------
fs/xfs/xfs_log_recover.c | 5 +++--
fs/xfs/xfs_mount.c | 4 ++--
5 files changed, 24 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 5f0494702e0b55..652376aa52e990 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -296,27 +296,19 @@ xfs_free_unused_perag_range(
int
xfs_initialize_perag(
struct xfs_mount *mp,
- xfs_agnumber_t agcount,
+ xfs_agnumber_t old_agcount,
+ xfs_agnumber_t new_agcount,
xfs_rfsblock_t dblocks,
xfs_agnumber_t *maxagi)
{
struct xfs_perag *pag;
xfs_agnumber_t index;
- xfs_agnumber_t first_initialised = NULLAGNUMBER;
int error;
- /*
- * Walk the current per-ag tree so we don't try to initialise AGs
- * that already exist (growfs case). Allocate and insert all the
- * AGs we don't find ready for initialisation.
- */
- for (index = 0; index < agcount; index++) {
- pag = xfs_perag_get(mp, index);
- if (pag) {
- xfs_perag_put(pag);
- continue;
- }
+ if (old_agcount >= new_agcount)
+ return 0;
+ for (index = old_agcount; index < new_agcount; index++) {
pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!pag) {
error = -ENOMEM;
@@ -353,21 +345,17 @@ xfs_initialize_perag(
/* Active ref owned by mount indicates AG is online. */
atomic_set(&pag->pag_active_ref, 1);
- /* first new pag is fully initialized */
- if (first_initialised == NULLAGNUMBER)
- first_initialised = index;
-
/*
* Pre-calculated geometry
*/
- pag->block_count = __xfs_ag_block_count(mp, index, agcount,
+ pag->block_count = __xfs_ag_block_count(mp, index, new_agcount,
dblocks);
pag->min_block = XFS_AGFL_BLOCK(mp);
__xfs_agino_range(mp, pag->block_count, &pag->agino_min,
&pag->agino_max);
}
- index = xfs_set_inode_alloc(mp, agcount);
+ index = xfs_set_inode_alloc(mp, new_agcount);
if (maxagi)
*maxagi = index;
@@ -381,8 +369,7 @@ xfs_initialize_perag(
out_free_pag:
kfree(pag);
out_unwind_new_pags:
- /* unwind any prior newly initialized pags */
- xfs_free_unused_perag_range(mp, first_initialised, agcount);
+ xfs_free_unused_perag_range(mp, old_agcount, index);
return error;
}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index d9cccd093b60e0..69fc31e7b84728 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -146,8 +146,9 @@ __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
xfs_agnumber_t agend);
-int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t agcount,
- xfs_rfsblock_t dcount, xfs_agnumber_t *maxagi);
+int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
+ xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
+ xfs_agnumber_t *maxagi);
int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
void xfs_free_perag(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3643cc843f6271..de2bf0594cb474 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -87,6 +87,7 @@ xfs_growfs_data_private(
struct xfs_mount *mp, /* mount point for filesystem */
struct xfs_growfs_data *in) /* growfs data input struct */
{
+ xfs_agnumber_t oagcount = mp->m_sb.sb_agcount;
struct xfs_buf *bp;
int error;
xfs_agnumber_t nagcount;
@@ -94,7 +95,6 @@ xfs_growfs_data_private(
xfs_rfsblock_t nb, nb_div, nb_mod;
int64_t delta;
bool lastag_extended = false;
- xfs_agnumber_t oagcount;
struct xfs_trans *tp;
struct aghdr_init_data id = {};
struct xfs_perag *last_pag;
@@ -138,16 +138,14 @@ xfs_growfs_data_private(
if (delta == 0)
return 0;
- oagcount = mp->m_sb.sb_agcount;
- /* allocate the new per-ag structures */
- if (nagcount > oagcount) {
- error = xfs_initialize_perag(mp, nagcount, nb, &nagimax);
- if (error)
- return error;
- } else if (nagcount < oagcount) {
- /* TODO: shrinking the entire AGs hasn't yet completed */
+ /* TODO: shrinking the entire AGs hasn't yet completed */
+ if (nagcount < oagcount)
return -EINVAL;
- }
+
+ /* allocate the new per-ag structures */
+ error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax);
+ if (error)
+ return error;
if (delta > 0)
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ec766b4bc8537b..6a165ca55da1a8 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3346,6 +3346,7 @@ xlog_do_recover(
struct xfs_mount *mp = log->l_mp;
struct xfs_buf *bp = mp->m_sb_bp;
struct xfs_sb *sbp = &mp->m_sb;
+ xfs_agnumber_t old_agcount = sbp->sb_agcount;
int error;
trace_xfs_log_recover(log, head_blk, tail_blk);
@@ -3393,8 +3394,8 @@ xlog_do_recover(
/* re-initialise in-core superblock and geometry structures */
mp->m_features |= xfs_sb_version_to_features(sbp);
xfs_reinit_percpu_counters(mp);
- error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
- &mp->m_maxagi);
+ error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
+ sbp->sb_dblocks, &mp->m_maxagi);
if (error) {
xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
return error;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 1fdd79c5bfa04e..6fa7239a4a01b6 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -810,8 +810,8 @@ xfs_mountfs(
/*
* Allocate and initialize the per-ag data.
*/
- error = xfs_initialize_perag(mp, sbp->sb_agcount, mp->m_sb.sb_dblocks,
- &mp->m_maxagi);
+ error = xfs_initialize_perag(mp, 0, sbp->sb_agcount,
+ mp->m_sb.sb_dblocks, &mp->m_maxagi);
if (error) {
xfs_warn(mp, "Failed per-ag init: %d", error);
goto out_free_dir;
--
2.45.2
* [PATCH 2/7] xfs: merge the perag freeing helpers
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
2024-09-30 16:41 ` [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-10-10 14:02 ` Brian Foster
2024-09-30 16:41 ` [PATCH 3/7] xfs: update the file system geometry after recoverying superblock buffers Christoph Hellwig
` (4 subsequent siblings)
6 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases. Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.
The additional RCU grace period for the error case is harmless, and the
extra check that the AG actually exists is not required now that the
callers pass the exact known allocated range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ag.c | 40 ++++++++++------------------------------
fs/xfs/libxfs/xfs_ag.h | 5 ++---
fs/xfs/xfs_fsops.c | 2 +-
fs/xfs/xfs_mount.c | 5 ++---
4 files changed, 15 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 652376aa52e990..8fac0ce45b1559 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -185,17 +185,20 @@ xfs_initialize_perag_data(
}
/*
- * Free up the per-ag resources associated with the mount structure.
+ * Free up the per-ag resources within the specified AG range.
*/
void
-xfs_free_perag(
- struct xfs_mount *mp)
+xfs_free_perag_range(
+ struct xfs_mount *mp,
+ xfs_agnumber_t first_agno,
+ xfs_agnumber_t end_agno)
+
{
- struct xfs_perag *pag;
xfs_agnumber_t agno;
- for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
- pag = xa_erase(&mp->m_perags, agno);
+ for (agno = first_agno; agno < end_agno; agno++) {
+ struct xfs_perag *pag = xa_erase(&mp->m_perags, agno);
+
ASSERT(pag);
XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
xfs_defer_drain_free(&pag->pag_intents_drain);
@@ -270,29 +273,6 @@ xfs_agino_range(
return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
}
-/*
- * Free perag within the specified AG range, it is only used to free unused
- * perags under the error handling path.
- */
-void
-xfs_free_unused_perag_range(
- struct xfs_mount *mp,
- xfs_agnumber_t agstart,
- xfs_agnumber_t agend)
-{
- struct xfs_perag *pag;
- xfs_agnumber_t index;
-
- for (index = agstart; index < agend; index++) {
- pag = xa_erase(&mp->m_perags, index);
- if (!pag)
- break;
- xfs_buf_cache_destroy(&pag->pag_bcache);
- xfs_defer_drain_free(&pag->pag_intents_drain);
- kfree(pag);
- }
-}
-
int
xfs_initialize_perag(
struct xfs_mount *mp,
@@ -369,7 +349,7 @@ xfs_initialize_perag(
out_free_pag:
kfree(pag);
out_unwind_new_pags:
- xfs_free_unused_perag_range(mp, old_agcount, index);
+ xfs_free_perag_range(mp, old_agcount, index);
return error;
}
diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
index 69fc31e7b84728..6e68d6a3161a0f 100644
--- a/fs/xfs/libxfs/xfs_ag.h
+++ b/fs/xfs/libxfs/xfs_ag.h
@@ -144,13 +144,12 @@ __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
__XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
__XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
-void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
- xfs_agnumber_t agend);
int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
xfs_agnumber_t *maxagi);
+void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
+ xfs_agnumber_t end_agno);
int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
-void xfs_free_perag(struct xfs_mount *mp);
/* Passive AG references */
struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index de2bf0594cb474..b247d895c276d2 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -229,7 +229,7 @@ xfs_growfs_data_private(
xfs_trans_cancel(tp);
out_free_unused_perag:
if (nagcount > oagcount)
- xfs_free_unused_perag_range(mp, oagcount, nagcount);
+ xfs_free_perag_range(mp, oagcount, nagcount);
return error;
}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6fa7239a4a01b6..25bbcc3f4ee08b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1048,7 +1048,7 @@ xfs_mountfs(
xfs_buftarg_drain(mp->m_logdev_targp);
xfs_buftarg_drain(mp->m_ddev_targp);
out_free_perag:
- xfs_free_perag(mp);
+ xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
out_free_dir:
xfs_da_unmount(mp);
out_remove_uuid:
@@ -1129,8 +1129,7 @@ xfs_unmountfs(
xfs_errortag_clearall(mp);
#endif
shrinker_free(mp->m_inodegc_shrinker);
- xfs_free_perag(mp);
-
+ xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
xfs_errortag_del(mp);
xfs_error_sysfs_del(mp);
xchk_stats_unregister(mp->m_scrub_stats);
--
2.45.2
* [PATCH 3/7] xfs: update the file system geometry after recoverying superblock buffers
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
2024-09-30 16:41 ` [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
2024-09-30 16:41 ` [PATCH 2/7] xfs: merge the perag freeing helpers Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-09-30 16:50 ` Darrick J. Wong
2024-10-10 14:03 ` Brian Foster
2024-09-30 16:41 ` [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount Christoph Hellwig
` (3 subsequent siblings)
6 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Primary superblock buffers that change the file system geometry after a
growfs operation can affect the operation of later CIL checkpoints that
make use of the newly added space and allocation groups.
Apply the changes to the in-memory structures as part of recovery pass 2,
to ensure recovery works fine for such cases.
In the future we should apply the same logic to other updates such as
feature bits as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_log_recover.h | 2 ++
fs/xfs/xfs_buf_item_recover.c | 27 +++++++++++++++++++++++++++
fs/xfs/xfs_log_recover.c | 27 +++++++++++++++++++--------
3 files changed, 48 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 521d327e4c89ed..d0e13c84422d0a 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -165,4 +165,6 @@ void xlog_recover_intent_item(struct xlog *log, struct xfs_log_item *lip,
int xlog_recover_finish_intent(struct xfs_trans *tp,
struct xfs_defer_pending *dfp);
+int xlog_recover_update_agcount(struct xfs_mount *mp, struct xfs_dsb *dsb);
+
#endif /* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 09e893cf563cb9..08c129022304a8 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -684,6 +684,28 @@ xlog_recover_do_inode_buffer(
return 0;
}
+static int
+xlog_recover_do_sb_buffer(
+ struct xfs_mount *mp,
+ struct xlog_recover_item *item,
+ struct xfs_buf *bp,
+ struct xfs_buf_log_format *buf_f,
+ xfs_lsn_t current_lsn)
+{
+ xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
+
+ /*
+ * Update the in-memory superblock and perag structures from the
+ * primary SB buffer.
+ *
+ * This is required because transactions running after growfs may require
+ * the updated values to be set in a previously committed transaction.
+ */
+ if (xfs_buf_daddr(bp) != 0)
+ return 0;
+ return xlog_recover_update_agcount(mp, bp->b_addr);
+}
+
/*
* V5 filesystems know the age of the buffer on disk being recovered. We can
* have newer objects on disk than we are replaying, and so for these cases we
@@ -967,6 +989,11 @@ xlog_recover_buf_commit_pass2(
dirty = xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f);
if (!dirty)
goto out_release;
+ } else if (xfs_blft_from_flags(buf_f) & XFS_BLFT_SB_BUF) {
+ error = xlog_recover_do_sb_buffer(mp, item, bp, buf_f,
+ current_lsn);
+ if (error)
+ goto out_release;
} else {
xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
}
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 6a165ca55da1a8..03701409c7dcd6 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3334,6 +3334,25 @@ xlog_do_log_recovery(
return error;
}
+int
+xlog_recover_update_agcount(
+ struct xfs_mount *mp,
+ struct xfs_dsb *dsb)
+{
+ xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
+ int error;
+
+ xfs_sb_from_disk(&mp->m_sb, dsb);
+ error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
+ mp->m_sb.sb_dblocks, &mp->m_maxagi);
+ if (error) {
+ xfs_warn(mp, "Failed recovery per-ag init: %d", error);
+ return error;
+ }
+ mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+ return 0;
+}
+
/*
* Do the actual recovery
*/
@@ -3346,7 +3365,6 @@ xlog_do_recover(
struct xfs_mount *mp = log->l_mp;
struct xfs_buf *bp = mp->m_sb_bp;
struct xfs_sb *sbp = &mp->m_sb;
- xfs_agnumber_t old_agcount = sbp->sb_agcount;
int error;
trace_xfs_log_recover(log, head_blk, tail_blk);
@@ -3394,13 +3412,6 @@ xlog_do_recover(
/* re-initialise in-core superblock and geometry structures */
mp->m_features |= xfs_sb_version_to_features(sbp);
xfs_reinit_percpu_counters(mp);
- error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
- sbp->sb_dblocks, &mp->m_maxagi);
- if (error) {
- xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
- return error;
- }
- mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
/* Normal transactions can now occur */
clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate);
--
2.45.2
* [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
` (2 preceding siblings ...)
2024-09-30 16:41 ` [PATCH 3/7] xfs: update the file system geometry after recoverying superblock buffers Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-09-30 16:51 ` Darrick J. Wong
2024-10-10 14:04 ` Brian Foster
2024-09-30 16:41 ` [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag Christoph Hellwig
` (2 subsequent siblings)
6 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
XFS currently does not support reducing the agcount, so error out if
a logged sb buffer tries to shrink the agcount.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_log_recover.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 03701409c7dcd6..3b5cd240bb62ef 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3343,6 +3343,10 @@ xlog_recover_update_agcount(
int error;
xfs_sb_from_disk(&mp->m_sb, dsb);
+ if (mp->m_sb.sb_agcount < old_agcount) {
+ xfs_alert(mp, "Shrinking AG count in log recovery");
+ return -EFSCORRUPTED;
+ }
error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
mp->m_sb.sb_dblocks, &mp->m_maxagi);
if (error) {
--
2.45.2
* [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
` (3 preceding siblings ...)
2024-09-30 16:41 ` [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-10-10 14:04 ` Brian Foster
2024-09-30 16:41 ` [PATCH 6/7] xfs: don't update file system geometry through transaction deltas Christoph Hellwig
2024-09-30 16:41 ` [PATCH 7/7] xfs: split xfs_trans_mod_sb Christoph Hellwig
6 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
__GFP_RETRY_MAYFAIL increases the likelihood that allocations fail,
which isn't really helpful during log recovery. Remove the flag and
stick to the default GFP_KERNEL policies.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/libxfs/xfs_ag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
index 8fac0ce45b1559..29feaed7c8f880 100644
--- a/fs/xfs/libxfs/xfs_ag.c
+++ b/fs/xfs/libxfs/xfs_ag.c
@@ -289,7 +289,7 @@ xfs_initialize_perag(
return 0;
for (index = old_agcount; index < new_agcount; index++) {
- pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+ pag = kzalloc(sizeof(*pag), GFP_KERNEL);
if (!pag) {
error = -ENOMEM;
goto out_unwind_new_pags;
--
2.45.2
* [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
` (4 preceding siblings ...)
2024-09-30 16:41 ` [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-10-10 14:05 ` Brian Foster
2024-10-10 19:01 ` Darrick J. Wong
2024-09-30 16:41 ` [PATCH 7/7] xfs: split xfs_trans_mod_sb Christoph Hellwig
6 siblings, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Updates to the file system geometry in growfs need to be committed to
stable store before the allocator can see them, so that they are not
in the same CIL checkpoint as transactions that make use of this new
information, which would make recovery impossible or broken.
To do this add two new helpers to prepare a superblock for direct
manipulation of the on-disk buffer, and to commit these updates while
holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
those in growfs instead of applying the changes through the deltas in the
xfs_trans structure (which also happens to shrink the xfs_trans structure
a fair bit).
The rtbitmap repair code was also using the transaction deltas and is
converted to also update the superblock buffer directly under the buffer
lock.
This new method establishes a locking protocol where even in-core
superblock fields must only be updated with the superblock buffer
locked. For now it is only applied to affected geometry fields,
but in the future it would make sense to apply it universally.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_sb.c | 97 ++++++++++++++++++++++++-------
fs/xfs/libxfs/xfs_sb.h | 3 +
fs/xfs/libxfs/xfs_shared.h | 8 ---
fs/xfs/scrub/rtbitmap_repair.c | 26 +++++----
fs/xfs/xfs_fsops.c | 80 ++++++++++++++++----------
fs/xfs/xfs_rtalloc.c | 92 +++++++++++++++++-------------
fs/xfs/xfs_trans.c | 101 ++-------------------------------
fs/xfs/xfs_trans.h | 8 ---
8 files changed, 198 insertions(+), 217 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d95409f3cba667..2c83ab7441ade5 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1025,6 +1025,80 @@ xfs_sb_mount_common(
mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
}
+/*
+ * Mirror the lazy sb counters to the in-core superblock.
+ *
+ * If this is at unmount, the counters will be exactly correct, but at any other
+ * time they will only be ballpark correct because of reservations that have
+ * been taken out percpu counters. If we have an unclean shutdown, this will be
+ * corrected by log recovery rebuilding the counters from the AGF block counts.
+ *
+ * Do not update sb_frextents here because it is not part of the lazy sb
+ * counters, despite having a percpu counter. It is always kept consistent with
+ * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't
+ * have to update it here.
+ */
+static void
+xfs_flush_sb_counters(
+ struct xfs_mount *mp)
+{
+ if (xfs_has_lazysbcount(mp)) {
+ mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
+ mp->m_sb.sb_ifree = min_t(uint64_t,
+ percpu_counter_sum_positive(&mp->m_ifree),
+ mp->m_sb.sb_icount);
+ mp->m_sb.sb_fdblocks =
+ percpu_counter_sum_positive(&mp->m_fdblocks);
+ }
+}
+
+/*
+ * Prepare a direct update to the superblock through the on-disk buffer.
+ *
+ * This locks out other modifications through the buffer lock and then syncs all
+ * in-core values to the on-disk buffer (including the percpu counters).
+ *
+ * The caller then directly manipulates the on-disk fields and calls
+ * xfs_commit_sb_update to write the updates to disk. The caller is
+ * responsible for also updating the in-core fields, but can do so after
+ * the transaction has been committed to disk.
+ *
+ * Updating the in-core field only after xfs_commit_sb_update ensures that other
+ * processes only see the update once it is stable on disk, and is usually the
+ * right thing to do for superblock updates.
+ *
+ * Note that writes to superblock fields updated using this helper are
+ * synchronized using the superblock buffer lock, which must be taken around
+ * all updates to the in-core fields as well.
+ */
+struct xfs_dsb *
+xfs_prepare_sb_update(
+ struct xfs_trans *tp,
+ struct xfs_buf **bpp)
+{
+ *bpp = xfs_trans_getsb(tp);
+ xfs_flush_sb_counters(tp->t_mountp);
+ xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
+ return (*bpp)->b_addr;
+}
+
+/*
+ * Commit a direct update to the on-disk superblock. Keeps @bp locked and
+ * referenced, so the caller must call xfs_buf_relse() manually.
+ */
+int
+xfs_commit_sb_update(
+ struct xfs_trans *tp,
+ struct xfs_buf *bp)
+{
+ xfs_trans_bhold(tp, bp);
+ xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
+ xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
+
+ xfs_trans_set_sync(tp);
+ return xfs_trans_commit(tp);
+}
+
/*
* xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
* into the superblock buffer to be logged. It does not provide the higher
@@ -1038,28 +1112,7 @@ xfs_log_sb(
struct xfs_mount *mp = tp->t_mountp;
struct xfs_buf *bp = xfs_trans_getsb(tp);
- /*
- * Lazy sb counters don't update the in-core superblock so do that now.
- * If this is at unmount, the counters will be exactly correct, but at
- * any other time they will only be ballpark correct because of
- * reservations that have been taken out percpu counters. If we have an
- * unclean shutdown, this will be corrected by log recovery rebuilding
- * the counters from the AGF block counts.
- *
- * Do not update sb_frextents here because it is not part of the lazy
- * sb counters, despite having a percpu counter. It is always kept
- * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
- * and hence we don't need have to update it here.
- */
- if (xfs_has_lazysbcount(mp)) {
- mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
- mp->m_sb.sb_ifree = min_t(uint64_t,
- percpu_counter_sum_positive(&mp->m_ifree),
- mp->m_sb.sb_icount);
- mp->m_sb.sb_fdblocks =
- percpu_counter_sum_positive(&mp->m_fdblocks);
- }
-
+ xfs_flush_sb_counters(mp);
xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 885c837559914d..3649d071687e33 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -13,6 +13,9 @@ struct xfs_trans;
struct xfs_fsop_geom;
struct xfs_perag;
+struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
+ struct xfs_buf **bpp);
+int xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
extern void xfs_log_sb(struct xfs_trans *tp);
extern int xfs_sync_sb(struct xfs_mount *mp, bool wait);
extern int xfs_sync_sb_buf(struct xfs_mount *mp);
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 33b84a3a83ff63..45a32ea426164a 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -149,14 +149,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
#define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008
#define XFS_TRANS_SB_FREXTENTS 0x00000010
#define XFS_TRANS_SB_RES_FREXTENTS 0x00000020
-#define XFS_TRANS_SB_DBLOCKS 0x00000040
-#define XFS_TRANS_SB_AGCOUNT 0x00000080
-#define XFS_TRANS_SB_IMAXPCT 0x00000100
-#define XFS_TRANS_SB_REXTSIZE 0x00000200
-#define XFS_TRANS_SB_RBMBLOCKS 0x00000400
-#define XFS_TRANS_SB_RBLOCKS 0x00000800
-#define XFS_TRANS_SB_REXTENTS 0x00001000
-#define XFS_TRANS_SB_REXTSLOG 0x00002000
/*
* Here we centralize the specification of XFS meta-data buffer reference count
diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
index 0fef98e9f83409..be9d31f032b1bf 100644
--- a/fs/xfs/scrub/rtbitmap_repair.c
+++ b/fs/xfs/scrub/rtbitmap_repair.c
@@ -16,6 +16,7 @@
#include "xfs_bit.h"
#include "xfs_bmap.h"
#include "xfs_bmap_btree.h"
+#include "xfs_sb.h"
#include "scrub/scrub.h"
#include "scrub/common.h"
#include "scrub/trace.h"
@@ -127,20 +128,21 @@ xrep_rtbitmap_geometry(
struct xchk_rtbitmap *rtb)
{
struct xfs_mount *mp = sc->mp;
- struct xfs_trans *tp = sc->tp;
/* Superblock fields */
- if (mp->m_sb.sb_rextents != rtb->rextents)
- xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
- rtb->rextents - mp->m_sb.sb_rextents);
-
- if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
- rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
-
- if (mp->m_sb.sb_rextslog != rtb->rextslog)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
- rtb->rextslog - mp->m_sb.sb_rextslog);
+ if (mp->m_sb.sb_rextents != rtb->rextents ||
+ mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
+ mp->m_sb.sb_rextslog != rtb->rextslog) {
+ struct xfs_buf *bp = xfs_trans_getsb(sc->tp);
+
+ mp->m_sb.sb_rextents = rtb->rextents;
+ mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
+ mp->m_sb.sb_rextslog = rtb->rextslog;
+ xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
+
+ xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
+ xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
+ }
/* Fix broken isize */
sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b247d895c276d2..4168ccf21068cb 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
return error;
}
+static int
+xfs_growfs_data_update_sb(
+ struct xfs_trans *tp,
+ xfs_agnumber_t nagcount,
+ xfs_rfsblock_t nb,
+ xfs_agnumber_t nagimax)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+ struct xfs_dsb *sbp;
+ struct xfs_buf *bp;
+ int error;
+
+ /*
+ * Update the geometry in the on-disk superblock first, and ensure the
+ * changes make it to disk before the superblock can be relogged.
+ */
+ sbp = xfs_prepare_sb_update(tp, &bp);
+ sbp->sb_agcount = cpu_to_be32(nagcount);
+ sbp->sb_dblocks = cpu_to_be64(nb);
+ error = xfs_commit_sb_update(tp, bp);
+ if (error)
+ goto out_unlock;
+
+ /*
+ * Propagate the new values to the live mount structure after they made
+ * it to disk with the superblock buffer still locked.
+ */
+ mp->m_sb.sb_agcount = nagcount;
+ mp->m_sb.sb_dblocks = nb;
+
+ if (nagimax)
+ mp->m_maxagi = nagimax;
+ xfs_set_low_space_thresholds(mp);
+ mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+
+out_unlock:
+ xfs_buf_relse(bp);
+ return error;
+}
+
/*
* growfs operations
*/
@@ -171,37 +211,13 @@ xfs_growfs_data_private(
if (error)
goto out_trans_cancel;
- /*
- * Update changed superblock fields transactionally. These are not
- * seen by the rest of the world until the transaction commit applies
- * them atomically to the superblock.
- */
- if (nagcount > oagcount)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
- if (delta)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
if (id.nfree)
xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
- /*
- * Sync sb counters now to reflect the updated values. This is
- * particularly important for shrink because the write verifier
- * will fail if sb_fdblocks is ever larger than sb_dblocks.
- */
- if (xfs_has_lazysbcount(mp))
- xfs_log_sb(tp);
-
- xfs_trans_set_sync(tp);
- error = xfs_trans_commit(tp);
+ error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
if (error)
return error;
- /* New allocation groups fully initialized, so update mount struct */
- if (nagimax)
- mp->m_maxagi = nagimax;
- xfs_set_low_space_thresholds(mp);
- mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
-
if (delta > 0) {
/*
* If we expanded the last AG, free the per-AG reservation
@@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
struct xfs_mount *mp,
__u32 imaxpct)
{
+ struct xfs_dsb *sbp;
+ struct xfs_buf *bp;
struct xfs_trans *tp;
- int dpct;
int error;
if (imaxpct > 100)
@@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
if (error)
return error;
- dpct = imaxpct - mp->m_sb.sb_imax_pct;
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
- xfs_trans_set_sync(tp);
- return xfs_trans_commit(tp);
+ sbp = xfs_prepare_sb_update(tp, &bp);
+ sbp->sb_imax_pct = imaxpct;
+ error = xfs_commit_sb_update(tp, bp);
+ if (!error)
+ mp->m_sb.sb_imax_pct = imaxpct;
+ xfs_buf_relse(bp);
+ return error;
}
/*
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 3a2005a1e673dc..994e5efedab20f 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
return error;
}
+static int
+xfs_growfs_rt_update_sb(
+ struct xfs_trans *tp,
+ struct xfs_mount *mp,
+ struct xfs_mount *nmp,
+ xfs_rtbxlen_t freed_rtx)
+{
+ struct xfs_dsb *sbp;
+ struct xfs_buf *bp;
+ int error;
+
+ /*
+ * Update the geometry fields in the on-disk superblock first, and
+ * ensure they make it to disk before the superblock can be relogged.
+ */
+ sbp = xfs_prepare_sb_update(tp, &bp);
+ sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
+ sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
+ sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
+ sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
+ sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
+ error = xfs_commit_sb_update(tp, bp);
+ if (error)
+ return error;
+
+ /*
+ * Propagate the new values to the live mount structure after they have
+ * made it to disk, while the superblock buffer is still locked.
+ */
+ mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
+ mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
+ mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
+ mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
+ mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
+ mp->m_rsumlevels = nmp->m_rsumlevels;
+ mp->m_rsumblocks = nmp->m_rsumblocks;
+
+ /*
+ * Recompute the growfsrt reservation from the new rsumsize.
+ */
+ xfs_trans_resv_calc(mp, &mp->m_resv);
+
+ /*
+ * Ensure the mount RT feature flag is now set.
+ */
+ mp->m_features |= XFS_FEAT_REALTIME;
+ xfs_buf_relse(bp);
+ return 0;
+}
+
static int
xfs_growfs_rt_bmblock(
struct xfs_mount *mp,
@@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
goto out_cancel;
}
- /*
- * Update superblock fields.
- */
- if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
- nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
- if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
- nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
- if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
- nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
- if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
- nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
- if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
- nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
-
/*
* Free the new extent.
*/
@@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
xfs_rtbuf_cache_relse(&nargs);
if (error)
goto out_cancel;
-
- /*
- * Mark more blocks free in the superblock.
- */
xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
- /*
- * Update the calculated values in the real mount structure.
- */
- mp->m_rsumlevels = nmp->m_rsumlevels;
- mp->m_rsumblocks = nmp->m_rsumblocks;
- xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
-
- /*
- * Recompute the growfsrt reservation from the new rsumsize.
- */
- xfs_trans_resv_calc(mp, &mp->m_resv);
-
- error = xfs_trans_commit(args.tp);
+ error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
if (error)
goto out_free;
- /*
- * Ensure the mount RT feature flag is now set.
- */
- mp->m_features |= XFS_FEAT_REALTIME;
-
kfree(nmp);
return 0;
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bdf3704dc30118..56505cb94f877d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -430,31 +430,6 @@ xfs_trans_mod_sb(
ASSERT(delta < 0);
tp->t_res_frextents_delta += delta;
break;
- case XFS_TRANS_SB_DBLOCKS:
- tp->t_dblocks_delta += delta;
- break;
- case XFS_TRANS_SB_AGCOUNT:
- ASSERT(delta > 0);
- tp->t_agcount_delta += delta;
- break;
- case XFS_TRANS_SB_IMAXPCT:
- tp->t_imaxpct_delta += delta;
- break;
- case XFS_TRANS_SB_REXTSIZE:
- tp->t_rextsize_delta += delta;
- break;
- case XFS_TRANS_SB_RBMBLOCKS:
- tp->t_rbmblocks_delta += delta;
- break;
- case XFS_TRANS_SB_RBLOCKS:
- tp->t_rblocks_delta += delta;
- break;
- case XFS_TRANS_SB_REXTENTS:
- tp->t_rextents_delta += delta;
- break;
- case XFS_TRANS_SB_REXTSLOG:
- tp->t_rextslog_delta += delta;
- break;
default:
ASSERT(0);
return;
@@ -475,12 +450,8 @@ STATIC void
xfs_trans_apply_sb_deltas(
xfs_trans_t *tp)
{
- struct xfs_dsb *sbp;
- struct xfs_buf *bp;
- int whole = 0;
-
- bp = xfs_trans_getsb(tp);
- sbp = bp->b_addr;
+ struct xfs_buf *bp = xfs_trans_getsb(tp);
+ struct xfs_dsb *sbp = bp->b_addr;
/*
* Only update the superblock counters if we are logging them
@@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
spin_unlock(&mp->m_sb_lock);
}
- if (tp->t_dblocks_delta) {
- be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
- whole = 1;
- }
- if (tp->t_agcount_delta) {
- be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
- whole = 1;
- }
- if (tp->t_imaxpct_delta) {
- sbp->sb_imax_pct += tp->t_imaxpct_delta;
- whole = 1;
- }
- if (tp->t_rextsize_delta) {
- be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
- whole = 1;
- }
- if (tp->t_rbmblocks_delta) {
- be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
- whole = 1;
- }
- if (tp->t_rblocks_delta) {
- be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
- whole = 1;
- }
- if (tp->t_rextents_delta) {
- be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
- whole = 1;
- }
- if (tp->t_rextslog_delta) {
- sbp->sb_rextslog += tp->t_rextslog_delta;
- whole = 1;
- }
-
xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
- if (whole)
- /*
- * Log the whole thing, the fields are noncontiguous.
- */
- xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
- else
- /*
- * Since all the modifiable fields are contiguous, we
- * can get away with this.
- */
- xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
- offsetof(struct xfs_dsb, sb_frextents) +
- sizeof(sbp->sb_frextents) - 1);
+ xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
+ offsetof(struct xfs_dsb, sb_frextents) +
+ sizeof(sbp->sb_frextents) - 1);
}
/*
@@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
* must be consistent with the ondisk rtbitmap and must never include
* incore reservations.
*/
- mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
- mp->m_sb.sb_agcount += tp->t_agcount_delta;
- mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
- mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
- if (tp->t_rextsize_delta) {
- mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
- mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
- }
- mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
- mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
- mp->m_sb.sb_rextents += tp->t_rextents_delta;
- mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
spin_unlock(&mp->m_sb_lock);
-
- /*
- * Debug checks outside of the spinlock so they don't lock up the
- * machine if they fail.
- */
- ASSERT(mp->m_sb.sb_imax_pct >= 0);
- ASSERT(mp->m_sb.sb_rextslog >= 0);
}
/* Add the given log item to the transaction's list of log items. */
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f06cc0f41665ad..e5911cf09be444 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -140,14 +140,6 @@ typedef struct xfs_trans {
int64_t t_res_fdblocks_delta; /* on-disk only chg */
int64_t t_frextents_delta;/* superblock freextents chg*/
int64_t t_res_frextents_delta; /* on-disk only chg */
- int64_t t_dblocks_delta;/* superblock dblocks change */
- int64_t t_agcount_delta;/* superblock agcount change */
- int64_t t_imaxpct_delta;/* superblock imaxpct change */
- int64_t t_rextsize_delta;/* superblock rextsize chg */
- int64_t t_rbmblocks_delta;/* superblock rbmblocks chg */
- int64_t t_rblocks_delta;/* superblock rblocks change */
- int64_t t_rextents_delta;/* superblocks rextents chg */
- int64_t t_rextslog_delta;/* superblocks rextslog chg */
struct list_head t_items; /* log item descriptors */
struct list_head t_busy; /* list of busy extents */
struct list_head t_dfops; /* deferred operations */
--
2.45.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH 7/7] xfs: split xfs_trans_mod_sb
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
` (5 preceding siblings ...)
2024-09-30 16:41 ` [PATCH 6/7] xfs: don't update file system geometry through transaction deltas Christoph Hellwig
@ 2024-09-30 16:41 ` Christoph Hellwig
2024-10-10 14:06 ` Brian Foster
6 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-09-30 16:41 UTC (permalink / raw)
To: Chandan Babu R; +Cc: Darrick J. Wong, linux-xfs
Split xfs_trans_mod_sb into separate helpers for the different counters.
While the icount and ifree counters get their own helpers, the handling
for fdblocks and frextents merges the delalloc and non-delalloc cases
to keep the related code together.
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/libxfs/xfs_ag_resv.c | 18 +++--
fs/xfs/libxfs/xfs_ialloc.c | 14 ++--
fs/xfs/libxfs/xfs_rtbitmap.c | 3 +-
fs/xfs/libxfs/xfs_shared.h | 10 ---
fs/xfs/xfs_fsops.c | 2 +-
fs/xfs/xfs_rtalloc.c | 6 +-
fs/xfs/xfs_trans.c | 130 +++++++++++++++--------------------
fs/xfs/xfs_trans.h | 7 +-
fs/xfs/xfs_trans_dquot.c | 2 +-
9 files changed, 82 insertions(+), 110 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 216423df939e5c..bb518d6a2dcecd 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -341,7 +341,6 @@ xfs_ag_resv_alloc_extent(
{
struct xfs_ag_resv *resv;
xfs_extlen_t len;
- uint field;
trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
@@ -356,9 +355,8 @@ xfs_ag_resv_alloc_extent(
ASSERT(0);
fallthrough;
case XFS_AG_RESV_NONE:
- field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS :
- XFS_TRANS_SB_FDBLOCKS;
- xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
+ xfs_trans_mod_fdblocks(args->tp, -(int64_t)args->len,
+ args->wasdel);
return;
}
@@ -367,11 +365,11 @@ xfs_ag_resv_alloc_extent(
if (type == XFS_AG_RESV_RMAPBT)
return;
/* Allocations of reserved blocks only need on-disk sb updates... */
- xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len);
+ xfs_trans_mod_fdblocks(args->tp, -(int64_t)len, true);
/* ...but non-reserved blocks need in-core and on-disk updates. */
if (args->len > len)
- xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_FDBLOCKS,
- -((int64_t)args->len - len));
+ xfs_trans_mod_fdblocks(args->tp, -((int64_t)args->len - len),
+ false);
}
/* Free a block to the reservation. */
@@ -398,7 +396,7 @@ xfs_ag_resv_free_extent(
ASSERT(0);
fallthrough;
case XFS_AG_RESV_NONE:
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
+ xfs_trans_mod_fdblocks(tp, (int64_t)len, false);
fallthrough;
case XFS_AG_RESV_IGNORE:
return;
@@ -409,8 +407,8 @@ xfs_ag_resv_free_extent(
if (type == XFS_AG_RESV_RMAPBT)
return;
/* Freeing into the reserved pool only requires on-disk update... */
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
+ xfs_trans_mod_fdblocks(tp, len, true);
/* ...but freeing beyond that requires in-core and on-disk update. */
if (len > leftover)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover);
+ xfs_trans_mod_fdblocks(tp, len - leftover, false);
}
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 271855227514cb..ad28823debb6f1 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -970,8 +970,8 @@ xfs_ialloc_ag_alloc(
/*
* Modify/log superblock values for inode count and inode free count.
*/
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, (long)newlen);
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, (long)newlen);
+ xfs_trans_mod_icount(tp, (long)newlen);
+ xfs_trans_mod_ifree(tp, (long)newlen);
return 0;
}
@@ -1357,7 +1357,7 @@ xfs_dialloc_ag_inobt(
goto error0;
xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+ xfs_trans_mod_ifree(tp, -1);
*inop = ino;
return 0;
error1:
@@ -1660,7 +1660,7 @@ xfs_dialloc_ag(
xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
pag->pagi_freecount--;
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
+ xfs_trans_mod_ifree(tp, -1);
error = xfs_check_agi_freecount(icur);
if (error)
@@ -2139,8 +2139,8 @@ xfs_difree_inobt(
xfs_ialloc_log_agi(tp, agbp, XFS_AGI_COUNT | XFS_AGI_FREECOUNT);
pag->pagi_freecount -= ilen - 1;
pag->pagi_count -= ilen;
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, -ilen);
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -(ilen - 1));
+ xfs_trans_mod_icount(tp, -ilen);
+ xfs_trans_mod_ifree(tp, -(ilen - 1));
if ((error = xfs_btree_delete(cur, &i))) {
xfs_warn(mp, "%s: xfs_btree_delete returned error %d.",
@@ -2167,7 +2167,7 @@ xfs_difree_inobt(
be32_add_cpu(&agi->agi_freecount, 1);
xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
pag->pagi_freecount++;
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, 1);
+ xfs_trans_mod_ifree(tp, 1);
}
error = xfs_check_agi_freecount(cur);
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 27a4472402bacd..d0c693a69e0001 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -989,7 +989,8 @@ xfs_rtfree_extent(
/*
* Mark more blocks free in the superblock.
*/
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS, (long)len);
+ xfs_trans_mod_frextents(tp, (long)len, false);
+
/*
* If we've now freed all the blocks, reset the file sequence
* number to 0.
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 45a32ea426164a..6b5a7bfc32dbb8 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -140,16 +140,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
/* Transaction has locked the rtbitmap and rtsum inodes */
#define XFS_TRANS_RTBITMAP_LOCKED (1u << 9)
-/*
- * Field values for xfs_trans_mod_sb.
- */
-#define XFS_TRANS_SB_ICOUNT 0x00000001
-#define XFS_TRANS_SB_IFREE 0x00000002
-#define XFS_TRANS_SB_FDBLOCKS 0x00000004
-#define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008
-#define XFS_TRANS_SB_FREXTENTS 0x00000010
-#define XFS_TRANS_SB_RES_FREXTENTS 0x00000020
-
/*
* Here we centralize the specification of XFS meta-data buffer reference count
* values. This determines how hard the buffer cache tries to hold onto the
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 4168ccf21068cb..ac88a38c6cd522 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -212,7 +212,7 @@ xfs_growfs_data_private(
goto out_trans_cancel;
if (id.nfree)
- xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
+ xfs_trans_mod_fdblocks(tp, id.nfree, false);
error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
if (error)
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 994e5efedab20f..07f6008db322cb 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -838,7 +838,7 @@ xfs_growfs_rt_bmblock(
xfs_rtbuf_cache_relse(&nargs);
if (error)
goto out_cancel;
- xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
+ xfs_trans_mod_frextents(args.tp, freed_rtx, false);
error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
if (error)
@@ -1335,9 +1335,7 @@ xfs_rtallocate(
if (error)
goto out_release;
- xfs_trans_mod_sb(tp, wasdel ?
- XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
- -(long)len);
+ xfs_trans_mod_frextents(tp, -(long)len, wasdel);
*bno = xfs_rtx_to_rtb(args.mp, rtx);
*blen = xfs_rtxlen_to_extlen(args.mp, len);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 56505cb94f877d..fa133535235d4c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -334,48 +334,43 @@ xfs_trans_alloc_empty(
return xfs_trans_alloc(mp, &resv, 0, 0, XFS_TRANS_NO_WRITECOUNT, tpp);
}
-/*
- * Record the indicated change to the given field for application
- * to the file system's superblock when the transaction commits.
- * For now, just store the change in the transaction structure.
- *
- * Mark the transaction structure to indicate that the superblock
- * needs to be updated before committing.
- *
- * Because we may not be keeping track of allocated/free inodes and
- * used filesystem blocks in the superblock, we do not mark the
- * superblock dirty in this transaction if we modify these fields.
- * We still need to update the transaction deltas so that they get
- * applied to the incore superblock, but we don't want them to
- * cause the superblock to get locked and logged if these are the
- * only fields in the superblock that the transaction modifies.
- */
void
-xfs_trans_mod_sb(
- xfs_trans_t *tp,
- uint field,
- int64_t delta)
+xfs_trans_mod_icount(
+ struct xfs_trans *tp,
+ int64_t delta)
+{
+ tp->t_icount_delta += delta;
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ if (!xfs_has_lazysbcount(tp->t_mountp))
+ tp->t_flags |= XFS_TRANS_SB_DIRTY;
+}
+
+void
+xfs_trans_mod_ifree(
+ struct xfs_trans *tp,
+ int64_t delta)
{
- uint32_t flags = (XFS_TRANS_DIRTY|XFS_TRANS_SB_DIRTY);
- xfs_mount_t *mp = tp->t_mountp;
-
- switch (field) {
- case XFS_TRANS_SB_ICOUNT:
- tp->t_icount_delta += delta;
- if (xfs_has_lazysbcount(mp))
- flags &= ~XFS_TRANS_SB_DIRTY;
- break;
- case XFS_TRANS_SB_IFREE:
- tp->t_ifree_delta += delta;
- if (xfs_has_lazysbcount(mp))
- flags &= ~XFS_TRANS_SB_DIRTY;
- break;
- case XFS_TRANS_SB_FDBLOCKS:
+ tp->t_ifree_delta += delta;
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ if (!xfs_has_lazysbcount(tp->t_mountp))
+ tp->t_flags |= XFS_TRANS_SB_DIRTY;
+}
+
+void
+xfs_trans_mod_fdblocks(
+ struct xfs_trans *tp,
+ int64_t delta,
+ bool wasdel)
+{
+ struct xfs_mount *mp = tp->t_mountp;
+
+ if (wasdel) {
/*
- * Track the number of blocks allocated in the transaction.
- * Make sure it does not exceed the number reserved. If so,
- * shutdown as this can lead to accounting inconsistency.
+ * The allocation has already been applied to the in-core
+ * counter; only apply it to the on-disk superblock.
*/
+ tp->t_res_fdblocks_delta += delta;
+ } else {
if (delta < 0) {
tp->t_blk_res_used += (uint)-delta;
if (tp->t_blk_res_used > tp->t_blk_res)
@@ -396,55 +391,40 @@ xfs_trans_mod_sb(
delta -= blkres_delta;
}
tp->t_fdblocks_delta += delta;
- if (xfs_has_lazysbcount(mp))
- flags &= ~XFS_TRANS_SB_DIRTY;
- break;
- case XFS_TRANS_SB_RES_FDBLOCKS:
- /*
- * The allocation has already been applied to the
- * in-core superblock's counter. This should only
- * be applied to the on-disk superblock.
- */
- tp->t_res_fdblocks_delta += delta;
- if (xfs_has_lazysbcount(mp))
- flags &= ~XFS_TRANS_SB_DIRTY;
- break;
- case XFS_TRANS_SB_FREXTENTS:
+ }
+
+ tp->t_flags |= XFS_TRANS_DIRTY;
+ if (!xfs_has_lazysbcount(mp))
+ tp->t_flags |= XFS_TRANS_SB_DIRTY;
+}
+
+void
+xfs_trans_mod_frextents(
+ struct xfs_trans *tp,
+ int64_t delta,
+ bool wasdel)
+{
+ if (wasdel) {
/*
- * Track the number of blocks allocated in the
- * transaction. Make sure it does not exceed the
- * number reserved.
+ * The allocation has already been applied to the in-core
+ * counter; only apply it to the on-disk superblock.
*/
+ ASSERT(delta < 0);
+ tp->t_res_frextents_delta += delta;
+ } else {
if (delta < 0) {
tp->t_rtx_res_used += (uint)-delta;
ASSERT(tp->t_rtx_res_used <= tp->t_rtx_res);
}
tp->t_frextents_delta += delta;
- break;
- case XFS_TRANS_SB_RES_FREXTENTS:
- /*
- * The allocation has already been applied to the
- * in-core superblock's counter. This should only
- * be applied to the on-disk superblock.
- */
- ASSERT(delta < 0);
- tp->t_res_frextents_delta += delta;
- break;
- default:
- ASSERT(0);
- return;
}
- tp->t_flags |= flags;
+ tp->t_flags |= (XFS_TRANS_DIRTY | XFS_TRANS_SB_DIRTY);
}
/*
- * xfs_trans_apply_sb_deltas() is called from the commit code
- * to bring the superblock buffer into the current transaction
- * and modify it as requested by earlier calls to xfs_trans_mod_sb().
- *
- * For now we just look at each field allowed to change and change
- * it if necessary.
+ * Called from the commit code to bring the superblock buffer into the current
+ * transaction and modify it based on earlier calls to xfs_trans_mod_*().
*/
STATIC void
xfs_trans_apply_sb_deltas(
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index e5911cf09be444..a2cee42368bd25 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -162,7 +162,12 @@ int xfs_trans_reserve_more(struct xfs_trans *tp,
unsigned int blocks, unsigned int rtextents);
int xfs_trans_alloc_empty(struct xfs_mount *mp,
struct xfs_trans **tpp);
-void xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
+void xfs_trans_mod_icount(struct xfs_trans *tp, int64_t delta);
+void xfs_trans_mod_ifree(struct xfs_trans *tp, int64_t delta);
+void xfs_trans_mod_fdblocks(struct xfs_trans *tp, int64_t delta,
+ bool wasdel);
+void xfs_trans_mod_frextents(struct xfs_trans *tp, int64_t delta,
+ bool wasdel);
int xfs_trans_get_buf_map(struct xfs_trans *tp, struct xfs_buftarg *target,
struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags,
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index b368e13424c4f4..839eb1780d4694 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -288,7 +288,7 @@ xfs_trans_get_dqtrx(
/*
* Make the changes in the transaction structure.
- * The moral equivalent to xfs_trans_mod_sb().
+ *
* We don't touch any fields in the dquot, so we don't care
* if it's locked or not (most of the time it won't be).
*/
--
2.45.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers
2024-09-30 16:41 ` [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers Christoph Hellwig
@ 2024-09-30 16:50 ` Darrick J. Wong
2024-10-01 8:49 ` Christoph Hellwig
2024-10-10 14:03 ` Brian Foster
1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:50 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Mon, Sep 30, 2024 at 06:41:44PM +0200, Christoph Hellwig wrote:
> Primary superblock buffers that change the file system geometry after a
> growfs operation can affect the operation of later CIL checkpoints that
> make use of the newly added space and allocation groups.
>
> Apply the changes to the in-memory structures as part of recovery pass 2,
> to ensure recovery works fine for such cases.
>
> In the future we should apply the logic to other updates such as feature
> bits as well.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_log_recover.h | 2 ++
> fs/xfs/xfs_buf_item_recover.c | 27 +++++++++++++++++++++++++++
> fs/xfs/xfs_log_recover.c | 27 +++++++++++++++++++--------
> 3 files changed, 48 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
> index 521d327e4c89ed..d0e13c84422d0a 100644
> --- a/fs/xfs/libxfs/xfs_log_recover.h
> +++ b/fs/xfs/libxfs/xfs_log_recover.h
> @@ -165,4 +165,6 @@ void xlog_recover_intent_item(struct xlog *log, struct xfs_log_item *lip,
> int xlog_recover_finish_intent(struct xfs_trans *tp,
> struct xfs_defer_pending *dfp);
>
> +int xlog_recover_update_agcount(struct xfs_mount *mp, struct xfs_dsb *dsb);
> +
> #endif /* __XFS_LOG_RECOVER_H__ */
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 09e893cf563cb9..08c129022304a8 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -684,6 +684,28 @@ xlog_recover_do_inode_buffer(
> return 0;
> }
>
> +static int
> +xlog_recover_do_sb_buffer(
> + struct xfs_mount *mp,
> + struct xlog_recover_item *item,
> + struct xfs_buf *bp,
> + struct xfs_buf_log_format *buf_f,
> + xfs_lsn_t current_lsn)
> +{
> + xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
> +
> + /*
> + * Update the in-memory superblock and perag structures from the
> + * primary SB buffer.
> + *
> + * This is required because transactions running after growfs may require
> + * the updated values to be set in a previous fully committed transaction.
> + */
> + if (xfs_buf_daddr(bp) != 0)
> + return 0;
> + return xlog_recover_update_agcount(mp, bp->b_addr);
> +}
> +
> /*
> * V5 filesystems know the age of the buffer on disk being recovered. We can
> * have newer objects on disk than we are replaying, and so for these cases we
> @@ -967,6 +989,11 @@ xlog_recover_buf_commit_pass2(
> dirty = xlog_recover_do_dquot_buffer(mp, log, item, bp, buf_f);
> if (!dirty)
> goto out_release;
> + } else if (xfs_blft_from_flags(buf_f) & XFS_BLFT_SB_BUF) {
> + error = xlog_recover_do_sb_buffer(mp, item, bp, buf_f,
> + current_lsn);
> + if (error)
> + goto out_release;
> } else {
> xlog_recover_do_reg_buffer(mp, item, bp, buf_f, current_lsn);
> }
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 6a165ca55da1a8..03701409c7dcd6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3334,6 +3334,25 @@ xlog_do_log_recovery(
> return error;
> }
>
> +int
> +xlog_recover_update_agcount(
> + struct xfs_mount *mp,
> + struct xfs_dsb *dsb)
> +{
> + xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
> + int error;
> +
> + xfs_sb_from_disk(&mp->m_sb, dsb);
If I'm understanding this correctly, the incore superblock gets updated
every time recovery finds a logged primary superblock buffer now,
instead of once at the end of log recovery, right?
Shouldn't this conversion be done in the caller? Some day we're going
to want to do the same with xfs_initialize_rtgroups(), right?
--D
> + error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> + mp->m_sb.sb_dblocks, &mp->m_maxagi);
> + if (error) {
> + xfs_warn(mp, "Failed recovery per-ag init: %d", error);
> + return error;
> + }
> + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> + return 0;
> +}
> +
> /*
> * Do the actual recovery
> */
> @@ -3346,7 +3365,6 @@ xlog_do_recover(
> struct xfs_mount *mp = log->l_mp;
> struct xfs_buf *bp = mp->m_sb_bp;
> struct xfs_sb *sbp = &mp->m_sb;
> - xfs_agnumber_t old_agcount = sbp->sb_agcount;
> int error;
>
> trace_xfs_log_recover(log, head_blk, tail_blk);
> @@ -3394,13 +3412,6 @@ xlog_do_recover(
> /* re-initialise in-core superblock and geometry structures */
> mp->m_features |= xfs_sb_version_to_features(sbp);
> xfs_reinit_percpu_counters(mp);
> - error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> - sbp->sb_dblocks, &mp->m_maxagi);
> - if (error) {
> - xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> - return error;
> - }
> - mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>
> /* Normal transactions can now occur */
> clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate);
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 4/7] xfs: error out when a superblock buffer update reduces the agcount
2024-09-30 16:41 ` [PATCH 4/7] xfs: error out when a superblock buffer update reduces the agcount Christoph Hellwig
@ 2024-09-30 16:51 ` Darrick J. Wong
2024-10-01 8:47 ` Christoph Hellwig
2024-10-10 14:04 ` Brian Foster
1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-09-30 16:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Mon, Sep 30, 2024 at 06:41:45PM +0200, Christoph Hellwig wrote:
> XFS currently does not support reducing the agcount, so error out if
> a logged sb buffer tries to shrink the agcount.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
/me notes that cem is the release manager now, not chandan. Patches
should go to him.
/me updates his scripts
--D
> ---
> fs/xfs/xfs_log_recover.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 03701409c7dcd6..3b5cd240bb62ef 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3343,6 +3343,10 @@ xlog_recover_update_agcount(
> int error;
>
> xfs_sb_from_disk(&mp->m_sb, dsb);
> + if (mp->m_sb.sb_agcount < old_agcount) {
> + xfs_alert(mp, "Shrinking AG count in log recovery");
> + return -EFSCORRUPTED;
> + }
> error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> mp->m_sb.sb_dblocks, &mp->m_maxagi);
> if (error) {
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 4/7] xfs: error out when a superblock buffer update reduces the agcount
2024-09-30 16:51 ` Darrick J. Wong
@ 2024-10-01 8:47 ` Christoph Hellwig
0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-01 8:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Mon, Sep 30, 2024 at 09:51:16AM -0700, Darrick J. Wong wrote:
> On Mon, Sep 30, 2024 at 06:41:45PM +0200, Christoph Hellwig wrote:
> > XFS currently does not support reducing the agcount, so error out if
> > a logged sb buffer tries to shrink the agcount.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Looks good,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> /me notes that cem is the release manager now, not chandan. Patches
> should go to him.
>
> /me updates his scripts
Yeah, it'll be a while until all old cover letters are updated :)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers
2024-09-30 16:50 ` Darrick J. Wong
@ 2024-10-01 8:49 ` Christoph Hellwig
2024-10-10 16:02 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-01 8:49 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Mon, Sep 30, 2024 at 09:50:19AM -0700, Darrick J. Wong wrote:
> > +int
> > +xlog_recover_update_agcount(
> > + struct xfs_mount *mp,
> > + struct xfs_dsb *dsb)
> > +{
> > + xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
> > + int error;
> > +
> > + xfs_sb_from_disk(&mp->m_sb, dsb);
>
> If I'm understanding this correctly, the incore superblock gets updated
> every time recovery finds a logged primary superblock buffer now,
> instead of once at the end of log recovery, right?
Yes.
> Shouldn't this conversion be done in the caller? Some day we're going
> to want to do the same with xfs_initialize_rtgroups(), right?
Yeah. But the right "fix" for that is probably to just rename
this function :) Probably even for the next repost instead of
waiting for more features.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-09-30 16:41 ` [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
@ 2024-10-10 14:02 ` Brian Foster
2024-10-11 7:53 ` Christoph Hellwig
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:42PM +0200, Christoph Hellwig wrote:
> Currently only the new agcount is passed to xfs_initialize_perag, which
> requires lookups of existing AGs to skip them and complicates error
> handling. Also pass the previous agcount so that the range that
> xfs_initialize_perag operates on is exactly defined. That way the
> extra lookups can be avoided, and error handling can clean up the
> exact range from the old count to the last added perag structure.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/libxfs/xfs_ag.c | 29 ++++++++---------------------
> fs/xfs/libxfs/xfs_ag.h | 5 +++--
> fs/xfs/xfs_fsops.c | 18 ++++++++----------
> fs/xfs/xfs_log_recover.c | 5 +++--
> fs/xfs/xfs_mount.c | 4 ++--
> 5 files changed, 24 insertions(+), 37 deletions(-)
>
...
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ec766b4bc8537b..6a165ca55da1a8 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3346,6 +3346,7 @@ xlog_do_recover(
> struct xfs_mount *mp = log->l_mp;
> struct xfs_buf *bp = mp->m_sb_bp;
> struct xfs_sb *sbp = &mp->m_sb;
> + xfs_agnumber_t old_agcount = sbp->sb_agcount;
> int error;
>
> trace_xfs_log_recover(log, head_blk, tail_blk);
> @@ -3393,8 +3394,8 @@ xlog_do_recover(
> /* re-initialise in-core superblock and geometry structures */
> mp->m_features |= xfs_sb_version_to_features(sbp);
> xfs_reinit_percpu_counters(mp);
> - error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
> - &mp->m_maxagi);
> + error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> + sbp->sb_dblocks, &mp->m_maxagi);
I assume this is because the superblock can change across recovery, but
code wise this seems kind of easy to misread into thinking the variable
is the same. I think the whole old/new terminology is kind of clunky for
an interface that is not just for growfs. Maybe it would be more clear
to use start/end terminology for xfs_initialize_perag(), then it's more
straightforward that mount would init the full range whereas growfs
inits a subrange.
A one-liner comment or s/old_agcount/orig_agcount/ wouldn't hurt here
either. Actually if that's the only purpose for this call and if you
already have to sample sb_agcount, maybe just lifting/copying the if
(old_agcount >= new_agcount) check into the caller would make the logic
more self-explanatory. Hm?
Otherwise the logic changes look Ok to me functionally.
Brian
> if (error) {
> xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> return error;
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 1fdd79c5bfa04e..6fa7239a4a01b6 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -810,8 +810,8 @@ xfs_mountfs(
> /*
> * Allocate and initialize the per-ag data.
> */
> - error = xfs_initialize_perag(mp, sbp->sb_agcount, mp->m_sb.sb_dblocks,
> - &mp->m_maxagi);
> + error = xfs_initialize_perag(mp, 0, sbp->sb_agcount,
> + mp->m_sb.sb_dblocks, &mp->m_maxagi);
> if (error) {
> xfs_warn(mp, "Failed per-ag init: %d", error);
> goto out_free_dir;
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/7] xfs: merge the perag freeing helpers
2024-09-30 16:41 ` [PATCH 2/7] xfs: merge the perag freeing helpers Christoph Hellwig
@ 2024-10-10 14:02 ` Brian Foster
0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:43PM +0200, Christoph Hellwig wrote:
> There is no good reason to have two different routines for freeing perag
> structures for the unmount and error cases. Add two arguments to specify
> the range of AGs to free to xfs_free_perag, and use that to replace
> xfs_free_unused_perag_range.
>
> The additional RCU grace period for the error case is harmless, and the
> extra check for the AG to actually exist is not required now that the
> callers pass the exact known allocated range.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/libxfs/xfs_ag.c | 40 ++++++++++------------------------------
> fs/xfs/libxfs/xfs_ag.h | 5 ++---
> fs/xfs/xfs_fsops.c | 2 +-
> fs/xfs/xfs_mount.c | 5 ++---
> 4 files changed, 15 insertions(+), 37 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 652376aa52e990..8fac0ce45b1559 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -185,17 +185,20 @@ xfs_initialize_perag_data(
> }
>
> /*
> - * Free up the per-ag resources associated with the mount structure.
> + * Free up the per-ag resources within the specified AG range.
> */
> void
> -xfs_free_perag(
> - struct xfs_mount *mp)
> +xfs_free_perag_range(
> + struct xfs_mount *mp,
> + xfs_agnumber_t first_agno,
> + xfs_agnumber_t end_agno)
> +
> {
> - struct xfs_perag *pag;
> xfs_agnumber_t agno;
>
> - for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
> - pag = xa_erase(&mp->m_perags, agno);
> + for (agno = first_agno; agno < end_agno; agno++) {
> + struct xfs_perag *pag = xa_erase(&mp->m_perags, agno);
> +
> ASSERT(pag);
> XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
> xfs_defer_drain_free(&pag->pag_intents_drain);
> @@ -270,29 +273,6 @@ xfs_agino_range(
> return __xfs_agino_range(mp, xfs_ag_block_count(mp, agno), first, last);
> }
>
> -/*
> - * Free perag within the specified AG range, it is only used to free unused
> - * perags under the error handling path.
> - */
> -void
> -xfs_free_unused_perag_range(
> - struct xfs_mount *mp,
> - xfs_agnumber_t agstart,
> - xfs_agnumber_t agend)
> -{
> - struct xfs_perag *pag;
> - xfs_agnumber_t index;
> -
> - for (index = agstart; index < agend; index++) {
> - pag = xa_erase(&mp->m_perags, index);
> - if (!pag)
> - break;
> - xfs_buf_cache_destroy(&pag->pag_bcache);
> - xfs_defer_drain_free(&pag->pag_intents_drain);
> - kfree(pag);
> - }
> -}
> -
> int
> xfs_initialize_perag(
> struct xfs_mount *mp,
> @@ -369,7 +349,7 @@ xfs_initialize_perag(
> out_free_pag:
> kfree(pag);
> out_unwind_new_pags:
> - xfs_free_unused_perag_range(mp, old_agcount, index);
> + xfs_free_perag_range(mp, old_agcount, index);
> return error;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> index 69fc31e7b84728..6e68d6a3161a0f 100644
> --- a/fs/xfs/libxfs/xfs_ag.h
> +++ b/fs/xfs/libxfs/xfs_ag.h
> @@ -144,13 +144,12 @@ __XFS_AG_OPSTATE(prefers_metadata, PREFERS_METADATA)
> __XFS_AG_OPSTATE(allows_inodes, ALLOWS_INODES)
> __XFS_AG_OPSTATE(agfl_needs_reset, AGFL_NEEDS_RESET)
>
> -void xfs_free_unused_perag_range(struct xfs_mount *mp, xfs_agnumber_t agstart,
> - xfs_agnumber_t agend);
> int xfs_initialize_perag(struct xfs_mount *mp, xfs_agnumber_t old_agcount,
> xfs_agnumber_t agcount, xfs_rfsblock_t dcount,
> xfs_agnumber_t *maxagi);
> +void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
> + xfs_agnumber_t end_agno);
> int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
> -void xfs_free_perag(struct xfs_mount *mp);
>
> /* Passive AG references */
> struct xfs_perag *xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno);
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index de2bf0594cb474..b247d895c276d2 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -229,7 +229,7 @@ xfs_growfs_data_private(
> xfs_trans_cancel(tp);
> out_free_unused_perag:
> if (nagcount > oagcount)
> - xfs_free_unused_perag_range(mp, oagcount, nagcount);
> + xfs_free_perag_range(mp, oagcount, nagcount);
> return error;
> }
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 6fa7239a4a01b6..25bbcc3f4ee08b 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1048,7 +1048,7 @@ xfs_mountfs(
> xfs_buftarg_drain(mp->m_logdev_targp);
> xfs_buftarg_drain(mp->m_ddev_targp);
> out_free_perag:
> - xfs_free_perag(mp);
> + xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
> out_free_dir:
> xfs_da_unmount(mp);
> out_remove_uuid:
> @@ -1129,8 +1129,7 @@ xfs_unmountfs(
> xfs_errortag_clearall(mp);
> #endif
> shrinker_free(mp->m_inodegc_shrinker);
> - xfs_free_perag(mp);
> -
> + xfs_free_perag_range(mp, 0, mp->m_sb.sb_agcount);
> xfs_errortag_del(mp);
> xfs_error_sysfs_del(mp);
> xchk_stats_unregister(mp->m_scrub_stats);
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers
2024-09-30 16:41 ` [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers Christoph Hellwig
2024-09-30 16:50 ` Darrick J. Wong
@ 2024-10-10 14:03 ` Brian Foster
1 sibling, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:03 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:44PM +0200, Christoph Hellwig wrote:
> Primary superblock buffers that change the file system geometry after a
> growfs operation can affect the operation of later CIL checkpoints that
> make use of the newly added space and allocation groups.
>
> Apply the changes to the in-memory structures as part of recovery pass 2,
> to ensure recovery works fine for such cases.
>
> In the future we should apply the logic to other updates such as feature
> bits as well.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_log_recover.h | 2 ++
> fs/xfs/xfs_buf_item_recover.c | 27 +++++++++++++++++++++++++++
> fs/xfs/xfs_log_recover.c | 27 +++++++++++++++++++--------
> 3 files changed, 48 insertions(+), 8 deletions(-)
>
...
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 6a165ca55da1a8..03701409c7dcd6 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3334,6 +3334,25 @@ xlog_do_log_recovery(
> return error;
> }
>
> +int
> +xlog_recover_update_agcount(
> + struct xfs_mount *mp,
> + struct xfs_dsb *dsb)
> +{
> + xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
> + int error;
> +
> + xfs_sb_from_disk(&mp->m_sb, dsb);
> + error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> + mp->m_sb.sb_dblocks, &mp->m_maxagi);
> + if (error) {
> + xfs_warn(mp, "Failed recovery per-ag init: %d", error);
> + return error;
> + }
> + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
Re: my comments on patch 1, it looks like this also doesn't need to
change unless the superblock update actually changed the AG count.
Otherwise seems Ok in terms of the context change.
Brian
> + return 0;
> +}
> +
> /*
> * Do the actual recovery
> */
> @@ -3346,7 +3365,6 @@ xlog_do_recover(
> struct xfs_mount *mp = log->l_mp;
> struct xfs_buf *bp = mp->m_sb_bp;
> struct xfs_sb *sbp = &mp->m_sb;
> - xfs_agnumber_t old_agcount = sbp->sb_agcount;
> int error;
>
> trace_xfs_log_recover(log, head_blk, tail_blk);
> @@ -3394,13 +3412,6 @@ xlog_do_recover(
> /* re-initialise in-core superblock and geometry structures */
> mp->m_features |= xfs_sb_version_to_features(sbp);
> xfs_reinit_percpu_counters(mp);
> - error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> - sbp->sb_dblocks, &mp->m_maxagi);
> - if (error) {
> - xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
> - return error;
> - }
> - mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
>
> /* Normal transactions can now occur */
> clear_bit(XLOG_ACTIVE_RECOVERY, &log->l_opstate);
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount
2024-09-30 16:41 ` [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount Christoph Hellwig
2024-09-30 16:51 ` Darrick J. Wong
@ 2024-10-10 14:04 ` Brian Foster
1 sibling, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:45PM +0200, Christoph Hellwig wrote:
> XFS currently does not support reducing the agcount, so error out if
> a logged sb buffer tries to shrink the agcount.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_log_recover.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 03701409c7dcd6..3b5cd240bb62ef 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3343,6 +3343,10 @@ xlog_recover_update_agcount(
> int error;
>
> xfs_sb_from_disk(&mp->m_sb, dsb);
> + if (mp->m_sb.sb_agcount < old_agcount) {
> + xfs_alert(mp, "Shrinking AG count in log recovery");
> + return -EFSCORRUPTED;
> + }
> error = xfs_initialize_perag(mp, old_agcount, mp->m_sb.sb_agcount,
> mp->m_sb.sb_dblocks, &mp->m_maxagi);
> if (error) {
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
2024-09-30 16:41 ` [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag Christoph Hellwig
@ 2024-10-10 14:04 ` Brian Foster
0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:04 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:46PM +0200, Christoph Hellwig wrote:
> __GFP_RETRY_MAYFAIL increases the likelihood of allocations failing,
> which isn't really helpful during log recovery. Remove the flag and
> stick to the default GFP_KERNEL policies.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/libxfs/xfs_ag.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> index 8fac0ce45b1559..29feaed7c8f880 100644
> --- a/fs/xfs/libxfs/xfs_ag.c
> +++ b/fs/xfs/libxfs/xfs_ag.c
> @@ -289,7 +289,7 @@ xfs_initialize_perag(
> return 0;
>
> for (index = old_agcount; index < new_agcount; index++) {
> - pag = kzalloc(sizeof(*pag), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
> + pag = kzalloc(sizeof(*pag), GFP_KERNEL);
> if (!pag) {
> error = -ENOMEM;
> goto out_unwind_new_pags;
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-09-30 16:41 ` [PATCH 6/7] xfs: don't update file system geometry through transaction deltas Christoph Hellwig
@ 2024-10-10 14:05 ` Brian Foster
2024-10-11 7:57 ` Christoph Hellwig
2024-10-10 19:01 ` Darrick J. Wong
1 sibling, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to
> stable store before the allocator can see them, so that they do not end
> up in the same CIL checkpoint as transactions that make use of this new
> information, which would make recovery impossible or broken.
>
Ok, so we don't want geometry changes transactions in the same CIL
checkpoint as alloc related transactions that might depend on the
geometry changes. That seems reasonable and on a first pass I have an
idea of what this is doing, but the description is kind of vague.
Obviously this fixes an issue on the recovery side (since I've tested
it), but it's not quite clear to me from the description and/or logic
changes how that issue manifests.
Could you elaborate please? For example, is this some kind of race
situation between an allocation request and a growfs transaction, where
the former perhaps sees a newly added AG between the time the growfs
transaction commits (applying the sb deltas) and it actually commits to
the log due to being a sync transaction, thus allowing an alloc on a new
AG into the same checkpoint that adds the AG?
Is there any ordering issue on the recovery side, or is it mainly that
we don't init the in-core perags until after recovery, so log recovery
of grow operations is basically broken unless we don't reference the new
AGs in the log before the geometry update pushes out of the active
log..? I'm trying to grok how much of this patch is fixing a
reproducible bug vs. supporting what the earlier patches are doing on
the recovery side.
I also wonder if/how this might fundamentally apply to shrink, but maybe
that's getting too far into the weeds. IIRC we don't currently support
AG shrink, but can lop off the end of the tail AG if space happens to be
free. So for example, is there any issue where that tail AG space is
freed and immediately shrunk off of the fs? Hm.. maybe the hacky growfs
test should try and include some shrink operations as well.
Anyways, context on the specific problem would make this easier to
review and IMO, should be included in the commit log anyways for
historical reference if you're going to change how superblock fields are
logged. Just my .02.
Brian
> To do this add two new helpers to prepare a superblock for direct
> manipulation of the on-disk buffer, and to commit these updates while
> holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
> those in growfs instead of applying the changes through the deltas in the
> xfs_trans structure (which also happens to shrink the xfs_trans structure
> a fair bit).
>
> The rtbitmap repair code was also using the transaction deltas and is
> converted to also update the superblock buffer directly under the buffer
> lock.
>
> This new method establishes a locking protocol where even in-core
> superblock fields must only be updated with the superblock buffer
> locked. For now it is only applied to affected geometry fields,
> but in the future it would make sense to apply it universally.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_sb.c | 97 ++++++++++++++++++++++++-------
> fs/xfs/libxfs/xfs_sb.h | 3 +
> fs/xfs/libxfs/xfs_shared.h | 8 ---
> fs/xfs/scrub/rtbitmap_repair.c | 26 +++++----
> fs/xfs/xfs_fsops.c | 80 ++++++++++++++++----------
> fs/xfs/xfs_rtalloc.c | 92 +++++++++++++++++-------------
> fs/xfs/xfs_trans.c | 101 ++-------------------------------
> fs/xfs/xfs_trans.h | 8 ---
> 8 files changed, 198 insertions(+), 217 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba667..2c83ab7441ade5 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1025,6 +1025,80 @@ xfs_sb_mount_common(
> mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> }
>
> +/*
> + * Mirror the lazy sb counters to the in-core superblock.
> + *
> + * If this is at unmount, the counters will be exactly correct, but at any other
> + * time they will only be ballpark correct because of reservations that have
> + * been taken out of the percpu counters. If we have an unclean shutdown, this will be
> + * corrected by log recovery rebuilding the counters from the AGF block counts.
> + *
> + * Do not update sb_frextents here because it is not part of the lazy sb
> + * counters, despite having a percpu counter. It is always kept consistent with
> + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't need
> + * have to update it here.
> + */
> +static void
> +xfs_flush_sb_counters(
> + struct xfs_mount *mp)
> +{
> + if (xfs_has_lazysbcount(mp)) {
> + mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> + mp->m_sb.sb_ifree = min_t(uint64_t,
> + percpu_counter_sum_positive(&mp->m_ifree),
> + mp->m_sb.sb_icount);
> + mp->m_sb.sb_fdblocks =
> + percpu_counter_sum_positive(&mp->m_fdblocks);
> + }
> +}
> +
> +/*
> + * Prepare a direct update to the superblock through the on-disk buffer.
> + *
> + * This locks out other modifications through the buffer lock and then syncs all
> + * in-core values to the on-disk buffer (including the percpu counters).
> + *
> + * The caller then directly manipulates the on-disk fields and calls
> + * xfs_commit_sb_update to commit the updates to disk. The caller is
> + * responsible for also updating the in-core fields, but it can do so after
> + * the transaction has been committed to disk.
> + *
> + * Updating the in-core field only after xfs_commit_sb_update ensures that other
> + * processes only see the update once it is stable on disk, and is usually the
> + * right thing to do for superblock updates.
> + *
> + * Note that writes to superblock fields updated using this helper are
> + * synchronized using the superblock buffer lock, which must be taken around
> + * all updates to the in-core fields as well.
> + */
> +struct xfs_dsb *
> +xfs_prepare_sb_update(
> + struct xfs_trans *tp,
> + struct xfs_buf **bpp)
> +{
> + *bpp = xfs_trans_getsb(tp);
> + xfs_flush_sb_counters(tp->t_mountp);
> + xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
> + return (*bpp)->b_addr;
> +}
> +
> +/*
> + * Commit a direct update to the on-disk superblock. Keeps @bp locked and
> + * referenced, so the caller must call xfs_buf_relse() manually.
> + */
> +int
> +xfs_commit_sb_update(
> + struct xfs_trans *tp,
> + struct xfs_buf *bp)
> +{
> + xfs_trans_bhold(tp, bp);
> + xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> + xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +
> + xfs_trans_set_sync(tp);
> + return xfs_trans_commit(tp);
> +}
> +
> /*
> * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
> * into the superblock buffer to be logged. It does not provide the higher
> @@ -1038,28 +1112,7 @@ xfs_log_sb(
> struct xfs_mount *mp = tp->t_mountp;
> struct xfs_buf *bp = xfs_trans_getsb(tp);
>
> - /*
> - * Lazy sb counters don't update the in-core superblock so do that now.
> - * If this is at unmount, the counters will be exactly correct, but at
> - * any other time they will only be ballpark correct because of
> - * reservations that have been taken out percpu counters. If we have an
> - * unclean shutdown, this will be corrected by log recovery rebuilding
> - * the counters from the AGF block counts.
> - *
> - * Do not update sb_frextents here because it is not part of the lazy
> - * sb counters, despite having a percpu counter. It is always kept
> - * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
> - * and hence we don't need have to update it here.
> - */
> - if (xfs_has_lazysbcount(mp)) {
> - mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> - mp->m_sb.sb_ifree = min_t(uint64_t,
> - percpu_counter_sum_positive(&mp->m_ifree),
> - mp->m_sb.sb_icount);
> - mp->m_sb.sb_fdblocks =
> - percpu_counter_sum_positive(&mp->m_fdblocks);
> - }
> -
> + xfs_flush_sb_counters(mp);
> xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 885c837559914d..3649d071687e33 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -13,6 +13,9 @@ struct xfs_trans;
> struct xfs_fsop_geom;
> struct xfs_perag;
>
> +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
> + struct xfs_buf **bpp);
> +int xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
> extern void xfs_log_sb(struct xfs_trans *tp);
> extern int xfs_sync_sb(struct xfs_mount *mp, bool wait);
> extern int xfs_sync_sb_buf(struct xfs_mount *mp);
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 33b84a3a83ff63..45a32ea426164a 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -149,14 +149,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
> #define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008
> #define XFS_TRANS_SB_FREXTENTS 0x00000010
> #define XFS_TRANS_SB_RES_FREXTENTS 0x00000020
> -#define XFS_TRANS_SB_DBLOCKS 0x00000040
> -#define XFS_TRANS_SB_AGCOUNT 0x00000080
> -#define XFS_TRANS_SB_IMAXPCT 0x00000100
> -#define XFS_TRANS_SB_REXTSIZE 0x00000200
> -#define XFS_TRANS_SB_RBMBLOCKS 0x00000400
> -#define XFS_TRANS_SB_RBLOCKS 0x00000800
> -#define XFS_TRANS_SB_REXTENTS 0x00001000
> -#define XFS_TRANS_SB_REXTSLOG 0x00002000
>
> /*
> * Here we centralize the specification of XFS meta-data buffer reference count
> diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
> index 0fef98e9f83409..be9d31f032b1bf 100644
> --- a/fs/xfs/scrub/rtbitmap_repair.c
> +++ b/fs/xfs/scrub/rtbitmap_repair.c
> @@ -16,6 +16,7 @@
> #include "xfs_bit.h"
> #include "xfs_bmap.h"
> #include "xfs_bmap_btree.h"
> +#include "xfs_sb.h"
> #include "scrub/scrub.h"
> #include "scrub/common.h"
> #include "scrub/trace.h"
> @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry(
> struct xchk_rtbitmap *rtb)
> {
> struct xfs_mount *mp = sc->mp;
> - struct xfs_trans *tp = sc->tp;
>
> /* Superblock fields */
> - if (mp->m_sb.sb_rextents != rtb->rextents)
> - xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
> - rtb->rextents - mp->m_sb.sb_rextents);
> -
> - if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
> - rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
> -
> - if (mp->m_sb.sb_rextslog != rtb->rextslog)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
> - rtb->rextslog - mp->m_sb.sb_rextslog);
> + if (mp->m_sb.sb_rextents != rtb->rextents ||
> + mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
> + mp->m_sb.sb_rextslog != rtb->rextslog) {
> + struct xfs_buf *bp = xfs_trans_getsb(sc->tp);
> +
> + mp->m_sb.sb_rextents = rtb->rextents;
> + mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
> + mp->m_sb.sb_rextslog = rtb->rextslog;
> + xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> +
> + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
> + xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> + }
>
> /* Fix broken isize */
> sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b247d895c276d2..4168ccf21068cb 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
> return error;
> }
>
> +static int
> +xfs_growfs_data_update_sb(
> + struct xfs_trans *tp,
> + xfs_agnumber_t nagcount,
> + xfs_rfsblock_t nb,
> + xfs_agnumber_t nagimax)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> + int error;
> +
> + /*
> + * Update the geometry in the on-disk superblock first, and ensure
> + * they make it to disk before the superblock can be relogged.
> + */
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_agcount = cpu_to_be32(nagcount);
> + sbp->sb_dblocks = cpu_to_be64(nb);
> + error = xfs_commit_sb_update(tp, bp);
> + if (error)
> + goto out_unlock;
> +
> + /*
> + * Propagate the new values to the live mount structure after they made
> + * it to disk with the superblock buffer still locked.
> + */
> + mp->m_sb.sb_agcount = nagcount;
> + mp->m_sb.sb_dblocks = nb;
> +
> + if (nagimax)
> + mp->m_maxagi = nagimax;
> + xfs_set_low_space_thresholds(mp);
> + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +
> +out_unlock:
> + xfs_buf_relse(bp);
> + return error;
> +}
> +
> /*
> * growfs operations
> */
> @@ -171,37 +211,13 @@ xfs_growfs_data_private(
> if (error)
> goto out_trans_cancel;
>
> - /*
> - * Update changed superblock fields transactionally. These are not
> - * seen by the rest of the world until the transaction commit applies
> - * them atomically to the superblock.
> - */
> - if (nagcount > oagcount)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> - if (delta)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
> if (id.nfree)
> xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
>
> - /*
> - * Sync sb counters now to reflect the updated values. This is
> - * particularly important for shrink because the write verifier
> - * will fail if sb_fdblocks is ever larger than sb_dblocks.
> - */
> - if (xfs_has_lazysbcount(mp))
> - xfs_log_sb(tp);
> -
> - xfs_trans_set_sync(tp);
> - error = xfs_trans_commit(tp);
> + error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
> if (error)
> return error;
>
> - /* New allocation groups fully initialized, so update mount struct */
> - if (nagimax)
> - mp->m_maxagi = nagimax;
> - xfs_set_low_space_thresholds(mp);
> - mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> -
> if (delta > 0) {
> /*
> * If we expanded the last AG, free the per-AG reservation
> @@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
> struct xfs_mount *mp,
> __u32 imaxpct)
> {
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> struct xfs_trans *tp;
> - int dpct;
> int error;
>
> if (imaxpct > 100)
> @@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
> if (error)
> return error;
>
> - dpct = imaxpct - mp->m_sb.sb_imax_pct;
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> - xfs_trans_set_sync(tp);
> - return xfs_trans_commit(tp);
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_imax_pct = imaxpct;
> + error = xfs_commit_sb_update(tp, bp);
> + if (!error)
> + mp->m_sb.sb_imax_pct = imaxpct;
> + xfs_buf_relse(bp);
> + return error;
> }
>
> /*
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3a2005a1e673dc..994e5efedab20f 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
> return error;
> }
>
> +static int
> +xfs_growfs_rt_update_sb(
> + struct xfs_trans *tp,
> + struct xfs_mount *mp,
> + struct xfs_mount *nmp,
> + xfs_rtbxlen_t freed_rtx)
> +{
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> + int error;
> +
> + /*
> + * Update the geometry in the on-disk superblock first, and ensure
> + * they make it to disk before the superblock can be relogged.
> + */
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
> + sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
> + sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
> + sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
> + sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
> + error = xfs_commit_sb_update(tp, bp);
> + if (error)
> + return error;
> +
> + /*
> + * Propagate the new values to the live mount structure after they made
> + * it to disk with the superblock buffer still locked.
> + */
> + mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
> + mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
> + mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
> + mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
> + mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
> + mp->m_rsumlevels = nmp->m_rsumlevels;
> + mp->m_rsumblocks = nmp->m_rsumblocks;
> +
> + /*
> + * Recompute the growfsrt reservation from the new rsumsize.
> + */
> + xfs_trans_resv_calc(mp, &mp->m_resv);
> +
> + /*
> + * Ensure the mount RT feature flag is now set.
> + */
> + mp->m_features |= XFS_FEAT_REALTIME;
> + xfs_buf_relse(bp);
> + return 0;
> +}
> +
> static int
> xfs_growfs_rt_bmblock(
> struct xfs_mount *mp,
> @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
> goto out_cancel;
> }
>
> - /*
> - * Update superblock fields.
> - */
> - if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
> - nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
> - if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
> - nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
> - if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
> - nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
> - if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
> - nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
> - if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
> - nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
> -
> /*
> * Free the new extent.
> */
> @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
> xfs_rtbuf_cache_relse(&nargs);
> if (error)
> goto out_cancel;
> -
> - /*
> - * Mark more blocks free in the superblock.
> - */
> xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
>
> - /*
> - * Update the calculated values in the real mount structure.
> - */
> - mp->m_rsumlevels = nmp->m_rsumlevels;
> - mp->m_rsumblocks = nmp->m_rsumblocks;
> - xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
> -
> - /*
> - * Recompute the growfsrt reservation from the new rsumsize.
> - */
> - xfs_trans_resv_calc(mp, &mp->m_resv);
> -
> - error = xfs_trans_commit(args.tp);
> + error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
> if (error)
> goto out_free;
>
> - /*
> - * Ensure the mount RT feature flag is now set.
> - */
> - mp->m_features |= XFS_FEAT_REALTIME;
> -
> kfree(nmp);
> return 0;
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bdf3704dc30118..56505cb94f877d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -430,31 +430,6 @@ xfs_trans_mod_sb(
> ASSERT(delta < 0);
> tp->t_res_frextents_delta += delta;
> break;
> - case XFS_TRANS_SB_DBLOCKS:
> - tp->t_dblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_AGCOUNT:
> - ASSERT(delta > 0);
> - tp->t_agcount_delta += delta;
> - break;
> - case XFS_TRANS_SB_IMAXPCT:
> - tp->t_imaxpct_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTSIZE:
> - tp->t_rextsize_delta += delta;
> - break;
> - case XFS_TRANS_SB_RBMBLOCKS:
> - tp->t_rbmblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_RBLOCKS:
> - tp->t_rblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTENTS:
> - tp->t_rextents_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTSLOG:
> - tp->t_rextslog_delta += delta;
> - break;
> default:
> ASSERT(0);
> return;
> @@ -475,12 +450,8 @@ STATIC void
> xfs_trans_apply_sb_deltas(
> xfs_trans_t *tp)
> {
> - struct xfs_dsb *sbp;
> - struct xfs_buf *bp;
> - int whole = 0;
> -
> - bp = xfs_trans_getsb(tp);
> - sbp = bp->b_addr;
> + struct xfs_buf *bp = xfs_trans_getsb(tp);
> + struct xfs_dsb *sbp = bp->b_addr;
>
> /*
> * Only update the superblock counters if we are logging them
> @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
> spin_unlock(&mp->m_sb_lock);
> }
>
> - if (tp->t_dblocks_delta) {
> - be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_agcount_delta) {
> - be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
> - whole = 1;
> - }
> - if (tp->t_imaxpct_delta) {
> - sbp->sb_imax_pct += tp->t_imaxpct_delta;
> - whole = 1;
> - }
> - if (tp->t_rextsize_delta) {
> - be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
> - whole = 1;
> - }
> - if (tp->t_rbmblocks_delta) {
> - be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_rblocks_delta) {
> - be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_rextents_delta) {
> - be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
> - whole = 1;
> - }
> - if (tp->t_rextslog_delta) {
> - sbp->sb_rextslog += tp->t_rextslog_delta;
> - whole = 1;
> - }
> -
> xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> - if (whole)
> - /*
> - * Log the whole thing, the fields are noncontiguous.
> - */
> - xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> - else
> - /*
> - * Since all the modifiable fields are contiguous, we
> - * can get away with this.
> - */
> - xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> - offsetof(struct xfs_dsb, sb_frextents) +
> - sizeof(sbp->sb_frextents) - 1);
> + xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> + offsetof(struct xfs_dsb, sb_frextents) +
> + sizeof(sbp->sb_frextents) - 1);
> }
>
> /*
> @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
> * must be consistent with the ondisk rtbitmap and must never include
> * incore reservations.
> */
> - mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
> - mp->m_sb.sb_agcount += tp->t_agcount_delta;
> - mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
> - mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
> - if (tp->t_rextsize_delta) {
> - mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
> - mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
> - }
> - mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
> - mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
> - mp->m_sb.sb_rextents += tp->t_rextents_delta;
> - mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
> spin_unlock(&mp->m_sb_lock);
> -
> - /*
> - * Debug checks outside of the spinlock so they don't lock up the
> - * machine if they fail.
> - */
> - ASSERT(mp->m_sb.sb_imax_pct >= 0);
> - ASSERT(mp->m_sb.sb_rextslog >= 0);
> }
>
> /* Add the given log item to the transaction's list of log items. */
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665ad..e5911cf09be444 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -140,14 +140,6 @@ typedef struct xfs_trans {
> int64_t t_res_fdblocks_delta; /* on-disk only chg */
> int64_t t_frextents_delta;/* superblock freextents chg*/
> int64_t t_res_frextents_delta; /* on-disk only chg */
> - int64_t t_dblocks_delta;/* superblock dblocks change */
> - int64_t t_agcount_delta;/* superblock agcount change */
> - int64_t t_imaxpct_delta;/* superblock imaxpct change */
> - int64_t t_rextsize_delta;/* superblock rextsize chg */
> - int64_t t_rbmblocks_delta;/* superblock rbmblocks chg */
> - int64_t t_rblocks_delta;/* superblock rblocks change */
> - int64_t t_rextents_delta;/* superblocks rextents chg */
> - int64_t t_rextslog_delta;/* superblocks rextslog chg */
> struct list_head t_items; /* log item descriptors */
> struct list_head t_busy; /* list of busy extents */
> struct list_head t_dfops; /* deferred operations */
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
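For readers following along, the ordering this patch establishes (write the new geometry to the on-disk superblock synchronously first, then update the in-core fields while the superblock buffer is still locked) can be modeled with a small self-contained sketch. All of the toy_* names below are hypothetical stand-ins for illustration, not the kernel API, and the synchronous log write is only simulated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the mount and on-disk superblock (not the kernel types). */
struct toy_dsb { uint64_t sb_dblocks; uint32_t sb_agcount; };
struct toy_mount {
	struct toy_dsb m_sb;	/* in-core superblock copy */
	struct toy_dsb disk;	/* pretend on-disk buffer contents */
	bool buf_locked;	/* models the superblock buffer lock */
};

/* Step 1: lock the buffer and sync the in-core values into it. */
static struct toy_dsb *toy_prepare_sb_update(struct toy_mount *mp)
{
	mp->buf_locked = true;
	mp->disk = mp->m_sb;	/* plays the role of xfs_sb_to_disk() */
	return &mp->disk;
}

/* Step 2: "commit" the on-disk buffer synchronously; keep it locked. */
static int toy_commit_sb_update(struct toy_mount *mp)
{
	assert(mp->buf_locked);	/* caller still owns the buffer */
	/* a real implementation would issue a synchronous log write here */
	return 0;
}

/* Grow: on-disk first, then in-core, all under the buffer lock. */
static int toy_growfs(struct toy_mount *mp, uint32_t nagcount, uint64_t nb)
{
	struct toy_dsb *sbp = toy_prepare_sb_update(mp);

	sbp->sb_agcount = nagcount;
	sbp->sb_dblocks = nb;
	if (toy_commit_sb_update(mp))
		return -1;
	/* only now do other threads get to see the new geometry */
	mp->m_sb.sb_agcount = nagcount;
	mp->m_sb.sb_dblocks = nb;
	mp->buf_locked = false;	/* plays the role of xfs_buf_relse() */
	return 0;
}
```

The point of the ordering is visible in toy_growfs(): the in-core fields are written only after the on-disk commit has returned, so a reader of the live mount structure can never observe geometry that is not yet stable on disk.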
* Re: [PATCH 7/7] xfs: split xfs_trans_mod_sb
2024-09-30 16:41 ` [PATCH 7/7] xfs: split xfs_trans_mod_sb Christoph Hellwig
@ 2024-10-10 14:06 ` Brian Foster
2024-10-11 7:54 ` Christoph Hellwig
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-10 14:06 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Mon, Sep 30, 2024 at 06:41:48PM +0200, Christoph Hellwig wrote:
> Split xfs_trans_mod_sb into separate helpers for the different counts.
> While the icount and ifree counters get their own helpers, the handling
> for fdblocks and frextents merges the delalloc and non-delalloc cases
> to keep the related code together.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
Seems Ok, but not sure I see the point personally. Rather than a single
helper with flags, we have multiple helpers, some of which still mix
deltas via an incrementally harder to read boolean param. This seems
sort of arbitrary to me. Is this to support some future work?
Brian
> fs/xfs/libxfs/xfs_ag_resv.c | 18 +++--
> fs/xfs/libxfs/xfs_ialloc.c | 14 ++--
> fs/xfs/libxfs/xfs_rtbitmap.c | 3 +-
> fs/xfs/libxfs/xfs_shared.h | 10 ---
> fs/xfs/xfs_fsops.c | 2 +-
> fs/xfs/xfs_rtalloc.c | 6 +-
> fs/xfs/xfs_trans.c | 130 +++++++++++++++--------------------
> fs/xfs/xfs_trans.h | 7 +-
> fs/xfs/xfs_trans_dquot.c | 2 +-
> 9 files changed, 82 insertions(+), 110 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 216423df939e5c..bb518d6a2dcecd 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -341,7 +341,6 @@ xfs_ag_resv_alloc_extent(
> {
> struct xfs_ag_resv *resv;
> xfs_extlen_t len;
> - uint field;
>
> trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
>
> @@ -356,9 +355,8 @@ xfs_ag_resv_alloc_extent(
> ASSERT(0);
> fallthrough;
> case XFS_AG_RESV_NONE:
> - field = args->wasdel ? XFS_TRANS_SB_RES_FDBLOCKS :
> - XFS_TRANS_SB_FDBLOCKS;
> - xfs_trans_mod_sb(args->tp, field, -(int64_t)args->len);
> + xfs_trans_mod_fdblocks(args->tp, -(int64_t)args->len,
> + args->wasdel);
> return;
> }
>
> @@ -367,11 +365,11 @@ xfs_ag_resv_alloc_extent(
> if (type == XFS_AG_RESV_RMAPBT)
> return;
> /* Allocations of reserved blocks only need on-disk sb updates... */
> - xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len);
> + xfs_trans_mod_fdblocks(args->tp, -(int64_t)len, true);
> /* ...but non-reserved blocks need in-core and on-disk updates. */
> if (args->len > len)
> - xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_FDBLOCKS,
> - -((int64_t)args->len - len));
> + xfs_trans_mod_fdblocks(args->tp, -((int64_t)args->len - len),
> + false);
> }
>
> /* Free a block to the reservation. */
> @@ -398,7 +396,7 @@ xfs_ag_resv_free_extent(
> ASSERT(0);
> fallthrough;
> case XFS_AG_RESV_NONE:
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (int64_t)len);
> + xfs_trans_mod_fdblocks(tp, (int64_t)len, false);
> fallthrough;
> case XFS_AG_RESV_IGNORE:
> return;
> @@ -409,8 +407,8 @@ xfs_ag_resv_free_extent(
> if (type == XFS_AG_RESV_RMAPBT)
> return;
> /* Freeing into the reserved pool only requires on-disk update... */
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
> + xfs_trans_mod_fdblocks(tp, len, true);
> /* ...but freeing beyond that requires in-core and on-disk update. */
> if (len > leftover)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, len - leftover);
> + xfs_trans_mod_fdblocks(tp, len - leftover, false);
> }
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 271855227514cb..ad28823debb6f1 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -970,8 +970,8 @@ xfs_ialloc_ag_alloc(
> /*
> * Modify/log superblock values for inode count and inode free count.
> */
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, (long)newlen);
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, (long)newlen);
> + xfs_trans_mod_icount(tp, (long)newlen);
> + xfs_trans_mod_ifree(tp, (long)newlen);
> return 0;
> }
>
> @@ -1357,7 +1357,7 @@ xfs_dialloc_ag_inobt(
> goto error0;
>
> xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> + xfs_trans_mod_ifree(tp, -1);
> *inop = ino;
> return 0;
> error1:
> @@ -1660,7 +1660,7 @@ xfs_dialloc_ag(
> xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> pag->pagi_freecount--;
>
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -1);
> + xfs_trans_mod_ifree(tp, -1);
>
> error = xfs_check_agi_freecount(icur);
> if (error)
> @@ -2139,8 +2139,8 @@ xfs_difree_inobt(
> xfs_ialloc_log_agi(tp, agbp, XFS_AGI_COUNT | XFS_AGI_FREECOUNT);
> pag->pagi_freecount -= ilen - 1;
> pag->pagi_count -= ilen;
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_ICOUNT, -ilen);
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, -(ilen - 1));
> + xfs_trans_mod_icount(tp, -ilen);
> + xfs_trans_mod_ifree(tp, -(ilen - 1));
>
> if ((error = xfs_btree_delete(cur, &i))) {
> xfs_warn(mp, "%s: xfs_btree_delete returned error %d.",
> @@ -2167,7 +2167,7 @@ xfs_difree_inobt(
> be32_add_cpu(&agi->agi_freecount, 1);
> xfs_ialloc_log_agi(tp, agbp, XFS_AGI_FREECOUNT);
> pag->pagi_freecount++;
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IFREE, 1);
> + xfs_trans_mod_ifree(tp, 1);
> }
>
> error = xfs_check_agi_freecount(cur);
> diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
> index 27a4472402bacd..d0c693a69e0001 100644
> --- a/fs/xfs/libxfs/xfs_rtbitmap.c
> +++ b/fs/xfs/libxfs/xfs_rtbitmap.c
> @@ -989,7 +989,8 @@ xfs_rtfree_extent(
> /*
> * Mark more blocks free in the superblock.
> */
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FREXTENTS, (long)len);
> + xfs_trans_mod_frextents(tp, (long)len, false);
> +
> /*
> * If we've now freed all the blocks, reset the file sequence
> * number to 0.
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 45a32ea426164a..6b5a7bfc32dbb8 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -140,16 +140,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
> /* Transaction has locked the rtbitmap and rtsum inodes */
> #define XFS_TRANS_RTBITMAP_LOCKED (1u << 9)
>
> -/*
> - * Field values for xfs_trans_mod_sb.
> - */
> -#define XFS_TRANS_SB_ICOUNT 0x00000001
> -#define XFS_TRANS_SB_IFREE 0x00000002
> -#define XFS_TRANS_SB_FDBLOCKS 0x00000004
> -#define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008
> -#define XFS_TRANS_SB_FREXTENTS 0x00000010
> -#define XFS_TRANS_SB_RES_FREXTENTS 0x00000020
> -
> /*
> * Here we centralize the specification of XFS meta-data buffer reference count
> * values. This determines how hard the buffer cache tries to hold onto the
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 4168ccf21068cb..ac88a38c6cd522 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -212,7 +212,7 @@ xfs_growfs_data_private(
> goto out_trans_cancel;
>
> if (id.nfree)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
> + xfs_trans_mod_fdblocks(tp, id.nfree, false);
>
> error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
> if (error)
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 994e5efedab20f..07f6008db322cb 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -838,7 +838,7 @@ xfs_growfs_rt_bmblock(
> xfs_rtbuf_cache_relse(&nargs);
> if (error)
> goto out_cancel;
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
> + xfs_trans_mod_frextents(args.tp, freed_rtx, false);
>
> error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
> if (error)
> @@ -1335,9 +1335,7 @@ xfs_rtallocate(
> if (error)
> goto out_release;
>
> - xfs_trans_mod_sb(tp, wasdel ?
> - XFS_TRANS_SB_RES_FREXTENTS : XFS_TRANS_SB_FREXTENTS,
> - -(long)len);
> + xfs_trans_mod_frextents(tp, -(long)len, wasdel);
> *bno = xfs_rtx_to_rtb(args.mp, rtx);
> *blen = xfs_rtxlen_to_extlen(args.mp, len);
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 56505cb94f877d..fa133535235d4c 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -334,48 +334,43 @@ xfs_trans_alloc_empty(
> return xfs_trans_alloc(mp, &resv, 0, 0, XFS_TRANS_NO_WRITECOUNT, tpp);
> }
>
> -/*
> - * Record the indicated change to the given field for application
> - * to the file system's superblock when the transaction commits.
> - * For now, just store the change in the transaction structure.
> - *
> - * Mark the transaction structure to indicate that the superblock
> - * needs to be updated before committing.
> - *
> - * Because we may not be keeping track of allocated/free inodes and
> - * used filesystem blocks in the superblock, we do not mark the
> - * superblock dirty in this transaction if we modify these fields.
> - * We still need to update the transaction deltas so that they get
> - * applied to the incore superblock, but we don't want them to
> - * cause the superblock to get locked and logged if these are the
> - * only fields in the superblock that the transaction modifies.
> - */
> void
> -xfs_trans_mod_sb(
> - xfs_trans_t *tp,
> - uint field,
> - int64_t delta)
> +xfs_trans_mod_icount(
> + struct xfs_trans *tp,
> + int64_t delta)
> +{
> + tp->t_icount_delta += delta;
> + tp->t_flags |= XFS_TRANS_DIRTY;
> + if (!xfs_has_lazysbcount(tp->t_mountp))
> + tp->t_flags |= XFS_TRANS_SB_DIRTY;
> +}
> +
> +void
> +xfs_trans_mod_ifree(
> + struct xfs_trans *tp,
> + int64_t delta)
> {
> - uint32_t flags = (XFS_TRANS_DIRTY|XFS_TRANS_SB_DIRTY);
> - xfs_mount_t *mp = tp->t_mountp;
> -
> - switch (field) {
> - case XFS_TRANS_SB_ICOUNT:
> - tp->t_icount_delta += delta;
> - if (xfs_has_lazysbcount(mp))
> - flags &= ~XFS_TRANS_SB_DIRTY;
> - break;
> - case XFS_TRANS_SB_IFREE:
> - tp->t_ifree_delta += delta;
> - if (xfs_has_lazysbcount(mp))
> - flags &= ~XFS_TRANS_SB_DIRTY;
> - break;
> - case XFS_TRANS_SB_FDBLOCKS:
> + tp->t_ifree_delta += delta;
> + tp->t_flags |= XFS_TRANS_DIRTY;
> + if (!xfs_has_lazysbcount(tp->t_mountp))
> + tp->t_flags |= XFS_TRANS_SB_DIRTY;
> +}
> +
> +void
> +xfs_trans_mod_fdblocks(
> + struct xfs_trans *tp,
> + int64_t delta,
> + bool wasdel)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> +
> + if (wasdel) {
> /*
> - * Track the number of blocks allocated in the transaction.
> - * Make sure it does not exceed the number reserved. If so,
> - * shutdown as this can lead to accounting inconsistency.
> + * The allocation has already been applied to the in-core
> + * counter, only apply it to the on-disk superblock.
> */
> + tp->t_res_fdblocks_delta += delta;
> + } else {
> if (delta < 0) {
> tp->t_blk_res_used += (uint)-delta;
> if (tp->t_blk_res_used > tp->t_blk_res)
> @@ -396,55 +391,40 @@ xfs_trans_mod_sb(
> delta -= blkres_delta;
> }
> tp->t_fdblocks_delta += delta;
> - if (xfs_has_lazysbcount(mp))
> - flags &= ~XFS_TRANS_SB_DIRTY;
> - break;
> - case XFS_TRANS_SB_RES_FDBLOCKS:
> - /*
> - * The allocation has already been applied to the
> - * in-core superblock's counter. This should only
> - * be applied to the on-disk superblock.
> - */
> - tp->t_res_fdblocks_delta += delta;
> - if (xfs_has_lazysbcount(mp))
> - flags &= ~XFS_TRANS_SB_DIRTY;
> - break;
> - case XFS_TRANS_SB_FREXTENTS:
> + }
> +
> + tp->t_flags |= XFS_TRANS_DIRTY;
> + if (!xfs_has_lazysbcount(mp))
> + tp->t_flags |= XFS_TRANS_SB_DIRTY;
> +}
> +
> +void
> +xfs_trans_mod_frextents(
> + struct xfs_trans *tp,
> + int64_t delta,
> + bool wasdel)
> +{
> + if (wasdel) {
> /*
> - * Track the number of blocks allocated in the
> - * transaction. Make sure it does not exceed the
> - * number reserved.
> + * The allocation has already been applied to the in-core
> + * counter, only apply it to the on-disk superblock.
> */
> + ASSERT(delta < 0);
> + tp->t_res_frextents_delta += delta;
> + } else {
> if (delta < 0) {
> tp->t_rtx_res_used += (uint)-delta;
> ASSERT(tp->t_rtx_res_used <= tp->t_rtx_res);
> }
> tp->t_frextents_delta += delta;
> - break;
> - case XFS_TRANS_SB_RES_FREXTENTS:
> - /*
> - * The allocation has already been applied to the
> - * in-core superblock's counter. This should only
> - * be applied to the on-disk superblock.
> - */
> - ASSERT(delta < 0);
> - tp->t_res_frextents_delta += delta;
> - break;
> - default:
> - ASSERT(0);
> - return;
> }
>
> - tp->t_flags |= flags;
> + tp->t_flags |= (XFS_TRANS_DIRTY | XFS_TRANS_SB_DIRTY);
> }
>
> /*
> - * xfs_trans_apply_sb_deltas() is called from the commit code
> - * to bring the superblock buffer into the current transaction
> - * and modify it as requested by earlier calls to xfs_trans_mod_sb().
> - *
> - * For now we just look at each field allowed to change and change
> - * it if necessary.
> + * Called from the commit code to bring the superblock buffer into the current
> + * transaction and modify it based on earlier calls to xfs_trans_mod_*().
> */
> STATIC void
> xfs_trans_apply_sb_deltas(
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index e5911cf09be444..a2cee42368bd25 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -162,7 +162,12 @@ int xfs_trans_reserve_more(struct xfs_trans *tp,
> unsigned int blocks, unsigned int rtextents);
> int xfs_trans_alloc_empty(struct xfs_mount *mp,
> struct xfs_trans **tpp);
> -void xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
> +void xfs_trans_mod_icount(struct xfs_trans *tp, int64_t delta);
> +void xfs_trans_mod_ifree(struct xfs_trans *tp, int64_t delta);
> +void xfs_trans_mod_fdblocks(struct xfs_trans *tp, int64_t delta,
> + bool wasdel);
> +void xfs_trans_mod_frextents(struct xfs_trans *tp, int64_t delta,
> + bool wasdel);
>
> int xfs_trans_get_buf_map(struct xfs_trans *tp, struct xfs_buftarg *target,
> struct xfs_buf_map *map, int nmaps, xfs_buf_flags_t flags,
> diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
> index b368e13424c4f4..839eb1780d4694 100644
> --- a/fs/xfs/xfs_trans_dquot.c
> +++ b/fs/xfs/xfs_trans_dquot.c
> @@ -288,7 +288,7 @@ xfs_trans_get_dqtrx(
>
> /*
> * Make the changes in the transaction structure.
> - * The moral equivalent to xfs_trans_mod_sb().
> + *
> * We don't touch any fields in the dquot, so we don't care
> * if it's locked or not (most of the time it won't be).
> */
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
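The shape of the split this patch performs — one typed helper per counter instead of one dispatcher switching on a flags word — can be sketched outside the kernel as follows. The toy_* types and helpers are invented for illustration and deliberately omit the block reservation and lazysbcount logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy transaction with the per-counter deltas kept by the split helpers. */
struct toy_trans {
	int64_t t_icount_delta;
	int64_t t_fdblocks_delta;
	int64_t t_res_fdblocks_delta;	/* on-disk only change */
	unsigned int t_flags;
};
#define TOY_TRANS_DIRTY		(1u << 0)

/* One helper per counter replaces the old switch on a flags argument. */
static void toy_mod_icount(struct toy_trans *tp, int64_t delta)
{
	tp->t_icount_delta += delta;
	tp->t_flags |= TOY_TRANS_DIRTY;
}

/*
 * "wasdel" routes delayed-allocation changes to the on-disk-only delta,
 * mirroring how the patch folds the RES_FDBLOCKS case into one helper.
 */
static void toy_mod_fdblocks(struct toy_trans *tp, int64_t delta, bool wasdel)
{
	if (wasdel)
		tp->t_res_fdblocks_delta += delta;
	else
		tp->t_fdblocks_delta += delta;
	tp->t_flags |= TOY_TRANS_DIRTY;
}
```

Compared with a single xfs_trans_mod_sb()-style entry point, the callers become self-documenting (no flag constants to look up) at the cost of one extra boolean parameter where the delalloc and non-delalloc paths are merged — which is exactly the trade-off Brian questions above.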
* Re: [PATCH 3/7] xfs: update the file system geometry after recovering superblock buffers
2024-10-01 8:49 ` Christoph Hellwig
@ 2024-10-10 16:02 ` Darrick J. Wong
0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-10 16:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Tue, Oct 01, 2024 at 10:49:18AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 30, 2024 at 09:50:19AM -0700, Darrick J. Wong wrote:
> > > +int
> > > +xlog_recover_update_agcount(
> > > + struct xfs_mount *mp,
> > > + struct xfs_dsb *dsb)
> > > +{
> > > + xfs_agnumber_t old_agcount = mp->m_sb.sb_agcount;
> > > + int error;
> > > +
> > > + xfs_sb_from_disk(&mp->m_sb, dsb);
> >
> > If I'm understanding this correctly, the incore superblock gets updated
> > every time recovery finds a logged primary superblock buffer now,
> > instead of once at the end of log recovery, right?
>
> Yes.
>
> > Shouldn't this conversion be done in the caller? Some day we're going
> > to want to do the same with xfs_initialize_rtgroups(), right?
>
> Yeah. But the right "fix" for that is probably to just rename
> this function :) Probably even for the next repost instead of
> waiting for more features.
Forgot to reply to this, but yes, how about
xlog_recover_update_group_count?
--D
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-09-30 16:41 ` [PATCH 6/7] xfs: don't update file system geometry through transaction deltas Christoph Hellwig
2024-10-10 14:05 ` Brian Foster
@ 2024-10-10 19:01 ` Darrick J. Wong
2024-10-11 7:59 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-10 19:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Mon, Sep 30, 2024 at 06:41:47PM +0200, Christoph Hellwig wrote:
> Updates to the file system geometry in growfs need to be committed to
> stable store before the allocator can see them, so that they do not end
> up in the same CIL checkpoint as transactions that make use of the new
> information, which would make recovery either impossible or broken.
>
> To do this add two new helpers to prepare a superblock for direct
> manipulation of the on-disk buffer, and to commit these updates while
> holding the buffer locked (similar to what xfs_sync_sb_buf does) and use
> those in growfs instead of applying the changes through the deltas in the
> xfs_trans structure (which also happens to shrink the xfs_trans structure
> a fair bit).
Yay for shrinking xfs_trans!
> The rtbitmap repair code also used the transaction deltas and is
> converted to update the superblock buffer directly under the buffer
> lock as well.
>
> This new method establishes a locking protocol where even in-core
> superblock fields must only be updated with the superblock buffer
> locked. For now it is only applied to affected geometry fields,
> but in the future it would make sense to apply it universally.
Hmm. One thing that I don't quite like here is the separation between
updating the ondisk sb fields and updating the incore sb/recomputing the
cached geometry fields. I think that's been handled correctly here, but
the pending changes in growfsrt for rtgroups is going to make this more
ugly.
What if instead this took the form of a new defer_ops type? The
xfs_prepare_sb_update function would allocate a tracking object where
we'd pin the sb buffer and record which fields get changed, as well as
the new values. xfs_commit_sb_update then xfs_defer_add()s it to the
transaction and commits it. (The ->create_intent function would return
NULL so that no log item is created.)
The ->finish_item function would then bhold the sb buffer, update the
ondisk super like how xfs_commit_sb_update does in this patch, set
XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and
flush that transaction and call ->finish_item again, at which point it
would recompute the incore/cached geometry as necessary, bwrite the sb
buffer, and release it.
The downside is that it's more complexity, but the upside is that the
geometry changes are contained in one place instead of being scattered
around, and the incore changes only happen if the synchronous
transaction actually gets written to disk. IOWs, the end result is the
same as what you propose here, but structured differently.
I guess the biggest downside is that log recovery has to call the
incore/cached geometry recomputation function directly because there's
no actual log intent item to recover.
(The code changes themselves look acceptable to me.)
--D
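Darrick's two-phase ->finish_item flow — first pass updates the ondisk buffer and returns -EAGAIN to force a synchronous commit, second pass applies the incore update once the changes are stable — can be sketched as a toy state machine. Everything below is hypothetical and only models the control flow, not the real defer-ops machinery:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EAGAIN 11

/* Hypothetical deferred superblock update tracking object. */
struct toy_sb_update {
	bool ondisk_done;	/* ondisk sb written, sync commit issued */
	bool incore_done;	/* incore/cached geometry recomputed */
};

/* Two-phase finish: ask the defer loop to roll and call us again. */
static int toy_finish_item(struct toy_sb_update *su)
{
	if (!su->ondisk_done) {
		su->ondisk_done = true;	/* update ondisk sb, set sync flag */
		return -TOY_EAGAIN;	/* request commit + retry */
	}
	su->incore_done = true;		/* safe: changes are on disk now */
	return 0;
}

/* The defer-ops loop keeps finishing the item until it succeeds. */
static int toy_defer_finish(struct toy_sb_update *su)
{
	int error;

	while ((error = toy_finish_item(su)) == -TOY_EAGAIN)
		;	/* a real loop would commit and roll the transaction */
	return error;
}
```

This makes the upside concrete: the incore update only ever runs after the pass that wrote the ondisk buffer has committed, so both halves of the geometry change live in one ->finish_item function rather than being scattered across the growfs callers.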
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> fs/xfs/libxfs/xfs_sb.c | 97 ++++++++++++++++++++++++-------
> fs/xfs/libxfs/xfs_sb.h | 3 +
> fs/xfs/libxfs/xfs_shared.h | 8 ---
> fs/xfs/scrub/rtbitmap_repair.c | 26 +++++----
> fs/xfs/xfs_fsops.c | 80 ++++++++++++++++----------
> fs/xfs/xfs_rtalloc.c | 92 +++++++++++++++++-------------
> fs/xfs/xfs_trans.c | 101 ++-------------------------------
> fs/xfs/xfs_trans.h | 8 ---
> 8 files changed, 198 insertions(+), 217 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index d95409f3cba667..2c83ab7441ade5 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1025,6 +1025,80 @@ xfs_sb_mount_common(
> mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> }
>
> +/*
> + * Mirror the lazy sb counters to the in-core superblock.
> + *
> + * If this is at unmount, the counters will be exactly correct, but at any other
> + * time they will only be ballpark correct because of reservations that have
> + * been taken out of the percpu counters. If we have an unclean shutdown, this
> + * corrected by log recovery rebuilding the counters from the AGF block counts.
> + *
> + * Do not update sb_frextents here because it is not part of the lazy sb
> + * counters, despite having a percpu counter. It is always kept consistent with
> + * the ondisk rtbitmap by xfs_trans_apply_sb_deltas() and hence we don't
> + * have to update it here.
> + */
> +static void
> +xfs_flush_sb_counters(
> + struct xfs_mount *mp)
> +{
> + if (xfs_has_lazysbcount(mp)) {
> + mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> + mp->m_sb.sb_ifree = min_t(uint64_t,
> + percpu_counter_sum_positive(&mp->m_ifree),
> + mp->m_sb.sb_icount);
> + mp->m_sb.sb_fdblocks =
> + percpu_counter_sum_positive(&mp->m_fdblocks);
> + }
> +}
> +
> +/*
> + * Prepare a direct update to the superblock through the on-disk buffer.
> + *
> + * This locks out other modifications through the buffer lock and then syncs all
> + * in-core values to the on-disk buffer (including the percpu counters).
> + *
> + * The caller then directly manipulates the on-disk fields and calls
> + * xfs_commit_sb_update to write the updates to disk. The caller is also
> + * responsible for updating the in-core fields, but can do so after the transaction has
> + * been committed to disk.
> + *
> + * Updating the in-core field only after xfs_commit_sb_update ensures that other
> + * processes only see the update once it is stable on disk, and is usually the
> + * right thing to do for superblock updates.
> + *
> + * Note that writes to superblock fields updated using this helper are
> + * synchronized using the superblock buffer lock, which must be taken around
> + * all updates to the in-core fields as well.
> + */
> +struct xfs_dsb *
> +xfs_prepare_sb_update(
> + struct xfs_trans *tp,
> + struct xfs_buf **bpp)
> +{
> + *bpp = xfs_trans_getsb(tp);
> + xfs_flush_sb_counters(tp->t_mountp);
> + xfs_sb_to_disk((*bpp)->b_addr, &tp->t_mountp->m_sb);
> + return (*bpp)->b_addr;
> +}
> +
> +/*
> + * Commit a direct update to the on-disk superblock. Keeps @bp locked and
> + * referenced, so the caller must call xfs_buf_relse() manually.
> + */
> +int
> +xfs_commit_sb_update(
> + struct xfs_trans *tp,
> + struct xfs_buf *bp)
> +{
> + xfs_trans_bhold(tp, bp);
> + xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> + xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> +
> + xfs_trans_set_sync(tp);
> + return xfs_trans_commit(tp);
> +}
> +
> /*
> * xfs_log_sb() can be used to copy arbitrary changes to the in-core superblock
> * into the superblock buffer to be logged. It does not provide the higher
> @@ -1038,28 +1112,7 @@ xfs_log_sb(
> struct xfs_mount *mp = tp->t_mountp;
> struct xfs_buf *bp = xfs_trans_getsb(tp);
>
> - /*
> - * Lazy sb counters don't update the in-core superblock so do that now.
> - * If this is at unmount, the counters will be exactly correct, but at
> - * any other time they will only be ballpark correct because of
> - * reservations that have been taken out percpu counters. If we have an
> - * unclean shutdown, this will be corrected by log recovery rebuilding
> - * the counters from the AGF block counts.
> - *
> - * Do not update sb_frextents here because it is not part of the lazy
> - * sb counters, despite having a percpu counter. It is always kept
> - * consistent with the ondisk rtbitmap by xfs_trans_apply_sb_deltas()
> - * and hence we don't need have to update it here.
> - */
> - if (xfs_has_lazysbcount(mp)) {
> - mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
> - mp->m_sb.sb_ifree = min_t(uint64_t,
> - percpu_counter_sum_positive(&mp->m_ifree),
> - mp->m_sb.sb_icount);
> - mp->m_sb.sb_fdblocks =
> - percpu_counter_sum_positive(&mp->m_fdblocks);
> - }
> -
> + xfs_flush_sb_counters(mp);
> xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
> index 885c837559914d..3649d071687e33 100644
> --- a/fs/xfs/libxfs/xfs_sb.h
> +++ b/fs/xfs/libxfs/xfs_sb.h
> @@ -13,6 +13,9 @@ struct xfs_trans;
> struct xfs_fsop_geom;
> struct xfs_perag;
>
> +struct xfs_dsb *xfs_prepare_sb_update(struct xfs_trans *tp,
> + struct xfs_buf **bpp);
> +int xfs_commit_sb_update(struct xfs_trans *tp, struct xfs_buf *bp);
> extern void xfs_log_sb(struct xfs_trans *tp);
> extern int xfs_sync_sb(struct xfs_mount *mp, bool wait);
> extern int xfs_sync_sb_buf(struct xfs_mount *mp);
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 33b84a3a83ff63..45a32ea426164a 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -149,14 +149,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
> #define XFS_TRANS_SB_RES_FDBLOCKS 0x00000008
> #define XFS_TRANS_SB_FREXTENTS 0x00000010
> #define XFS_TRANS_SB_RES_FREXTENTS 0x00000020
> -#define XFS_TRANS_SB_DBLOCKS 0x00000040
> -#define XFS_TRANS_SB_AGCOUNT 0x00000080
> -#define XFS_TRANS_SB_IMAXPCT 0x00000100
> -#define XFS_TRANS_SB_REXTSIZE 0x00000200
> -#define XFS_TRANS_SB_RBMBLOCKS 0x00000400
> -#define XFS_TRANS_SB_RBLOCKS 0x00000800
> -#define XFS_TRANS_SB_REXTENTS 0x00001000
> -#define XFS_TRANS_SB_REXTSLOG 0x00002000
>
> /*
> * Here we centralize the specification of XFS meta-data buffer reference count
> diff --git a/fs/xfs/scrub/rtbitmap_repair.c b/fs/xfs/scrub/rtbitmap_repair.c
> index 0fef98e9f83409..be9d31f032b1bf 100644
> --- a/fs/xfs/scrub/rtbitmap_repair.c
> +++ b/fs/xfs/scrub/rtbitmap_repair.c
> @@ -16,6 +16,7 @@
> #include "xfs_bit.h"
> #include "xfs_bmap.h"
> #include "xfs_bmap_btree.h"
> +#include "xfs_sb.h"
> #include "scrub/scrub.h"
> #include "scrub/common.h"
> #include "scrub/trace.h"
> @@ -127,20 +128,21 @@ xrep_rtbitmap_geometry(
> struct xchk_rtbitmap *rtb)
> {
> struct xfs_mount *mp = sc->mp;
> - struct xfs_trans *tp = sc->tp;
>
> /* Superblock fields */
> - if (mp->m_sb.sb_rextents != rtb->rextents)
> - xfs_trans_mod_sb(sc->tp, XFS_TRANS_SB_REXTENTS,
> - rtb->rextents - mp->m_sb.sb_rextents);
> -
> - if (mp->m_sb.sb_rbmblocks != rtb->rbmblocks)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_RBMBLOCKS,
> - rtb->rbmblocks - mp->m_sb.sb_rbmblocks);
> -
> - if (mp->m_sb.sb_rextslog != rtb->rextslog)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_REXTSLOG,
> - rtb->rextslog - mp->m_sb.sb_rextslog);
> + if (mp->m_sb.sb_rextents != rtb->rextents ||
> + mp->m_sb.sb_rbmblocks != rtb->rbmblocks ||
> + mp->m_sb.sb_rextslog != rtb->rextslog) {
> + struct xfs_buf *bp = xfs_trans_getsb(sc->tp);
> +
> + mp->m_sb.sb_rextents = rtb->rextents;
> + mp->m_sb.sb_rbmblocks = rtb->rbmblocks;
> + mp->m_sb.sb_rextslog = rtb->rextslog;
> + xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
> +
> + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_SB_BUF);
> + xfs_trans_log_buf(sc->tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> + }
>
> /* Fix broken isize */
> sc->ip->i_disk_size = roundup_64(sc->ip->i_disk_size,
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b247d895c276d2..4168ccf21068cb 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -79,6 +79,46 @@ xfs_resizefs_init_new_ags(
> return error;
> }
>
> +static int
> +xfs_growfs_data_update_sb(
> + struct xfs_trans *tp,
> + xfs_agnumber_t nagcount,
> + xfs_rfsblock_t nb,
> + xfs_agnumber_t nagimax)
> +{
> + struct xfs_mount *mp = tp->t_mountp;
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> + int error;
> +
> + /*
> + * Update the geometry in the on-disk superblock first, and ensure
> + * they make it to disk before the superblock can be relogged.
> + */
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_agcount = cpu_to_be32(nagcount);
> + sbp->sb_dblocks = cpu_to_be64(nb);
> + error = xfs_commit_sb_update(tp, bp);
> + if (error)
> + goto out_unlock;
> +
> + /*
> + * Propagate the new values to the live mount structure after they made
> + * it to disk with the superblock buffer still locked.
> + */
> + mp->m_sb.sb_agcount = nagcount;
> + mp->m_sb.sb_dblocks = nb;
> +
> + if (nagimax)
> + mp->m_maxagi = nagimax;
> + xfs_set_low_space_thresholds(mp);
> + mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +
> +out_unlock:
> + xfs_buf_relse(bp);
> + return error;
> +}
> +
> /*
> * growfs operations
> */
> @@ -171,37 +211,13 @@ xfs_growfs_data_private(
> if (error)
> goto out_trans_cancel;
>
> - /*
> - * Update changed superblock fields transactionally. These are not
> - * seen by the rest of the world until the transaction commit applies
> - * them atomically to the superblock.
> - */
> - if (nagcount > oagcount)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> - if (delta)
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
> if (id.nfree)
> xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
>
> - /*
> - * Sync sb counters now to reflect the updated values. This is
> - * particularly important for shrink because the write verifier
> - * will fail if sb_fdblocks is ever larger than sb_dblocks.
> - */
> - if (xfs_has_lazysbcount(mp))
> - xfs_log_sb(tp);
> -
> - xfs_trans_set_sync(tp);
> - error = xfs_trans_commit(tp);
> + error = xfs_growfs_data_update_sb(tp, nagcount, nb, nagimax);
> if (error)
> return error;
>
> - /* New allocation groups fully initialized, so update mount struct */
> - if (nagimax)
> - mp->m_maxagi = nagimax;
> - xfs_set_low_space_thresholds(mp);
> - mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> -
> if (delta > 0) {
> /*
> * If we expanded the last AG, free the per-AG reservation
> @@ -260,8 +276,9 @@ xfs_growfs_imaxpct(
> struct xfs_mount *mp,
> __u32 imaxpct)
> {
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> struct xfs_trans *tp;
> - int dpct;
> int error;
>
> if (imaxpct > 100)
> @@ -272,10 +289,13 @@ xfs_growfs_imaxpct(
> if (error)
> return error;
>
> - dpct = imaxpct - mp->m_sb.sb_imax_pct;
> - xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
> - xfs_trans_set_sync(tp);
> - return xfs_trans_commit(tp);
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_imax_pct = imaxpct;
> + error = xfs_commit_sb_update(tp, bp);
> + if (!error)
> + mp->m_sb.sb_imax_pct = imaxpct;
> + xfs_buf_relse(bp);
> + return error;
> }
>
> /*
> diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
> index 3a2005a1e673dc..994e5efedab20f 100644
> --- a/fs/xfs/xfs_rtalloc.c
> +++ b/fs/xfs/xfs_rtalloc.c
> @@ -698,6 +698,56 @@ xfs_growfs_rt_fixup_extsize(
> return error;
> }
>
> +static int
> +xfs_growfs_rt_update_sb(
> + struct xfs_trans *tp,
> + struct xfs_mount *mp,
> + struct xfs_mount *nmp,
> + xfs_rtbxlen_t freed_rtx)
> +{
> + struct xfs_dsb *sbp;
> + struct xfs_buf *bp;
> + int error;
> +
> + /*
> + * Update the geometry in the on-disk superblock first, and ensure
> + * they make it to disk before the superblock can be relogged.
> + */
> + sbp = xfs_prepare_sb_update(tp, &bp);
> + sbp->sb_rextsize = cpu_to_be32(nmp->m_sb.sb_rextsize);
> + sbp->sb_rbmblocks = cpu_to_be32(nmp->m_sb.sb_rbmblocks);
> + sbp->sb_rblocks = cpu_to_be64(nmp->m_sb.sb_rblocks);
> + sbp->sb_rextents = cpu_to_be64(nmp->m_sb.sb_rextents);
> + sbp->sb_rextslog = nmp->m_sb.sb_rextslog;
> + error = xfs_commit_sb_update(tp, bp);
> + if (error)
> + return error;
> +
> + /*
> + * Propagate the new values to the live mount structure after they made
> + * it to disk with the superblock buffer still locked.
> + */
> + mp->m_sb.sb_rextsize = nmp->m_sb.sb_rextsize;
> + mp->m_sb.sb_rbmblocks = nmp->m_sb.sb_rbmblocks;
> + mp->m_sb.sb_rblocks = nmp->m_sb.sb_rblocks;
> + mp->m_sb.sb_rextents = nmp->m_sb.sb_rextents;
> + mp->m_sb.sb_rextslog = nmp->m_sb.sb_rextslog;
> + mp->m_rsumlevels = nmp->m_rsumlevels;
> + mp->m_rsumblocks = nmp->m_rsumblocks;
> +
> + /*
> + * Recompute the growfsrt reservation from the new rsumsize.
> + */
> + xfs_trans_resv_calc(mp, &mp->m_resv);
> +
> + /*
> + * Ensure the mount RT feature flag is now set.
> + */
> + mp->m_features |= XFS_FEAT_REALTIME;
> + xfs_buf_relse(bp);
> + return 0;
> +}
> +
> static int
> xfs_growfs_rt_bmblock(
> struct xfs_mount *mp,
> @@ -780,25 +830,6 @@ xfs_growfs_rt_bmblock(
> goto out_cancel;
> }
>
> - /*
> - * Update superblock fields.
> - */
> - if (nmp->m_sb.sb_rextsize != mp->m_sb.sb_rextsize)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSIZE,
> - nmp->m_sb.sb_rextsize - mp->m_sb.sb_rextsize);
> - if (nmp->m_sb.sb_rbmblocks != mp->m_sb.sb_rbmblocks)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBMBLOCKS,
> - nmp->m_sb.sb_rbmblocks - mp->m_sb.sb_rbmblocks);
> - if (nmp->m_sb.sb_rblocks != mp->m_sb.sb_rblocks)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_RBLOCKS,
> - nmp->m_sb.sb_rblocks - mp->m_sb.sb_rblocks);
> - if (nmp->m_sb.sb_rextents != mp->m_sb.sb_rextents)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTENTS,
> - nmp->m_sb.sb_rextents - mp->m_sb.sb_rextents);
> - if (nmp->m_sb.sb_rextslog != mp->m_sb.sb_rextslog)
> - xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_REXTSLOG,
> - nmp->m_sb.sb_rextslog - mp->m_sb.sb_rextslog);
> -
> /*
> * Free the new extent.
> */
> @@ -807,33 +838,12 @@ xfs_growfs_rt_bmblock(
> xfs_rtbuf_cache_relse(&nargs);
> if (error)
> goto out_cancel;
> -
> - /*
> - * Mark more blocks free in the superblock.
> - */
> xfs_trans_mod_sb(args.tp, XFS_TRANS_SB_FREXTENTS, freed_rtx);
>
> - /*
> - * Update the calculated values in the real mount structure.
> - */
> - mp->m_rsumlevels = nmp->m_rsumlevels;
> - mp->m_rsumblocks = nmp->m_rsumblocks;
> - xfs_mount_sb_set_rextsize(mp, &mp->m_sb);
> -
> - /*
> - * Recompute the growfsrt reservation from the new rsumsize.
> - */
> - xfs_trans_resv_calc(mp, &mp->m_resv);
> -
> - error = xfs_trans_commit(args.tp);
> + error = xfs_growfs_rt_update_sb(args.tp, mp, nmp, freed_rtx);
> if (error)
> goto out_free;
>
> - /*
> - * Ensure the mount RT feature flag is now set.
> - */
> - mp->m_features |= XFS_FEAT_REALTIME;
> -
> kfree(nmp);
> return 0;
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bdf3704dc30118..56505cb94f877d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -430,31 +430,6 @@ xfs_trans_mod_sb(
> ASSERT(delta < 0);
> tp->t_res_frextents_delta += delta;
> break;
> - case XFS_TRANS_SB_DBLOCKS:
> - tp->t_dblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_AGCOUNT:
> - ASSERT(delta > 0);
> - tp->t_agcount_delta += delta;
> - break;
> - case XFS_TRANS_SB_IMAXPCT:
> - tp->t_imaxpct_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTSIZE:
> - tp->t_rextsize_delta += delta;
> - break;
> - case XFS_TRANS_SB_RBMBLOCKS:
> - tp->t_rbmblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_RBLOCKS:
> - tp->t_rblocks_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTENTS:
> - tp->t_rextents_delta += delta;
> - break;
> - case XFS_TRANS_SB_REXTSLOG:
> - tp->t_rextslog_delta += delta;
> - break;
> default:
> ASSERT(0);
> return;
> @@ -475,12 +450,8 @@ STATIC void
> xfs_trans_apply_sb_deltas(
> xfs_trans_t *tp)
> {
> - struct xfs_dsb *sbp;
> - struct xfs_buf *bp;
> - int whole = 0;
> -
> - bp = xfs_trans_getsb(tp);
> - sbp = bp->b_addr;
> + struct xfs_buf *bp = xfs_trans_getsb(tp);
> + struct xfs_dsb *sbp = bp->b_addr;
>
> /*
> * Only update the superblock counters if we are logging them
> @@ -522,53 +493,10 @@ xfs_trans_apply_sb_deltas(
> spin_unlock(&mp->m_sb_lock);
> }
>
> - if (tp->t_dblocks_delta) {
> - be64_add_cpu(&sbp->sb_dblocks, tp->t_dblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_agcount_delta) {
> - be32_add_cpu(&sbp->sb_agcount, tp->t_agcount_delta);
> - whole = 1;
> - }
> - if (tp->t_imaxpct_delta) {
> - sbp->sb_imax_pct += tp->t_imaxpct_delta;
> - whole = 1;
> - }
> - if (tp->t_rextsize_delta) {
> - be32_add_cpu(&sbp->sb_rextsize, tp->t_rextsize_delta);
> - whole = 1;
> - }
> - if (tp->t_rbmblocks_delta) {
> - be32_add_cpu(&sbp->sb_rbmblocks, tp->t_rbmblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_rblocks_delta) {
> - be64_add_cpu(&sbp->sb_rblocks, tp->t_rblocks_delta);
> - whole = 1;
> - }
> - if (tp->t_rextents_delta) {
> - be64_add_cpu(&sbp->sb_rextents, tp->t_rextents_delta);
> - whole = 1;
> - }
> - if (tp->t_rextslog_delta) {
> - sbp->sb_rextslog += tp->t_rextslog_delta;
> - whole = 1;
> - }
> -
> xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
> - if (whole)
> - /*
> - * Log the whole thing, the fields are noncontiguous.
> - */
> - xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);
> - else
> - /*
> - * Since all the modifiable fields are contiguous, we
> - * can get away with this.
> - */
> - xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> - offsetof(struct xfs_dsb, sb_frextents) +
> - sizeof(sbp->sb_frextents) - 1);
> + xfs_trans_log_buf(tp, bp, offsetof(struct xfs_dsb, sb_icount),
> + offsetof(struct xfs_dsb, sb_frextents) +
> + sizeof(sbp->sb_frextents) - 1);
> }
>
> /*
> @@ -656,26 +584,7 @@ xfs_trans_unreserve_and_mod_sb(
> * must be consistent with the ondisk rtbitmap and must never include
> * incore reservations.
> */
> - mp->m_sb.sb_dblocks += tp->t_dblocks_delta;
> - mp->m_sb.sb_agcount += tp->t_agcount_delta;
> - mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
> - mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
> - if (tp->t_rextsize_delta) {
> - mp->m_rtxblklog = log2_if_power2(mp->m_sb.sb_rextsize);
> - mp->m_rtxblkmask = mask64_if_power2(mp->m_sb.sb_rextsize);
> - }
> - mp->m_sb.sb_rbmblocks += tp->t_rbmblocks_delta;
> - mp->m_sb.sb_rblocks += tp->t_rblocks_delta;
> - mp->m_sb.sb_rextents += tp->t_rextents_delta;
> - mp->m_sb.sb_rextslog += tp->t_rextslog_delta;
> spin_unlock(&mp->m_sb_lock);
> -
> - /*
> - * Debug checks outside of the spinlock so they don't lock up the
> - * machine if they fail.
> - */
> - ASSERT(mp->m_sb.sb_imax_pct >= 0);
> - ASSERT(mp->m_sb.sb_rextslog >= 0);
> }
>
> /* Add the given log item to the transaction's list of log items. */
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index f06cc0f41665ad..e5911cf09be444 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -140,14 +140,6 @@ typedef struct xfs_trans {
> int64_t t_res_fdblocks_delta; /* on-disk only chg */
> int64_t t_frextents_delta;/* superblock freextents chg*/
> int64_t t_res_frextents_delta; /* on-disk only chg */
> - int64_t t_dblocks_delta;/* superblock dblocks change */
> - int64_t t_agcount_delta;/* superblock agcount change */
> - int64_t t_imaxpct_delta;/* superblock imaxpct change */
> - int64_t t_rextsize_delta;/* superblock rextsize chg */
> - int64_t t_rbmblocks_delta;/* superblock rbmblocks chg */
> - int64_t t_rblocks_delta;/* superblock rblocks change */
> - int64_t t_rextents_delta;/* superblocks rextents chg */
> - int64_t t_rextslog_delta;/* superblocks rextslog chg */
> struct list_head t_items; /* log item descriptors */
> struct list_head t_busy; /* list of busy extents */
> struct list_head t_dfops; /* deferred operations */
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-10-10 14:02 ` Brian Foster
@ 2024-10-11 7:53 ` Christoph Hellwig
2024-10-11 14:01 ` Brian Foster
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-11 7:53 UTC (permalink / raw)
To: Brian Foster
Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs
On Thu, Oct 10, 2024 at 10:02:49AM -0400, Brian Foster wrote:
> > - error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
> > - &mp->m_maxagi);
> > + error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> > + sbp->sb_dblocks, &mp->m_maxagi);
>
> I assume this is because the superblock can change across recovery, but
> code wise this seems kind of easy to misread into thinking the variable
> is the same.
Which variable?
> I think the whole old/new terminology is kind of clunky for
> an interface that is not just for growfs. Maybe it would be more clear
> to use start/end terminology for xfs_initialize_perag(), then it's more
> straightforward that mount would init the full range whereas growfs
> inits a subrange.
fine with me.
> A oneliner comment or s/old_agcount/orig_agcount/ wouldn't hurt here
> either. Actually if that's the only purpose for this call and if you
> already have to sample sb_agcount, maybe just lifting/copying the if
> (old_agcount >= new_agcount) check into the caller would make the logic
> more self-explanatory. Hm?
Sure.
* Re: [PATCH 7/7] xfs: split xfs_trans_mod_sb
2024-10-10 14:06 ` Brian Foster
@ 2024-10-11 7:54 ` Christoph Hellwig
2024-10-11 14:05 ` Brian Foster
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-11 7:54 UTC (permalink / raw)
To: Brian Foster
Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs
On Thu, Oct 10, 2024 at 10:06:15AM -0400, Brian Foster wrote:
> Seems Ok, but not sure I see the point personally. Rather than a single
> helper with flags, we have multiple helpers, some of which still mix
> deltas via an incrementally harder to read boolean param. This seems
> sort of arbitrary to me. Is this to support some future work?
I just find these multiplexers that have no common logic very confusing.
And yes, I also have some changes to share more logic between the
delalloc vs non-delalloc block accounting.
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-10 14:05 ` Brian Foster
@ 2024-10-11 7:57 ` Christoph Hellwig
2024-10-11 14:02 ` Brian Foster
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-11 7:57 UTC (permalink / raw)
To: Brian Foster
Cc: Christoph Hellwig, Chandan Babu R, Darrick J. Wong, linux-xfs
On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> Ok, so we don't want geometry changes transactions in the same CIL
> checkpoint as alloc related transactions that might depend on the
> geometry changes. That seems reasonable and on a first pass I have an
> idea of what this is doing, but the description is kind of vague.
> Obviously this fixes an issue on the recovery side (since I've tested
> it), but it's not quite clear to me from the description and/or logic
> changes how that issue manifests.
>
> Could you elaborate please? For example, is this some kind of race
> situation between an allocation request and a growfs transaction, where
> the former perhaps sees a newly added AG between the time the growfs
> transaction commits (applying the sb deltas) and it actually commits to
> the log due to being a sync transaction, thus allowing an alloc on a new
> AG into the same checkpoint that adds the AG?
This is based on the feedback by Dave on the previous version:
https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
Just doing the perag/in-core sb updates earlier fixes all the issues
with my test case, so I'm not actually sure how to get more updates
into the same checkpoint.  I'll try your exercisers to see if they
could hit that.
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-10 19:01 ` Darrick J. Wong
@ 2024-10-11 7:59 ` Christoph Hellwig
2024-10-11 16:44 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-11 7:59 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> What if instead this took the form of a new defer_ops type? The
> xfs_prepare_sb_update function would allocate a tracking object where
> we'd pin the sb buffer and record which fields get changed, as well as
> the new values. xfs_commit_sb_update then xfs_defer_add()s it to the
> transaction and commits it. (The ->create_intent function would return
> NULL so that no log item is created.)
>
> The ->finish_item function would then bhold the sb buffer, update the
> ondisk super like how xfs_commit_sb_update does in this patch, set
> XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and
> flush that transaction and call ->finish_item again, at which point it
> would recompute the incore/cached geometry as necessary, bwrite the sb
> buffer, and release it.
>
> The downside is that it's more complexity, but the upside is that the
> geometry changes are contained in one place instead of being scattered
> around, and the incore changes only happen if the synchronous
> transaction actually gets written to disk. IOWs, the end result is the
> same as what you propose here, but structured differently.
That sounds like overkill at first, but if we want to move all sb
updates to that model, more structured infrastructure might be very useful.
* Re: [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag
2024-10-11 7:53 ` Christoph Hellwig
@ 2024-10-11 14:01 ` Brian Foster
0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-11 14:01 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Fri, Oct 11, 2024 at 09:53:14AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 10:02:49AM -0400, Brian Foster wrote:
> > > - error = xfs_initialize_perag(mp, sbp->sb_agcount, sbp->sb_dblocks,
> > > - &mp->m_maxagi);
> > > + error = xfs_initialize_perag(mp, old_agcount, sbp->sb_agcount,
> > > + sbp->sb_dblocks, &mp->m_maxagi);
> >
> > I assume this is because the superblock can change across recovery, but
> > code wise this seems kind of easy to misread into thinking the variable
> > is the same.
>
> Which variable?
>
old_agcount vs. sb_agcount: the fact that the value of the latter
might change down in the recovery code isn't immediately obvious. A
oneliner and/or logic check suggested below would clear it up IMO,
thanks.
Brian
> > I think the whole old/new terminology is kind of clunky for
> > an interface that is not just for growfs. Maybe it would be more clear
> > to use start/end terminology for xfs_initialize_perag(), then it's more
> > straightforward that mount would init the full range whereas growfs
> > inits a subrange.
>
> fine with me.
>
> > A oneliner comment or s/old_agcount/orig_agcount/ wouldn't hurt here
> > either. Actually if that's the only purpose for this call and if you
> > already have to sample sb_agcount, maybe just lifting/copying the if
> > (old_agcount >= new_agcount) check into the caller would make the logic
> > more self-explanatory. Hm?
>
> Sure.
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 7:57 ` Christoph Hellwig
@ 2024-10-11 14:02 ` Brian Foster
2024-10-11 17:13 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-11 14:02 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > Ok, so we don't want geometry changes transactions in the same CIL
> > checkpoint as alloc related transactions that might depend on the
> > geometry changes. That seems reasonable and on a first pass I have an
> > idea of what this is doing, but the description is kind of vague.
> > Obviously this fixes an issue on the recovery side (since I've tested
> > it), but it's not quite clear to me from the description and/or logic
> > changes how that issue manifests.
> >
> > Could you elaborate please? For example, is this some kind of race
> > situation between an allocation request and a growfs transaction, where
> > the former perhaps sees a newly added AG between the time the growfs
> > transaction commits (applying the sb deltas) and it actually commits to
> > the log due to being a sync transaction, thus allowing an alloc on a new
> > AG into the same checkpoint that adds the AG?
>
> This is based on the feedback by Dave on the previous version:
>
> https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
>
Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
not sure I'd go straight to this change given the situation...
> Just doing the perag/in-core sb updates earlier fixes all the issues
> with my test case, so I'm not actually sure how to get more updates
> into the same checkpoint. I'll try your exercisers to see if they
> could hit that.
>
Ok, that explains things a bit. My observation is that the first 5
patches or so address the mount failure problem, but from there I'm not
reproducing much difference with or without the final patch. Either way,
I see aborts and splats all over the place, which implies at minimum
this isn't the only issue here.
So given that 1. growfs recovery seems pretty much broken, 2. this
particular patch has no straightforward way to test that it fixes
something and at the same time doesn't break anything else, and 3. we do
have at least one fairly straightforward growfs/recovery test in the
works that reliably explodes, personally I'd suggest to split this work
off into separate series.
It seems reasonable enough to me to get patches 1-5 in asap once they're
fully cleaned up, and then leave the next two as part of a followon
series pending further investigation into these other issues. As part of
that I'd like to know whether the recovery test reproduces (or can be
made to reproduce) the issue this patch presumably fixes, but I'd also
settle for "the grow recovery test now passes reliably and this doesn't
regress it." But once again, just my .02.
Brian
* Re: [PATCH 7/7] xfs: split xfs_trans_mod_sb
2024-10-11 7:54 ` Christoph Hellwig
@ 2024-10-11 14:05 ` Brian Foster
2024-10-11 16:50 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-11 14:05 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, Darrick J. Wong, linux-xfs
On Fri, Oct 11, 2024 at 09:54:08AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 10:06:15AM -0400, Brian Foster wrote:
> > Seems Ok, but not sure I see the point personally. Rather than a single
> > helper with flags, we have multiple helpers, some of which still mix
> > deltas via an incrementally harder to read boolean param. This seems
> > sort of arbitrary to me. Is this to support some future work?
>
> I just find these multiplexers that have no common logic very confusing.
>
> And yes, I also have some changes to share more logic between the
> delalloc vs non-delalloc block accounting.
>
I'm not sure what you mean by no common logic. The original
trans_mod_sb() is basically a big switch statement for modifying the
appropriate transaction delta associated with a superblock field. That
seems logical to me.
Just to be clear, I don't really feel strongly about this one way or the
other. I don't object and I don't think it makes anything worse, and
it's less of a change if half this stuff goes away anyways by changing
how the sb is logged. But I also think sometimes code can seem clearer
because we went through the process of refactoring it (i.e. familiarity
bias) rather than because of what the code ultimately looks like.
*shrug* This is all subjective, I'm sure there are other opinions.
Brian
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 7:59 ` Christoph Hellwig
@ 2024-10-11 16:44 ` Darrick J. Wong
0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-11 16:44 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 09:59:03AM +0200, Christoph Hellwig wrote:
> On Thu, Oct 10, 2024 at 12:01:47PM -0700, Darrick J. Wong wrote:
> > What if instead this took the form of a new defer_ops type? The
> > xfs_prepare_sb_update function would allocate a tracking object where
> > we'd pin the sb buffer and record which fields get changed, as well as
> > the new values. xfs_commit_sb_update then xfs_defer_add()s it to the
> > transaction and commits it. (The ->create_intent function would return
> > NULL so that no log item is created.)
> >
> > The ->finish_item function would then bhold the sb buffer, update the
> > ondisk super like how xfs_commit_sb_update does in this patch, set
> > XFS_SB_TRANS_SYNC, and return -EAGAIN. The defer ops would commit and
> > flush that transaction and call ->finish_item again, at which point it
> > would recompute the incore/cached geometry as necessary, bwrite the sb
> > buffer, and release it.
> >
> > The downside is that it's more complexity, but the upside is that the
> > geometry changes are contained in one place instead of being scattered
> > around, and the incore changes only happen if the synchronous
> > transaction actually gets written to disk. IOWs, the end result is the
> > same as what you propose here, but structured differently.
>
> That sounds like overkill at first, but if we want to move all sb updates
> to that model, more structured infrastructure might be very useful.
<nod> We could just take this as-is and refactor it into the defer item
code once we're done making all the other sb geometry growfsrt updates.
I'd rather do that than rebase *two* entire patchsets just to get the
same results.
--D
* Re: [PATCH 7/7] xfs: split xfs_trans_mod_sb
2024-10-11 14:05 ` Brian Foster
@ 2024-10-11 16:50 ` Darrick J. Wong
0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-11 16:50 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 10:05:33AM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 09:54:08AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 10, 2024 at 10:06:15AM -0400, Brian Foster wrote:
> > > Seems Ok, but not sure I see the point personally. Rather than a single
> > > helper with flags, we have multiple helpers, some of which still mix
> > > deltas via an incrementally harder to read boolean param. This seems
> > > sort of arbitrary to me. Is this to support some future work?
> >
> > I just find these multiplexers that have no common logic very confusing.
> >
> > And yes, I also have some changes to share more logic between the
> > delalloc vs non-delalloc block accounting.
> >
>
> I'm not sure what you mean by no common logic. The original
> trans_mod_sb() is basically a big switch statement for modifying the
> appropriate transaction delta associated with a superblock field. That
> seems logical to me.
>
> Just to be clear, I don't really feel strongly about this one way or the
> other. I don't object and I don't think it makes anything worse, and
> it's less of a change if half this stuff goes away anyways by changing
> how the sb is logged. But I also think sometimes code can seem clearer
> because we went through the process of refactoring it (i.e. familiarity
> bias) rather than because of what the code ultimately looks like.
>
> *shrug* This is all subjective, I'm sure there are other opinions.
I'd rather have separate functions for each field, because
xfs_trans_mod_sb is a giant dispatch function, with almost no shared
logic save the tp->t_flags update at the end.
I'm not in love with the 'wasdel' parameter name, but I don't have a
better suggestion short of splitting them up into even more tiny
functions:
void xfs_trans_mod_res_fdblocks(struct xfs_trans *tp, int64_t delta);
void xfs_trans_mod_fdblocks(struct xfs_trans *tp, int64_t delta);
which is sort of gross since the callers already have a wasdel variable.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> Brian
>
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 14:02 ` Brian Foster
@ 2024-10-11 17:13 ` Darrick J. Wong
2024-10-11 18:41 ` Brian Foster
0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-11 17:13 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > Ok, so we don't want geometry changes transactions in the same CIL
> > > checkpoint as alloc related transactions that might depend on the
> > > geometry changes. That seems reasonable and on a first pass I have an
> > > idea of what this is doing, but the description is kind of vague.
> > > Obviously this fixes an issue on the recovery side (since I've tested
> > > it), but it's not quite clear to me from the description and/or logic
> > > changes how that issue manifests.
> > >
> > > Could you elaborate please? For example, is this some kind of race
> > > situation between an allocation request and a growfs transaction, where
> > > the former perhaps sees a newly added AG between the time the growfs
> > > transaction commits (applying the sb deltas) and it actually commits to
> > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > AG into the same checkpoint that adds the AG?
> >
> > This is based on the feedback by Dave on the previous version:
> >
> > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> >
>
> Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> not sure I'd go straight to this change given the situation...
>
> > Just doing the perag/in-core sb updates earlier fixes all the issues
> > with my test case, so I'm not actually sure how to get more updates
> > into the same checkpoint. I'll try your exercisers to see if they
> > could hit that.
> >
>
> Ok, that explains things a bit. My observation is that the first 5
> patches or so address the mount failure problem, but from there I'm not
> reproducing much difference with or without the final patch.
Does this change to flush the log after committing the new sb fix the
recovery problems on older kernels? I /think/ that's the point of this
patch.
> Either way,
> I see aborts and splats all over the place, which implies at minimum
> this isn't the only issue here.
Ugh. I've recently noticed the long soak logrecovery test VMs have seen
a slight uptick in failure rates -- random blocks that have clearly had
garbage written to them such that recovery tries to read the block to
recover a buffer log item and kaboom. At this point it's unclear if
that's a problem with xfs or somewhere else. :(
> So given that 1. growfs recovery seems pretty much broken, 2. this
> particular patch has no straightforward way to test that it fixes
> something and at the same time doesn't break anything else, and 3. we do
I'm curious, what might break? Was that merely a general comment, or do
you have something specific in mind? (iows: do you see more string to
pull? :))
> have at least one fairly straightforward growfs/recovery test in the
> works that reliably explodes, personally I'd suggest to split this work
> off into separate series.
>
> It seems reasonable enough to me to get patches 1-5 in asap once they're
> fully cleaned up, and then leave the next two as part of a followon
> series pending further investigation into these other issues. As part of
> that I'd like to know whether the recovery test reproduces (or can be
> made to reproduce) the issue this patch presumably fixes, but I'd also
> settle for "the grow recovery test now passes reliably and this doesn't
> regress it." But once again, just my .02.
Yeah, it's too bad there's no good way to test recovery with older
kernels either. :(
--D
> Brian
>
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 17:13 ` Darrick J. Wong
@ 2024-10-11 18:41 ` Brian Foster
2024-10-11 23:12 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-11 18:41 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > checkpoint as alloc related transactions that might depend on the
> > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > idea of what this is doing, but the description is kind of vague.
> > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > it), but it's not quite clear to me from the description and/or logic
> > > > changes how that issue manifests.
> > > >
> > > > Could you elaborate please? For example, is this some kind of race
> > > > situation between an allocation request and a growfs transaction, where
> > > > the former perhaps sees a newly added AG between the time the growfs
> > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > AG into the same checkpoint that adds the AG?
> > >
> > > This is based on the feedback by Dave on the previous version:
> > >
> > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > >
> >
> > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > not sure I'd go straight to this change given the situation...
> >
> > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > with my test case, so I'm not actually sure how to get more updates
> > > into the check checkpoint. I'll try your exercisers if it could hit
> > > that.
> > >
> >
> > Ok, that explains things a bit. My observation is that the first 5
> > patches or so address the mount failure problem, but from there I'm not
> > reproducing much difference with or without the final patch.
>
> Does this change to flush the log after committing the new sb fix the
> recovery problems on older kernels? I /think/ that's the point of this
> patch.
>
I don't follow.. growfs always forced the log via the sync transaction,
right? Or do you mean something else by "change to flush the log?"
I thought the main functional change of this patch was to hold the
superblock buffer locked across the force so nothing else can relog the
new geometry superblock buffer in the same cil checkpoint. Presumably,
the theory is that prevents recovery from seeing updates to different
buffers that depend on the geometry update before the actual sb geometry
update is recovered (because the latter might have been relogged).
Maybe we're saying the same thing..? Or maybe I just misunderstand.
Either way I think this patch could use a more detailed commit log...
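For reference, the sequence I have in mind looks roughly like the following. This is a hand-written, non-compiling sketch of the kernel-internal pattern as I understand it from the discussion, not the actual patch; the reservation used and the exact call sites are assumptions:

```c
/*
 * Sketch: commit the geometry update with the superblock buffer held
 * locked, then let the sync commit force the CIL, so that nobody can
 * relog the sb buffer into the same checkpoint as the geometry change.
 */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, 0, 0, 0, &tp);
if (error)
	return error;

bp = xfs_trans_getsb(tp);	/* sb buffer now locked by this trans */
xfs_sb_to_disk(bp->b_addr, &mp->m_sb);
xfs_trans_log_buf(tp, bp, 0, sizeof(struct xfs_dsb) - 1);

xfs_trans_bhold(tp, bp);	/* keep the buffer locked past commit */
xfs_trans_set_sync(tp);		/* commit forces the log to disk */
error = xfs_trans_commit(tp);

/* only now can anyone relog the sb -- into a later checkpoint */
xfs_buf_relse(bp);
```

The bhold keeps the buffer lock held across the commit and log force, which is what prevents a concurrent transaction from joining the sb buffer to the still-open checkpoint.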
> > Either way,
> > I see aborts and splats all over the place, which implies at minimum
> > this isn't the only issue here.
>
> Ugh. I've recently noticed the long soak logrecovery test vm have seen
> a slight tick up in failure rates -- random blocks that have clearly had
> garbage written to them such that recovery tries to read the block to
> recover a buffer log item and kaboom. At this point it's unclear if
> that's a problem with xfs or somewhere else. :(
>
> > So given that 1. growfs recovery seems pretty much broken, 2. this
> > particular patch has no straightforward way to test that it fixes
> > something and at the same time doesn't break anything else, and 3. we do
>
> I'm curious, what might break? Was that merely a general comment, or do
> you have something specific in mind? (iows: do you see more string to
> pull? :))
>
Just a general comment..
Something related that isn't totally clear to me is the inverse shrink
situation, where dblocks is reduced. I.e., is there some
similar scenario where for example instead of the sb buffer being
relogged past some other buffer update that depends on it, some other
change is relogged past a sb update that invalidates/removes blocks
referenced by the relogged buffer..? If so, does that imply a shrink
should flush the log before the shrink transaction commits to ensure it
lands in a new checkpoint (as opposed to ensuring followon updates land
in a new checkpoint)..?
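To make the "flush before the shrink commits" idea concrete, a rough sketch of what that might look like in kernel terms (an assumption about a possible fix, not actual patch code):

```c
/*
 * Sketch: push everything already in the CIL to disk *before*
 * committing the transaction that shrinks dblocks, so that no
 * earlier relogged item referencing the soon-to-be-removed blocks
 * can land in the same (or a later) checkpoint as the shrink.
 */
error = xfs_log_force(mp, XFS_LOG_SYNC);
if (error)
	return error;

/* now commit the shrink; set_sync puts it in its own checkpoint */
xfs_trans_set_sync(tp);
error = xfs_trans_commit(tp);
```

That would be the mirror image of the grow case, which instead needs to keep follow-on updates out of the growfs checkpoint.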
Anyways, my point is just that if it were me I wouldn't get too deep
into this until some of the reproducible growfs recovery issues are at
least characterized and testing is more sorted out.
The context for testing is here [1]. The TLDR is basically that
Christoph has a targeted test that reproduces the initial mount failure
and I hacked up a more general test that also reproduces it and
additional growfs recovery problems. This test does seem to confirm that
the previous patches address the mount failure issue, but this patch
doesn't seem to prevent any of the other problems produced by the
generic test. That might just mean the test doesn't reproduce what this
fixes, but it's kind of hard to at least regression test something like
this when basic growfs crash-recovery seems pretty much broken.
Brian
[1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > have at least one fairly straightforward growfs/recovery test in the
> > works that reliably explodes, personally I'd suggest to split this work
> > off into separate series.
> >
> > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > fully cleaned up, and then leave the next two as part of a followon
> > series pending further investigation into these other issues. As part of
> > that I'd like to know whether the recovery test reproduces (or can be
> > made to reproduce) the issue this patch presumably fixes, but I'd also
> > settle for "the grow recovery test now passes reliably and this doesn't
> > regress it." But once again, just my .02.
>
> Yeah, it's too bad there's no good way to test recovery with older
> kernels either. :(
>
> --D
>
> > Brian
> >
> >
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 18:41 ` Brian Foster
@ 2024-10-11 23:12 ` Darrick J. Wong
2024-10-11 23:29 ` Darrick J. Wong
2024-10-14 18:50 ` Brian Foster
0 siblings, 2 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-11 23:12 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > checkpoint as alloc related transactions that might depend on the
> > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > idea of what this is doing, but the description is kind of vague.
> > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > changes how that issue manifests.
> > > > >
> > > > > Could you elaborate please? For example, is this some kind of race
> > > > > situation between an allocation request and a growfs transaction, where
> > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > AG into the same checkpoint that adds the AG?
> > > >
> > > > This is based on the feedback by Dave on the previous version:
> > > >
> > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > >
> > >
> > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > not sure I'd go straight to this change given the situation...
> > >
> > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > with my test case, so I'm not actually sure how to get more updates
> > > > into the check checkpoint. I'll try your exercisers if it could hit
> > > > that.
> > > >
> > >
> > > Ok, that explains things a bit. My observation is that the first 5
> > > patches or so address the mount failure problem, but from there I'm not
> > > reproducing much difference with or without the final patch.
> >
> > Does this change to flush the log after committing the new sb fix the
> > recovery problems on older kernels? I /think/ that's the point of this
> > patch.
> >
>
> I don't follow.. growfs always forced the log via the sync transaction,
> right? Or do you mean something else by "change to flush the log?"
I guess I was typing a bit too fast this morning -- "change to flush the
log to disk before anyone else can get their hands on the superblock".
You're right that xfs_log_sb and data-device growfs already do that.
That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
that this patch fixes, right?
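In other words, the missing piece would be something like the following in the growfsrt geometry commit path (a sketch of my reading of the bug, not the actual patch):

```c
/*
 * Sketch: the growfsrt transaction that logs the new rt geometry
 * should commit synchronously, the way xfs_log_sb-based data-device
 * growfs already does, so the geometry change is on disk in the log
 * before any allocation can depend on it.
 */
xfs_trans_set_sync(tp);		/* xfs_trans_commit() now forces the log */
error = xfs_trans_commit(tp);
if (error)
	return error;
```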
> I thought the main functional change of this patch was to hold the
> superblock buffer locked across the force so nothing else can relog the
> new geometry superblock buffer in the same cil checkpoint. Presumably,
> the theory is that prevents recovery from seeing updates to different
> buffers that depend on the geometry update before the actual sb geometry
> update is recovered (because the latter might have been relogged).
>
> Maybe we're saying the same thing..? Or maybe I just misunderstand.
> Either way I think patch could use a more detailed commit log...
<nod> The commit message should point out that we're fixing a real bug
here, which is that growfsrt doesn't force the log to disk when it
commits the new rt geometry.
> > > Either way,
> > > I see aborts and splats all over the place, which implies at minimum
> > > this isn't the only issue here.
> >
> > Ugh. I've recently noticed the long soak logrecovery test vm have seen
> > a slight tick up in failure rates -- random blocks that have clearly had
> > garbage written to them such that recovery tries to read the block to
> > recover a buffer log item and kaboom. At this point it's unclear if
> > that's a problem with xfs or somewhere else. :(
> >
> > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > particular patch has no straightforward way to test that it fixes
> > > something and at the same time doesn't break anything else, and 3. we do
> >
> > I'm curious, what might break? Was that merely a general comment, or do
> > you have something specific in mind? (iows: do you see more string to
> > pull? :))
> >
>
> Just a general comment..
>
> Something related that isn't totally clear to me is what about the
> inverse shrink situation where dblocks is reduced. I.e., is there some
> similar scenario where for example instead of the sb buffer being
> relogged past some other buffer update that depends on it, some other
> change is relogged past a sb update that invalidates/removes blocks
> referenced by the relogged buffer..? If so, does that imply a shrink
> should flush the log before the shrink transaction commits to ensure it
> lands in a new checkpoint (as opposed to ensuring followon updates land
> in a new checkpoint)..?
I think so. Might we want to do that before and after to be careful?
> Anyways, my point is just that if it were me I wouldn't get too deep
> into this until some of the reproducible growfs recovery issues are at
> least characterized and testing is more sorted out.
>
> The context for testing is here [1]. The TLDR is basically that
> Christoph has a targeted test that reproduces the initial mount failure
> and I hacked up a more general test that also reproduces it and
> additional growfs recovery problems. This test does seem to confirm that
> the previous patches address the mount failure issue, but this patch
> doesn't seem to prevent any of the other problems produced by the
> generic test. That might just mean the test doesn't reproduce what this
> fixes, but it's kind of hard to at least regression test something like
> this when basic growfs crash-recovery seems pretty much broken.
Hmm, if you make a variant of that test which formats with an rt device
and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
similar blowups? Let's see what happens if I do that...
--D
> Brian
>
> [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
>
> > > have at least one fairly straightforward growfs/recovery test in the
> > > works that reliably explodes, personally I'd suggest to split this work
> > > off into separate series.
> > >
> > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > fully cleaned up, and then leave the next two as part of a followon
> > > series pending further investigation into these other issues. As part of
> > > that I'd like to know whether the recovery test reproduces (or can be
> > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > settle for "the grow recovery test now passes reliably and this doesn't
> > > regress it." But once again, just my .02.
> >
> > Yeah, it's too bad there's no good way to test recovery with older
> > kernels either. :(
> >
> > --D
> >
> > > Brian
> > >
> > >
> >
>
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 23:12 ` Darrick J. Wong
@ 2024-10-11 23:29 ` Darrick J. Wong
2024-10-14 5:58 ` Christoph Hellwig
2024-10-14 18:50 ` Brian Foster
1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-11 23:29 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > changes how that issue manifests.
> > > > > >
> > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > AG into the same checkpoint that adds the AG?
> > > > >
> > > > > This is based on the feedback by Dave on the previous version:
> > > > >
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > >
> > > >
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > not sure I'd go straight to this change given the situation...
> > > >
> > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > into the check checkpoint. I'll try your exercisers if it could hit
> > > > > that.
> > > > >
> > > >
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm not
> > > > reproducing much difference with or without the final patch.
> > >
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels? I /think/ that's the point of this
> > > patch.
> > >
> >
> > I don't follow.. growfs always forced the log via the sync transaction,
> > right? Or do you mean something else by "change to flush the log?"
>
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
>
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
>
> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog the
> > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > the theory is that prevents recovery from seeing updates to different
> > buffers that depend on the geometry update before the actual sb geometry
> > update is recovered (because the latter might have been relogged).
> >
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > Either way I think patch could use a more detailed commit log...
>
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
>
> > > > Either way,
> > > > I see aborts and splats all over the place, which implies at minimum
> > > > this isn't the only issue here.
> > >
> > > Ugh. I've recently noticed the long soak logrecovery test vm have seen
> > > a slight tick up in failure rates -- random blocks that have clearly had
> > > garbage written to them such that recovery tries to read the block to
> > > recover a buffer log item and kaboom. At this point it's unclear if
> > > that's a problem with xfs or somewhere else. :(
> > >
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and 3. we do
> > >
> > > I'm curious, what might break? Was that merely a general comment, or do
> > > you have something specific in mind? (iows: do you see more string to
> > > pull? :))
> > >
> >
> > Just a general comment..
> >
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where for example instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure it
> > lands in a new checkpoint (as opposed to ensuring followon updates land
> > in a new checkpoint)..?
>
> I think so. Might we want to do that before and after to be careful?
>
> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> >
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount failure
> > and I hacked up a more general test that also reproduces it and
> > additional growfs recovery problems. This test does seem to confirm that
> > the previous patches address the mount failure issue, but this patch
> > doesn't seem to prevent any of the other problems produced by the
> > generic test. That might just mean the test doesn't reproduce what this
> > fixes, but it's kind of hard to at least regression test something like
> > this when basic growfs crash-recovery seems pretty much broken.
>
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups? Let's see what happens if I do that...
Ahahaha awesome it corrupts the filesystem:
_check_xfs_filesystem: filesystem on /dev/sdf is inconsistent (r)
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- generate realtime summary info and bitmap...
sb_frextents 389, counted 10329
discrepancy in summary (0) at dblock 0x0 words 0x3f-0x3f/0x400
discrepancy in summary (0) at dblock 0x0 words 0x44-0x44/0x400
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
--D
--- /dev/null
+++ b/tests/xfs/610
@@ -0,0 +1,102 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2000-2004 Silicon Graphics, Inc. All Rights Reserved.
+#
+# FS QA Test No. 610
+#
+# XFS online growfs-while-allocating tests (rt subvol variant)
+#
+. ./common/preamble
+_begin_fstest growfs ioctl prealloc auto stress
+
+# Import common functions.
+. ./common/filter
+
+_create_scratch()
+{
+ _scratch_mkfs_xfs "$@" >> $seqres.full
+
+ if ! _try_scratch_mount 2>/dev/null
+ then
+ echo "failed to mount $SCRATCH_DEV"
+ exit 1
+ fi
+
+ _xfs_force_bdev realtime $SCRATCH_MNT &> /dev/null
+
+ # fix the reserve block pool to a known size so that the enospc
+ # calculations work out correctly.
+ _scratch_resvblks 1024 > /dev/null 2>&1
+}
+
+_fill_scratch()
+{
+ $XFS_IO_PROG -f -c "resvsp 0 ${1}" $SCRATCH_MNT/resvfile
+}
+
+_stress_scratch()
+{
+ procs=3
+ nops=1000
+ # -w ensures that the only ops are ones which cause write I/O
+ FSSTRESS_ARGS=`_scale_fsstress_args -d $SCRATCH_MNT -w -p $procs \
+ -n $nops $FSSTRESS_AVOID`
+ $FSSTRESS_PROG $FSSTRESS_ARGS >> $seqres.full 2>&1 &
+}
+
+_require_realtime
+_require_scratch
+_require_xfs_io_command "falloc"
+
+_scratch_mkfs_xfs | tee -a $seqres.full | _filter_mkfs 2>$tmp.mkfs
+. $tmp.mkfs # extract blocksize and data size for scratch device
+
+endsize=`expr 550 \* 1048576` # stop after growing this big
+incsize=`expr 42 \* 1048576` # grow in chunks of this size
+modsize=`expr 4 \* $incsize` # pause after this many increments
+
+[ `expr $endsize / $dbsize` -lt $dblocks ] || _notrun "Scratch device too small"
+
+size=`expr 125 \* 1048576` # 120 megabytes initially
+sizeb=`expr $size / $dbsize` # in data blocks
+logblks=$(_scratch_find_xfs_min_logblocks -rsize=${size})
+_create_scratch -lsize=${logblks}b -rsize=${size}
+
+for i in `seq 125 -1 90`; do
+ fillsize=`expr $i \* 1048576`
+ out="$(_fill_scratch $fillsize 2>&1)"
+ echo "$out" | grep -q 'No space left on device' && continue
+ test -n "${out}" && echo "$out"
+ break
+done
+
+#
+# Grow the filesystem while actively stressing it...
+# Kick off more stress threads on each iteration, grow; repeat.
+#
+while [ $size -le $endsize ]; do
+ echo "*** stressing a ${sizeb} block filesystem" >> $seqres.full
+ _stress_scratch
+ size=`expr $size + $incsize`
+ sizeb=`expr $size / $dbsize` # in data blocks
+ echo "*** growing to a ${sizeb} block filesystem" >> $seqres.full
+ xfs_growfs -R ${sizeb} $SCRATCH_MNT >> $seqres.full
+ echo AGCOUNT=$agcount >> $seqres.full
+ echo >> $seqres.full
+
+ sleep $((RANDOM % 3))
+ _scratch_shutdown
+ ps -e | grep fsstress > /dev/null 2>&1
+ while [ $? -eq 0 ]; do
+ killall -9 fsstress > /dev/null 2>&1
+ wait > /dev/null 2>&1
+ ps -e | grep fsstress > /dev/null 2>&1
+ done
+ _scratch_cycle_mount || _fail "cycle mount failed"
+done > /dev/null 2>&1
+wait # stop for any remaining stress processes
+
+_scratch_unmount
+
+status=0
+exit
--- /dev/null
+++ b/tests/xfs/610.out
@@ -0,0 +1,7 @@
+QA output created by 610
+meta-data=DDEV isize=XXX agcount=N, agsize=XXX blks
+data = bsize=XXX blocks=XXX, imaxpct=PCT
+ = sunit=XXX swidth=XXX, unwritten=X
+naming =VERN bsize=XXX
+log =LDEV bsize=XXX blocks=XXX
+realtime =RDEV extsz=XXX blocks=XXX, rtextents=XXX
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 23:29 ` Darrick J. Wong
@ 2024-10-14 5:58 ` Christoph Hellwig
2024-10-14 15:30 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2024-10-14 5:58 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Brian Foster, Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote:
> Ahahaha awesome it corrupts the filesystem:
Is this with a rtgroup file system? I can't get your test to fail
with the latest xfs staging tree.
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-14 5:58 ` Christoph Hellwig
@ 2024-10-14 15:30 ` Darrick J. Wong
0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-14 15:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Brian Foster, Chandan Babu R, linux-xfs
On Mon, Oct 14, 2024 at 07:58:50AM +0200, Christoph Hellwig wrote:
> On Fri, Oct 11, 2024 at 04:29:06PM -0700, Darrick J. Wong wrote:
> > Ahahaha awesome it corrupts the filesystem:
>
> Is this with a rtgroup file system? I can't get your test to fail
> with the latest xfs staging tree.
aha, yes, it's with rtgroups.
--D
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-11 23:12 ` Darrick J. Wong
2024-10-11 23:29 ` Darrick J. Wong
@ 2024-10-14 18:50 ` Brian Foster
2024-10-15 16:42 ` Darrick J. Wong
2024-10-21 13:38 ` Dave Chinner
1 sibling, 2 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-14 18:50 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > changes how that issue manifests.
> > > > > >
> > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > AG into the same checkpoint that adds the AG?
> > > > >
> > > > > This is based on the feedback by Dave on the previous version:
> > > > >
> > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > >
> > > >
> > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > not sure I'd go straight to this change given the situation...
> > > >
> > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > into the check checkpoint. I'll try your exercisers if it could hit
> > > > > that.
> > > > >
> > > >
> > > > Ok, that explains things a bit. My observation is that the first 5
> > > > patches or so address the mount failure problem, but from there I'm not
> > > > reproducing much difference with or without the final patch.
> > >
> > > Does this change to flush the log after committing the new sb fix the
> > > recovery problems on older kernels? I /think/ that's the point of this
> > > patch.
> > >
> >
> > I don't follow.. growfs always forced the log via the sync transaction,
> > right? Or do you mean something else by "change to flush the log?"
>
> I guess I was typing a bit too fast this morning -- "change to flush the
> log to disk before anyone else can get their hands on the superblock".
> You're right that xfs_log_sb and data-device growfs already do that.
>
> That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> that this patch fixes, right?
>
Ah, Ok.. that makes sense. Sounds like it could be..
> > I thought the main functional change of this patch was to hold the
> > superblock buffer locked across the force so nothing else can relog the
> > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > the theory is that prevents recovery from seeing updates to different
> > buffers that depend on the geometry update before the actual sb geometry
> > update is recovered (because the latter might have been relogged).
> >
> > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > Either way I think patch could use a more detailed commit log...
>
> <nod> The commit message should point out that we're fixing a real bug
> here, which is that growfsrt doesn't force the log to disk when it
> commits the new rt geometry.
>
Maybe even make it a separate patch to pull apart some of these cleanups
from fixes. I was also wondering if the whole locking change is the
moral equivalent of locking the sb across the growfs trans (i.e.
trans_getsb() + trans_bhold()), at which point maybe that would be a
reasonable incremental patch too.
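For reference, the bhold pattern being alluded to would look roughly like
the following. This is a sketch only, not the actual patch: the helper
names are real XFS transaction APIs, but the signatures are approximated
and all error handling and surrounding context are elided.

```c
/*
 * Sketch: hold the superblock buffer locked across the synchronous
 * growfs commit so nothing else can relog the sb into the same CIL
 * checkpoint before the new geometry reaches stable storage.
 */
struct xfs_buf	*bp = xfs_trans_getsb(tp);	/* sb buffer, now locked */

xfs_log_sb(tp);					/* log the new geometry */
xfs_trans_bhold(tp, bp);			/* keep bp locked past commit */
xfs_trans_set_sync(tp);				/* commit forces the log */
error = xfs_trans_commit(tp);

xfs_buf_relse(bp);				/* now expose the sb again */
```

Whether this is actually equivalent to the series' locking change is the
open question above.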
> > > > Either way,
> > > > I see aborts and splats all over the place, which implies at minimum
> > > > this isn't the only issue here.
> > >
> > > Ugh. I've recently noticed the long soak logrecovery test vm has seen
> > > a slight tick up in failure rates -- random blocks that have clearly had
> > > garbage written to them such that recovery tries to read the block to
> > > recover a buffer log item and kaboom. At this point it's unclear if
> > > that's a problem with xfs or somewhere else. :(
> > >
> > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > particular patch has no straightforward way to test that it fixes
> > > > something and at the same time doesn't break anything else, and 3. we do
> > >
> > > I'm curious, what might break? Was that merely a general comment, or do
> > > you have something specific in mind? (iows: do you see more string to
> > > pull? :))
> > >
> >
> > Just a general comment..
> >
> > Something related that isn't totally clear to me is what about the
> > inverse shrink situation where dblocks is reduced. I.e., is there some
> > similar scenario where for example instead of the sb buffer being
> > relogged past some other buffer update that depends on it, some other
> > change is relogged past a sb update that invalidates/removes blocks
> > referenced by the relogged buffer..? If so, does that imply a shrink
> > should flush the log before the shrink transaction commits to ensure it
> > lands in a new checkpoint (as opposed to ensuring followon updates land
> > in a new checkpoint)..?
>
> I think so. Might we want to do that before and after to be careful?
>
Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
had a log preflush to flush out already committed changes before the
grow, I don't think anything really prevents another "problematic"
transaction from committing after that preflush.
I dunno.. on one hand it does seem like an unlikely thing due to the
nature of needing space to be free in order to shrink in the first
place, but OTOH if you have something like grow that is rare, not
performance sensitive, has a history of not being well tested, and has
these subtle ordering requirements that might change indirectly due to other
transactions, ISTM it could be a wise engineering decision to simplify
to the degree possible and find the most basic model that enforces
predictable ordering.
So for a hacky thought/example, suppose we defined a transaction mode
that basically implemented an exclusive checkpoint requirement (i.e.
this transaction owns an entire checkpoint, nothing else is allowed in
the CIL concurrently). Presumably that would ensure everything before
the grow would flush out to disk in one checkpoint, everything
concurrent would block on synchronous commit of the grow trans (before
new geometry is exposed), and then after that point everything pending
would drain into another checkpoint.
It kind of sounds like overkill, but really if it could be implemented
simply enough then we wouldn't have to think too hard about auditing all
other relog scenarios. I'd probably want to see at least some reproducer
for this sort of problem to prove the theory though too, even if it
required debug instrumentation or something. Hm?
> > Anyways, my point is just that if it were me I wouldn't get too deep
> > into this until some of the reproducible growfs recovery issues are at
> > least characterized and testing is more sorted out.
> >
> > The context for testing is here [1]. The TLDR is basically that
> > Christoph has a targeted test that reproduces the initial mount failure
> > and I hacked up a more general test that also reproduces it and
> > additional growfs recovery problems. This test does seem to confirm that
> > the previous patches address the mount failure issue, but this patch
> > doesn't seem to prevent any of the other problems produced by the
> > generic test. That might just mean the test doesn't reproduce what this
> > fixes, but it's kind of hard to at least regression test something like
> > this when basic growfs crash-recovery seems pretty much broken.
>
> Hmm, if you make a variant of that test which formats with an rt device
> and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> similar blowups? Let's see what happens if I do that...
>
Heh, sounds like so from your followup. Fun times.
I guess that test should probably work its way upstream. I made some
tweaks locally since last posted to try and make it a little more
aggressive, but it didn't repro anything new so not sure how much
difference it makes really. Do we want a separate version like yours for
the rt case or would you expect to cover both cases in a single test?
Brian
> --D
>
> > Brian
> >
> > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> >
> > > > have at least one fairly straightforward growfs/recovery test in the
> > > > works that reliably explodes, personally I'd suggest to split this work
> > > > off into separate series.
> > > >
> > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > fully cleaned up, and then leave the next two as part of a followon
> > > > series pending further investigation into these other issues. As part of
> > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > regress it." But once again, just my .02.
> > >
> > > Yeah, it's too bad there's no good way to test recovery with older
> > > kernels either. :(
> > >
> > > --D
> > >
> > > > Brian
> > > >
> > > >
> > >
> >
> >
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-14 18:50 ` Brian Foster
@ 2024-10-15 16:42 ` Darrick J. Wong
2024-10-18 12:27 ` Brian Foster
2024-10-21 13:38 ` Dave Chinner
1 sibling, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-15 16:42 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > changes how that issue manifests.
> > > > > > >
> > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > AG into the same checkpoint that adds the AG?
> > > > > >
> > > > > > This is based on the feedback by Dave on the previous version:
> > > > > >
> > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > >
> > > > >
> > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > not sure I'd go straight to this change given the situation...
> > > > >
> > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > into the same checkpoint. I'll try your exercisers to see if they
> > > > > > can hit that.
> > > > > >
> > > > >
> > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > reproducing much difference with or without the final patch.
> > > >
> > > > Does this change to flush the log after committing the new sb fix the
> > > > recovery problems on older kernels? I /think/ that's the point of this
> > > > patch.
> > > >
> > >
> > > I don't follow.. growfs always forced the log via the sync transaction,
> > > right? Or do you mean something else by "change to flush the log?"
> >
> > I guess I was typing a bit too fast this morning -- "change to flush the
> > log to disk before anyone else can get their hands on the superblock".
> > You're right that xfs_log_sb and data-device growfs already do that.
> >
> > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > that this patch fixes, right?
> >
>
> Ah, Ok.. that makes sense. Sounds like it could be..
Yeah. Hey Christoph, would you mind prepending a minimal fix patch that
calls xfs_trans_set_sync in growfsrt before this one that refactors the
existing growfs/sb updates?
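For concreteness, the minimal fix being asked for would be on the order of
the sketch below; the real growfs-rt commit path has more setup around it,
which is elided here.

```c
/*
 * Sketch: in the growfs-rt path, before committing the transaction
 * that logs the new rt geometry.
 */
xfs_trans_set_sync(tp);		/* make the commit force the log, as the
				 * data-device growfs path already does */
error = xfs_trans_commit(tp);
```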
> > > I thought the main functional change of this patch was to hold the
> > > superblock buffer locked across the force so nothing else can relog the
> > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > the theory is that prevents recovery from seeing updates to different
> > > buffers that depend on the geometry update before the actual sb geometry
> > > update is recovered (because the latter might have been relogged).
> > >
> > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > Either way I think this patch could use a more detailed commit log...
> >
> > <nod> The commit message should point out that we're fixing a real bug
> > here, which is that growfsrt doesn't force the log to disk when it
> > commits the new rt geometry.
> >
>
> Maybe even make it a separate patch to pull apart some of these cleanups
> from fixes. I was also wondering if the whole locking change is the
> moral equivalent of locking the sb across the growfs trans (i.e.
> trans_getsb() + trans_bhold()), at which point maybe that would be a
> reasonable incremental patch too.
>
> > > > > Either way,
> > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > this isn't the only issue here.
> > > >
> > > > > Ugh. I've recently noticed the long soak logrecovery test vm has seen
> > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > garbage written to them such that recovery tries to read the block to
> > > > recover a buffer log item and kaboom. At this point it's unclear if
> > > > that's a problem with xfs or somewhere else. :(
> > > >
> > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > particular patch has no straightforward way to test that it fixes
> > > > > something and at the same time doesn't break anything else, and 3. we do
> > > >
> > > > I'm curious, what might break? Was that merely a general comment, or do
> > > > you have something specific in mind? (iows: do you see more string to
> > > > pull? :))
> > > >
> > >
> > > Just a general comment..
> > >
> > > Something related that isn't totally clear to me is what about the
> > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > similar scenario where for example instead of the sb buffer being
> > > relogged past some other buffer update that depends on it, some other
> > > change is relogged past a sb update that invalidates/removes blocks
> > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > should flush the log before the shrink transaction commits to ensure it
> > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > in a new checkpoint)..?
> >
> > I think so. Might we want to do that before and after to be careful?
> >
>
> Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> had a log preflush to flush out already committed changes before the
> grow, I don't think anything really prevents another "problematic"
> transaction from committing after that preflush.
Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
you?
> I dunno.. on one hand it does seem like an unlikely thing due to the
> nature of needing space to be free in order to shrink in the first
> place, but OTOH if you have something like grow that is rare, not
> performance sensitive, has a history of not being well tested, and has
> these subtle ordering requirements that might change indirectly due to other
> transactions, ISTM it could be a wise engineering decision to simplify
> to the degree possible and find the most basic model that enforces
> predictable ordering.
>
> So for a hacky thought/example, suppose we defined a transaction mode
> that basically implemented an exclusive checkpoint requirement (i.e.
> this transaction owns an entire checkpoint, nothing else is allowed in
> the CIL concurrently). Presumably that would ensure everything before
> the grow would flush out to disk in one checkpoint, everything
> concurrent would block on synchronous commit of the grow trans (before
> new geometry is exposed), and then after that point everything pending
> would drain into another checkpoint.
>
> It kind of sounds like overkill, but really if it could be implemented
> simply enough then we wouldn't have to think too hard about auditing all
> other relog scenarios. I'd probably want to see at least some reproducer
> for this sort of problem to prove the theory though too, even if it
> required debug instrumentation or something. Hm?
What if we redefined the input requirements to shrink? Let's say we
require that the fd argument to a shrink ioctl is actually an unlinkable
O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can
force the log without holding any locks, and the shrink transaction can
remove the bmap and rmap records at the same time that it updates the sb
geometry. The otherwise inaccessible file means that nobody can reuse
that space between the log force and the sb update.
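A hypothetical userspace flow for that interface might look like the
following. Every name here (the ioctl, the request structure, the helper
step) is invented purely for illustration; no such interface exists today,
and how the end-of-filesystem blocks get mapped into the tmpfile is the
open design question.

```c
/* All XFS_IOC_SHRINKFS / xfs_shrink_req names below are hypothetical. */
int fd = open("/mnt/scratch", O_TMPFILE | O_RDWR, 0600); /* unlinked file */

/* step 1 (interface TBD): map the blocks beyond the new EOFS into the
 * tmpfile so nobody else can allocate them in the meantime */

/* step 2: shrink, passing the tmpfile as proof the space is fenced off */
struct xfs_shrink_req req = { .newblocks = new_dblocks }; /* hypothetical */
ioctl(fd, XFS_IOC_SHRINKFS, &req);                        /* hypothetical */

close(fd);	/* the tmpfile was never linked, so it simply goes away */
```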
> > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > into this until some of the reproducible growfs recovery issues are at
> > > least characterized and testing is more sorted out.
> > >
> > > The context for testing is here [1]. The TLDR is basically that
> > > Christoph has a targeted test that reproduces the initial mount failure
> > > and I hacked up a more general test that also reproduces it and
> > > additional growfs recovery problems. This test does seem to confirm that
> > > the previous patches address the mount failure issue, but this patch
> > > doesn't seem to prevent any of the other problems produced by the
> > > generic test. That might just mean the test doesn't reproduce what this
> > > fixes, but it's kind of hard to at least regression test something like
> > > this when basic growfs crash-recovery seems pretty much broken.
> >
> > Hmm, if you make a variant of that test which formats with an rt device
> > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > similar blowups? Let's see what happens if I do that...
> >
>
> Heh, sounds like so from your followup. Fun times.
>
> I guess that test should probably work its way upstream. I made some
> tweaks locally since last posted to try and make it a little more
> aggressive, but it didn't repro anything new so not sure how much
> difference it makes really. Do we want a separate version like yours for
> the rt case or would you expect to cover both cases in a single test?
This probably should be different tests, because rt is its own very
weird animal.
--D
> Brian
>
> > --D
> >
> > > Brian
> > >
> > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > >
> > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > off into separate series.
> > > > >
> > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > series pending further investigation into these other issues. As part of
> > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > regress it." But once again, just my .02.
> > > >
> > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > kernels either. :(
> > > >
> > > > --D
> > > >
> > > > > Brian
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-15 16:42 ` Darrick J. Wong
@ 2024-10-18 12:27 ` Brian Foster
2024-10-21 16:59 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-18 12:27 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > changes how that issue manifests.
> > > > > > > >
> > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > >
> > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > >
> > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > >
> > > > > >
> > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > not sure I'd go straight to this change given the situation...
> > > > > >
> > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > into the same checkpoint. I'll try your exercisers to see if they
> > > > > > > can hit that.
> > > > > > >
> > > > > >
> > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > reproducing much difference with or without the final patch.
> > > > >
> > > > > Does this change to flush the log after committing the new sb fix the
> > > > > recovery problems on older kernels? I /think/ that's the point of this
> > > > > patch.
> > > > >
> > > >
> > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > right? Or do you mean something else by "change to flush the log?"
> > >
> > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > log to disk before anyone else can get their hands on the superblock".
> > > You're right that xfs_log_sb and data-device growfs already do that.
> > >
> > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > that this patch fixes, right?
> > >
> >
> > Ah, Ok.. that makes sense. Sounds like it could be..
>
> Yeah. Hey Christoph, would you mind prepending a minimal fix patch that
> calls xfs_trans_set_sync in growfsrt before this one that refactors the
> existing growfs/sb updates?
>
> > > > I thought the main functional change of this patch was to hold the
> > > > superblock buffer locked across the force so nothing else can relog the
> > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > the theory is that prevents recovery from seeing updates to different
> > > > buffers that depend on the geometry update before the actual sb geometry
> > > > update is recovered (because the latter might have been relogged).
> > > >
> > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > Either way I think this patch could use a more detailed commit log...
> > >
> > > <nod> The commit message should point out that we're fixing a real bug
> > > here, which is that growfsrt doesn't force the log to disk when it
> > > commits the new rt geometry.
> > >
> >
> > Maybe even make it a separate patch to pull apart some of these cleanups
> > from fixes. I was also wondering if the whole locking change is the
> > moral equivalent of locking the sb across the growfs trans (i.e.
> > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > reasonable incremental patch too.
> >
> > > > > > Either way,
> > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > this isn't the only issue here.
> > > > >
> > > > > Ugh. I've recently noticed the long soak logrecovery test vm has seen
> > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > garbage written to them such that recovery tries to read the block to
> > > > > recover a buffer log item and kaboom. At this point it's unclear if
> > > > > that's a problem with xfs or somewhere else. :(
> > > > >
> > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > >
> > > > > I'm curious, what might break? Was that merely a general comment, or do
> > > > > you have something specific in mind? (iows: do you see more string to
> > > > > pull? :))
> > > > >
> > > >
> > > > Just a general comment..
> > > >
> > > > Something related that isn't totally clear to me is what about the
> > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > similar scenario where for example instead of the sb buffer being
> > > > relogged past some other buffer update that depends on it, some other
> > > > change is relogged past a sb update that invalidates/removes blocks
> > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > should flush the log before the shrink transaction commits to ensure it
> > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > in a new checkpoint)..?
> > >
> > > I think so. Might we want to do that before and after to be careful?
> > >
> >
> > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > had a log preflush to flush out already committed changes before the
> > grow, I don't think anything really prevents another "problematic"
> > transaction from committing after that preflush.
>
> Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> you?
>
I guess it depends on how far into the weeds we want to get. I'm not
necessarily sure that anything exists today that is definitely
problematic wrt shrink. That would probably warrant an audit of
transactions or some other high level analysis to disprove. More thought
needed.
Short of the latter, I'm more thinking about the question "is there some
new thing we could add years down the line that 1. adds something to the
log that could conflict and 2. could be reordered past a shrink
transaction in a problematic way?" If the answer to that is open ended
and some such thing does come along, I think it's highly likely this
would just break growfs logging again until somebody trips over it in
the field.
> > I dunno.. on one hand it does seem like an unlikely thing due to the
> > nature of needing space to be free in order to shrink in the first
> > place, but OTOH if you have something like grow that is rare, not
> > performance sensitive, has a history of not being well tested, and has
> > these subtle ordering requirements that might change indirectly due to other
> > transactions, ISTM it could be a wise engineering decision to simplify
> > to the degree possible and find the most basic model that enforces
> > predictable ordering.
> >
> > So for a hacky thought/example, suppose we defined a transaction mode
> > that basically implemented an exclusive checkpoint requirement (i.e.
> > this transaction owns an entire checkpoint, nothing else is allowed in
> > the CIL concurrently). Presumably that would ensure everything before
> > the grow would flush out to disk in one checkpoint, everything
> > concurrent would block on synchronous commit of the grow trans (before
> > new geometry is exposed), and then after that point everything pending
> > would drain into another checkpoint.
> >
> > It kind of sounds like overkill, but really if it could be implemented
> > simply enough then we wouldn't have to think too hard about auditing all
> > other relog scenarios. I'd probably want to see at least some reproducer
> > for this sort of problem to prove the theory though too, even if it
> > required debug instrumentation or something. Hm?
>
> What if we redefined the input requirements to shrink? Let's say we
> require that the fd argument to a shrink ioctl is actually an unlinkable
> O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can
> force the log without holding any locks, and the shrink transaction can
> remove the bmap and rmap records at the same time that it updates the sb
> geometry. The otherwise inaccessible file means that nobody can reuse
> that space between the log force and the sb update.
>
Interesting thought. It kind of sounds like how shrink already works to
some degree, right? I.e. the kernel side allocs the blocks out of the
btrees and tosses them, just no inode in the mix?
Honestly I'd probably need to stare at this code and think about it and
work through some scenarios to quantify how much of a concern this
really is, and I don't really have the bandwidth for that just now. I
mainly wanted to raise the notion that if we're assessing high level log
ordering requirements for growfs, we should consider the shrink case as
well.
> > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > into this until some of the reproducible growfs recovery issues are at
> > > > least characterized and testing is more sorted out.
> > > >
> > > > The context for testing is here [1]. The TLDR is basically that
> > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > and I hacked up a more general test that also reproduces it and
> > > > additional growfs recovery problems. This test does seem to confirm that
> > > > the previous patches address the mount failure issue, but this patch
> > > > doesn't seem to prevent any of the other problems produced by the
> > > > generic test. That might just mean the test doesn't reproduce what this
> > > > fixes, but it's kind of hard to at least regression test something like
> > > > this when basic growfs crash-recovery seems pretty much broken.
> > >
> > > Hmm, if you make a variant of that test which formats with an rt device
> > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > similar blowups? Let's see what happens if I do that...
> > >
> >
> > Heh, sounds like so from your followup. Fun times.
> >
> > I guess that test should probably work its way upstream. I made some
> > tweaks locally since last posted to try and make it a little more
> > aggressive, but it didn't repro anything new so not sure how much
> > difference it makes really. Do we want a separate version like yours for
> > the rt case or would you expect to cover both cases in a single test?
>
> This probably should be different tests, because rt is its own very
> weird animal.
>
Posted a couple tests the other day, JFYI.
Brian
> --D
>
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > >
> > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > off into separate series.
> > > > > >
> > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > series pending further investigation into these other issues. As part of
> > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > regress it." But once again, just my .02.
> > > > >
> > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > kernels either. :(
> > > > >
> > > > > --D
> > > > >
> > > > > > Brian
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-14 18:50 ` Brian Foster
2024-10-15 16:42 ` Darrick J. Wong
@ 2024-10-21 13:38 ` Dave Chinner
2024-10-23 15:06 ` Brian Foster
1 sibling, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2024-10-21 13:38 UTC (permalink / raw)
To: Brian Foster
Cc: Darrick J. Wong, Christoph Hellwig, Chandan Babu R, linux-xfs
On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> So for a hacky thought/example, suppose we defined a transaction mode
> that basically implemented an exclusive checkpoint requirement (i.e.
> this transaction owns an entire checkpoint, nothing else is allowed in
> the CIL concurrently).
Transactions know nothing about the CIL, nor should they. The CIL
also has no place in ordering transactions - it's purely an
aggregation mechanism that flushes committed transactions to stable
storage when it is told to. i.e. when a log force is issued.
A globally serialised transaction requires ordering at the
transaction allocation/reservation level, not at the CIL. i.e. it is
essentially the same ordering problem as serialising against
untracked DIO on the inode before we can run a truncate (lock,
drain, do operation, unlock).
> Presumably that would ensure everything before
> the grow would flush out to disk in one checkpoint, everything
> concurrent would block on synchronous commit of the grow trans (before
> new geometry is exposed), and then after that point everything pending
> would drain into another checkpoint.
Yup, that's high level transaction level ordering and really has
nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the
only ordering it does is to preserve transaction commit ordering
down to the journal.
> It kind of sounds like overkill, but really if it could be implemented
> simply enough then we wouldn't have to think too hard about auditing all
> other relog scenarios. I'd probably want to see at least some reproducer
> for this sort of problem to prove the theory though too, even if it
> required debug instrumentation or something. Hm?
It's relatively straightforward to do, but it seems like total
overkill for growfs, as growfs only requires ordering
between the change of size and new allocations. We can do that by
not exposing the new space until after the superblock has been
modified on stable storage in the case of grow.
In the case of shrink, globally serialising the growfs
transaction won't actually do anything useful because we
have to deny access to the free space we are removing before we
even start the shrink transaction. Hence we need allocation vs
shrink co-ordination before we shrink the superblock space, not a
globally serialised size modification transaction...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-18 12:27 ` Brian Foster
@ 2024-10-21 16:59 ` Darrick J. Wong
2024-10-23 14:45 ` Brian Foster
0 siblings, 1 reply; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-21 16:59 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > changes how that issue manifests.
> > > > > > > > >
> > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > >
> > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > >
> > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > >
> > > > > > >
> > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > >
> > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > into the same checkpoint. I'll try your exercisers if it could hit
> > > > > > > > that.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > reproducing much difference with or without the final patch.
> > > > > >
> > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > recovery problems on older kernels? I /think/ that's the point of this
> > > > > > patch.
> > > > > >
> > > > >
> > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > right? Or do you mean something else by "change to flush the log?"
> > > >
> > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > log to disk before anyone else can get their hands on the superblock".
> > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > >
> > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > that this patch fixes, right?
> > > >
> > >
> > > Ah, Ok.. that makes sense. Sounds like it could be..
> >
> > Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to
> > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > existing growfs/sb updates?
> >
> > > > > I thought the main functional change of this patch was to hold the
> > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > the theory is that prevents recovery from seeing updates to different
> > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > update is recovered (because the latter might have been relogged).
> > > > >
> > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > Either way I think patch could use a more detailed commit log...
> > > >
> > > > <nod> The commit message should point out that we're fixing a real bug
> > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > commits the new rt geometry.
> > > >
> > >
> > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > from fixes. I was also wondering if the whole locking change is the
> > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > reasonable incremental patch too.
> > >
> > > > > > > Either way,
> > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > this isn't the only issue here.
> > > > > >
> > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen
> > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > recover a buffer log item and kaboom. At this point it's unclear if
> > > > > > that's a problem with xfs or somewhere else. :(
> > > > > >
> > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > >
> > > > > > I'm curious, what might break? Was that merely a general comment, or do
> > > > > > you have something specific in mind? (iows: do you see more string to
> > > > > > pull? :))
> > > > > >
> > > > >
> > > > > Just a general comment..
> > > > >
> > > > > Something related that isn't totally clear to me is what about the
> > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > similar scenario where for example instead of the sb buffer being
> > > > > relogged past some other buffer update that depends on it, some other
> > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > in a new checkpoint)..?
> > > >
> > > > I think so. Might we want to do that before and after to be careful?
> > > >
> > >
> > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > had a log preflush to flush out already committed changes before the
> > > grow, I don't think anything really prevents another "problematic"
> > > transaction from committing after that preflush.
> >
> > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > you?
> >
>
> I guess it depends on how far into the weeds we want to get. I'm not
> necessarily sure that anything exists today that is definitely
> problematic wrt shrink. That would probably warrant an audit of
> transactions or some other high level analysis to disprove. More thought
> needed.
<nod> I think there isn't a problem with shrink because the shrink
transaction itself must be able to find the space, which means that
there cannot be any files or unfinished deferred ops pointing to that
space.
> Short of the latter, I'm more thinking about the question "is there some
> new thing we could add years down the line that 1. adds something to the
> log that could conflict and 2. could be reordered past a shrink
> transaction in a problematic way?" If the answer to that is open ended
> and some such thing does come along, I think it's highly likely this
> would just break growfs logging again until somebody trips over it in
> the field.
Good thing we have a couple of tests now? :)
> > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > nature of needing space to be free in order to shrink in the first
> > > place, but OTOH if you have something like grow that is rare, not
> > > performance sensitive, has a history of not being well tested, and has
> > > these subtle ordering requirements that might change indirectly to other
> > > transactions, ISTM it could be a wise engineering decision to simplify
> > > to the degree possible and find the most basic model that enforces
> > > predictable ordering.
> > >
> > > So for a hacky thought/example, suppose we defined a transaction mode
> > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > the CIL concurrently). Presumably that would ensure everything before
> > > the grow would flush out to disk in one checkpoint, everything
> > > concurrent would block on synchronous commit of the grow trans (before
> > > new geometry is exposed), and then after that point everything pending
> > > would drain into another checkpoint.
> > >
> > > It kind of sounds like overkill, but really if it could be implemented
> > > simply enough then we wouldn't have to think too hard about auditing all
> > > other relog scenarios. I'd probably want to see at least some reproducer
> > > for this sort of problem to prove the theory though too, even if it
> > > required debug instrumentation or something. Hm?
> >
What if we redefined the input requirements to shrink? Let's say we
> > require that the fd argument to a shrink ioctl is actually an unlinkable
> > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can
> > force the log without holding any locks, and the shrink transaction can
> > remove the bmap and rmap records at the same time that it updates the sb
> > geometry. The otherwise inaccessible file means that nobody can reuse
> > that space between the log force and the sb update.
> >
>
> Interesting thought. It kind of sounds like how shrink already works to
> some degree, right? I.e. the kernel side allocs the blocks out of the
> btrees and tosses them, just no inode in the mix?
Right.
> Honestly I'd probably need to stare at this code and think about it and
> work through some scenarios to quantify how much of a concern this
> really is, and I don't really have the bandwidth for that just now. I
> mainly wanted to raise the notion that if we're assessing high level log
> ordering requirements for growfs, we should consider the shrink case as
> well.
<nod>
> > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > least characterized and testing is more sorted out.
> > > > >
> > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > and I hacked up a more general test that also reproduces it and
> > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > the previous patches address the mount failure issue, but this patch
> > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > >
> > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > similar blowups? Let's see what happens if I do that...
> > > >
> > >
> > > Heh, sounds like so from your followup. Fun times.
> > >
> > > I guess that test should probably work its way upstream. I made some
> > > tweaks locally since last posted to try and make it a little more
> > > aggressive, but it didn't repro anything new so not sure how much
> > > difference it makes really. Do we want a separate version like yours for
> > > the rt case or would you expect to cover both cases in a single test?
> >
> > This probably should be different tests, because rt is its own very
> > weird animal.
> >
>
> Posted a couple tests the other day, JFYI.
>
> Brian
>
> > --D
> >
> > > Brian
> > >
> > > > --D
> > > >
> > > > > Brian
> > > > >
> > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > >
> > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > off into separate series.
> > > > > > >
> > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > regress it." But once again, just my .02.
> > > > > >
> > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > kernels either. :(
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > > Brian
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-21 16:59 ` Darrick J. Wong
@ 2024-10-23 14:45 ` Brian Foster
2024-10-24 18:02 ` Darrick J. Wong
0 siblings, 1 reply; 44+ messages in thread
From: Brian Foster @ 2024-10-23 14:45 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > > Ok, so we don't want geometry changes transactions in the same CIL
> > > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > > changes how that issue manifests.
> > > > > > > > > >
> > > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > > >
> > > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > > >
> > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > > >
> > > > > > > >
> > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > > >
> > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > > into the same checkpoint. I'll try your exercisers if it could hit
> > > > > > > > > that.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > > reproducing much difference with or without the final patch.
> > > > > > >
> > > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > > recovery problems on older kernels? I /think/ that's the point of this
> > > > > > > patch.
> > > > > > >
> > > > > >
> > > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > > right? Or do you mean something else by "change to flush the log?"
> > > > >
> > > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > > log to disk before anyone else can get their hands on the superblock".
> > > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > > >
> > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > > that this patch fixes, right?
> > > > >
> > > >
> > > > Ah, Ok.. that makes sense. Sounds like it could be..
> > >
> > > Yeah. Hey Christoph, would you mind pre-pending a minimal fixpatch to
> > > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > > existing growfs/sb updates?
> > >
> > > > > > I thought the main functional change of this patch was to hold the
> > > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > > the theory is that prevents recovery from seeing updates to different
> > > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > > update is recovered (because the latter might have been relogged).
> > > > > >
> > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > > Either way I think patch could use a more detailed commit log...
> > > > >
> > > > > <nod> The commit message should point out that we're fixing a real bug
> > > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > > commits the new rt geometry.
> > > > >
> > > >
> > > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > > from fixes. I was also wondering if the whole locking change is the
> > > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > > reasonable incremental patch too.
> > > >
> > > > > > > > Either way,
> > > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > > this isn't the only issue here.
> > > > > > >
> > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen
> > > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > > recover a buffer log item and kaboom. At this point it's unclear if
> > > > > > > that's a problem with xfs or somewhere else. :(
> > > > > > >
> > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > > >
> > > > > > > I'm curious, what might break? Was that merely a general comment, or do
> > > > > > > you have something specific in mind? (iows: do you see more string to
> > > > > > > pull? :))
> > > > > > >
> > > > > >
> > > > > > Just a general comment..
> > > > > >
> > > > > > Something related that isn't totally clear to me is what about the
> > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > > similar scenario where for example instead of the sb buffer being
> > > > > > relogged past some other buffer update that depends on it, some other
> > > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > > in a new checkpoint)..?
> > > > >
> > > > > I think so. Might we want to do that before and after to be careful?
> > > > >
> > > >
> > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > > had a log preflush to flush out already committed changes before the
> > > > grow, I don't think anything really prevents another "problematic"
> > > > transaction from committing after that preflush.
> > >
> > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > > you?
> > >
> >
> > I guess it depends on how far into the weeds we want to get. I'm not
> > necessarily sure that anything exists today that is definitely
> > problematic wrt shrink. That would probably warrant an audit of
> > transactions or some other high level analysis to disprove. More thought
> > needed.
>
> <nod> I think there isn't a problem with shrink because the shrink
> transaction itself must be able to find the space, which means that
> there cannot be any files or unfinished deferred ops pointing to that
> space.
>
Ok, that makes sense to me. On poking through the code, one thing that
it looks like it misses is the case where the allocation fails purely
due to extents being busy (i.e. even if the blocks are free).
That doesn't seem harmful, just perhaps a little odd that a shrink might
fail and then succeed after some period of time for the log tail to push
with no other visible changes. Unless I'm missing something, it might be
worth adding the log force there (that I think you suggested earlier).
All that said, I'm not so sure this will all apply the same if shrink
grows to a more active implementation. For example, I suspect an
implementation that learns to quiesce/truncate full AGs isn't going to
necessarily need to allocate all of the blocks out of the AG free space
trees. But of course, this is all vaporware anyways. :)
> > Short of the latter, I'm more thinking about the question "is there some
> > new thing we could add years down the line that 1. adds something to the
> > log that could conflict and 2. could be reordered past a shrink
> > transaction in a problematic way?" If the answer to that is open ended
> > and some such thing does come along, I think it's highly likely this
> > would just break growfs logging again until somebody trips over it in
> > the field.
>
> Good thing we have a couple of tests now? :)
>
Not with good shrink support, unfortunately. :/ I tried adding basic
shrink calls into the proposed tests, but I wasn't able to see them
actually do anything, presumably because of the stress workload and
smallish filesystem keeping those blocks in use. It might require a more
active shrink implementation before we could support this kind of test,
or otherwise maybe something more limited/targeted than fsstress.
Hmm.. a random, off the top of my head idea might be a debug knob that
artificially restricts all allocations beyond a certain disk offset...
Brian
> > > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > > nature of needing space to be free in order to shrink in the first
> > > > place, but OTOH if you have something like grow that is rare, not
> > > > performance sensitive, has a history of not being well tested, and has
> > > > these subtle ordering requirements that might change indirectly to other
> > > > transactions, ISTM it could be a wise engineering decision to simplify
> > > > to the degree possible and find the most basic model that enforces
> > > > predictable ordering.
> > > >
> > > > So for a hacky thought/example, suppose we defined a transaction mode
> > > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > > the CIL concurrently). Presumably that would ensure everything before
> > > > the grow would flush out to disk in one checkpoint, everything
> > > > concurrent would block on synchronous commit of the grow trans (before
> > > > new geometry is exposed), and then after that point everything pending
> > > > would drain into another checkpoint.
> > > >
> > > > It kind of sounds like overkill, but really if it could be implemented
> > > > simply enough then we wouldn't have to think too hard about auditing all
> > > > other relog scenarios. I'd probably want to see at least some reproducer
> > > > for this sort of problem to prove the theory though too, even if it
> > > > required debug instrumentation or something. Hm?
> > >
> What if we redefined the input requirements to shrink? Let's say we
> > > require that the fd argument to a shrink ioctl is actually an unlinkable
> > > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can
> > > force the log without holding any locks, and the shrink transaction can
> > > remove the bmap and rmap records at the same time that it updates the sb
> > > geometry. The otherwise inaccessible file means that nobody can reuse
> > > that space between the log force and the sb update.
> > >
> >
> > Interesting thought. It kind of sounds like how shrink already works to
> > some degree, right? I.e. the kernel side allocs the blocks out of the
> > btrees and tosses them, just no inode in the mix?
>
> Right.
>
> > Honestly I'd probably need to stare at this code and think about it and
> > work through some scenarios to quantify how much of a concern this
> > really is, and I don't really have the bandwidth for that just now. I
> > mainly wanted to raise the notion that if we're assessing high level log
> > ordering requirements for growfs, we should consider the shrink case as
> > well.
>
> <nod>
>
> > > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > > least characterized and testing is more sorted out.
> > > > > >
> > > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > > and I hacked up a more general test that also reproduces it and
> > > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > > the previous patches address the mount failure issue, but this patch
> > > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > > >
> > > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > > similar blowups? Let's see what happens if I do that...
> > > > >
> > > >
> > > > Heh, sounds like so from your followup. Fun times.
> > > >
> > > > I guess that test should probably work its way upstream. I made some
> > > > tweaks locally since last posted to try and make it a little more
> > > > aggressive, but it didn't repro anything new so not sure how much
> > > > difference it makes really. Do we want a separate version like yours for
> > > > the rt case or would you expect to cover both cases in a single test?
> > >
> > > This probably should be different tests, because rt is its own very
> > > weird animal.
> > >
> >
> > Posted a couple tests the other day, JFYI.
> >
> > Brian
> >
> > > --D
> > >
> > > > Brian
> > > >
> > > > > --D
> > > > >
> > > > > > Brian
> > > > > >
> > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > > >
> > > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > > off into separate series.
> > > > > > > >
> > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > > regress it." But once again, just my .02.
> > > > > > >
> > > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > > kernels either. :(
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > > Brian
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-21 13:38 ` Dave Chinner
@ 2024-10-23 15:06 ` Brian Foster
0 siblings, 0 replies; 44+ messages in thread
From: Brian Foster @ 2024-10-23 15:06 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Christoph Hellwig, Chandan Babu R, linux-xfs
On Tue, Oct 22, 2024 at 12:38:23AM +1100, Dave Chinner wrote:
> On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > So for a hacky thought/example, suppose we defined a transaction mode
> > that basically implemented an exclusive checkpoint requirement (i.e.
> > this transaction owns an entire checkpoint, nothing else is allowed in
> > the CIL concurrently).
>
> Transactions know nothing about the CIL, nor should they. The CIL
> also has no place in ordering transactions - it's purely an
> aggregation mechanism that flushes committed transactions to stable
> storage when it is told to. i.e. when a log force is issued.
>
> A globally serialised transaction requires ordering at the
> transaction allocation/reservation level, not at the CIL. i.e. it is
> essentially the same ordering problem as serialising against
> untracked DIO on the inode before we can run a truncate (lock,
> drain, do operation, unlock).
>
> > Presumably that would ensure everything before
> > the grow would flush out to disk in one checkpoint, everything
> > concurrent would block on synchronous commit of the grow trans (before
> > new geometry is exposed), and then after that point everything pending
> > would drain into another checkpoint.
>
> Yup, that's high level transaction level ordering and really has
> nothing to do with the CIL. The CIL is mostly a FIFO aggregator; the
> only ordering it does is to preserve transaction commit ordering
> down to the journal.
>
> > It kind of sounds like overkill, but really if it could be implemented
> > simply enough then we wouldn't have to think too hard about auditing all
> > other relog scenarios. I'd probably want to see at least some reproducer
> > for this sort of problem to prove the theory though too, even if it
> > required debug instrumentation or something. Hm?
>
> It's relatively straight forward to do, but it seems like total
> overkill for growfs, as growfs only requires ordering
> between the change of size and new allocations. We can do that by
> not exposing the new space until after the superblock has been
> modified on stable storage in the case of grow.
>
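(So for grow, IIUC, the invariant is just that the allocator never sees
space the on-disk superblock doesn't cover yet. In userspace terms, with
made-up names standing in for the real transaction machinery:)

```c
/*
 * Hypothetical userspace model of the grow-side ordering: the size
 * change reaches stable storage before any allocator can see the new
 * space.  All names are made up for illustration.
 */
static int sb_dblocks_ondisk = 100;	/* size recorded on stable storage */
static int visible_dblocks = 100;	/* size the allocator may use */

static void commit_sync(int new_dblocks)
{
	/* stands in for a synchronous transaction commit + log force */
	sb_dblocks_ondisk = new_dblocks;
}

static void growfs(int new_dblocks)
{
	commit_sync(new_dblocks);	/* 1. size change made durable... */
	visible_dblocks = new_dblocks;	/* 2. ...only then expose it */
}
```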
> In the case of shrink, globally serialising the shrink
> transaction won't actually do anything useful because we
> have to deny access to the free space we are removing before we
> even start the shrink transaction. Hence we need allocation vs
> shrink co-ordination before we shrink the superblock space, not a
> globally serialised size modification transaction...
>
Not sure what you mean here; at least I don't see that requirement in
the current code. It looks like shrink acquires the blocks in the same
transaction as the shrink itself. If something fails, it rolls back or
returns the space, depending on what actually failed..
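To illustrate, the flow as I read it is effectively all-or-nothing
within the one transaction — a rough userspace model (made-up names and
a trivial block bitmap, obviously nothing like the real code):

```c
#include <stdbool.h>

#define NBLOCKS 16
static bool busy[NBLOCKS];	/* in-use (or busy-extent) blocks */
static int dblocks = NBLOCKS;	/* filesystem size in blocks */

/* Try to shrink by 'delta' blocks; all-or-nothing like a transaction. */
static bool shrink(int delta)
{
	int new_size = dblocks - delta;

	/* "Allocate" the tail range out of the free space trees. */
	for (int b = new_size; b < dblocks; b++) {
		if (busy[b]) {
			/* Roll back: return any blocks already taken. */
			for (int u = new_size; u < b; u++)
				busy[u] = false;
			return false;	/* e.g. extent busy -> -EBUSY */
		}
		busy[b] = true;
	}
	dblocks = new_size;	/* commit the geometry change */
	return true;
}
```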
Brian
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 6/7] xfs: don't update file system geometry through transaction deltas
2024-10-23 14:45 ` Brian Foster
@ 2024-10-24 18:02 ` Darrick J. Wong
0 siblings, 0 replies; 44+ messages in thread
From: Darrick J. Wong @ 2024-10-24 18:02 UTC (permalink / raw)
To: Brian Foster; +Cc: Christoph Hellwig, Chandan Babu R, linux-xfs
On Wed, Oct 23, 2024 at 10:45:51AM -0400, Brian Foster wrote:
> On Mon, Oct 21, 2024 at 09:59:18AM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 18, 2024 at 08:27:23AM -0400, Brian Foster wrote:
> > > On Tue, Oct 15, 2024 at 09:42:05AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Oct 14, 2024 at 02:50:37PM -0400, Brian Foster wrote:
> > > > > On Fri, Oct 11, 2024 at 04:12:41PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Oct 11, 2024 at 02:41:17PM -0400, Brian Foster wrote:
> > > > > > > On Fri, Oct 11, 2024 at 10:13:03AM -0700, Darrick J. Wong wrote:
> > > > > > > > On Fri, Oct 11, 2024 at 10:02:16AM -0400, Brian Foster wrote:
> > > > > > > > > On Fri, Oct 11, 2024 at 09:57:09AM +0200, Christoph Hellwig wrote:
> > > > > > > > > > On Thu, Oct 10, 2024 at 10:05:53AM -0400, Brian Foster wrote:
> > > > > > > > > > > Ok, so we don't want geometry change transactions in the same CIL
> > > > > > > > > > > checkpoint as alloc related transactions that might depend on the
> > > > > > > > > > > geometry changes. That seems reasonable and on a first pass I have an
> > > > > > > > > > > idea of what this is doing, but the description is kind of vague.
> > > > > > > > > > > Obviously this fixes an issue on the recovery side (since I've tested
> > > > > > > > > > > it), but it's not quite clear to me from the description and/or logic
> > > > > > > > > > > changes how that issue manifests.
> > > > > > > > > > >
> > > > > > > > > > > Could you elaborate please? For example, is this some kind of race
> > > > > > > > > > > situation between an allocation request and a growfs transaction, where
> > > > > > > > > > > the former perhaps sees a newly added AG between the time the growfs
> > > > > > > > > > > transaction commits (applying the sb deltas) and it actually commits to
> > > > > > > > > > > the log due to being a sync transaction, thus allowing an alloc on a new
> > > > > > > > > > > AG into the same checkpoint that adds the AG?
> > > > > > > > > >
> > > > > > > > > > This is based on the feedback by Dave on the previous version:
> > > > > > > > > >
> > > > > > > > > > https://lore.kernel.org/linux-xfs/Zut51Ftv%2F46Oj386@dread.disaster.area/
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Ah, Ok. That all seems reasonably sane to me on a first pass, but I'm
> > > > > > > > > not sure I'd go straight to this change given the situation...
> > > > > > > > >
> > > > > > > > > > Just doing the perag/in-core sb updates earlier fixes all the issues
> > > > > > > > > > with my test case, so I'm not actually sure how to get more updates
> > > > > > > > > > into the same checkpoint. I'll try your exercisers to see if they
> > > > > > > > > > can hit that.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Ok, that explains things a bit. My observation is that the first 5
> > > > > > > > > patches or so address the mount failure problem, but from there I'm not
> > > > > > > > > reproducing much difference with or without the final patch.
> > > > > > > >
> > > > > > > > Does this change to flush the log after committing the new sb fix the
> > > > > > > > recovery problems on older kernels? I /think/ that's the point of this
> > > > > > > > patch.
> > > > > > > >
> > > > > > >
> > > > > > > I don't follow.. growfs always forced the log via the sync transaction,
> > > > > > > right? Or do you mean something else by "change to flush the log?"
> > > > > >
> > > > > > I guess I was typing a bit too fast this morning -- "change to flush the
> > > > > > log to disk before anyone else can get their hands on the superblock".
> > > > > > You're right that xfs_log_sb and data-device growfs already do that.
> > > > > >
> > > > > > That said, growfsrt **doesn't** call xfs_trans_set_sync, so that's a bug
> > > > > > that this patch fixes, right?
> > > > > >
> > > > >
> > > > > Ah, Ok.. that makes sense. Sounds like it could be..
> > > >
> > > > Yeah. Hey Christoph, would you mind prepending a minimal fix patch to
> > > > set xfs_trans_set_sync in growfsrt before this one that refactors the
> > > > existing growfs/sb updates?
> > > >
> > > > > > > I thought the main functional change of this patch was to hold the
> > > > > > > superblock buffer locked across the force so nothing else can relog the
> > > > > > > new geometry superblock buffer in the same cil checkpoint. Presumably,
> > > > > > > the theory is that prevents recovery from seeing updates to different
> > > > > > > buffers that depend on the geometry update before the actual sb geometry
> > > > > > > update is recovered (because the latter might have been relogged).
> > > > > > >
> > > > > > > Maybe we're saying the same thing..? Or maybe I just misunderstand.
> > > > > > > Either way I think patch could use a more detailed commit log...
> > > > > >
> > > > > > <nod> The commit message should point out that we're fixing a real bug
> > > > > > here, which is that growfsrt doesn't force the log to disk when it
> > > > > > commits the new rt geometry.
> > > > > >
> > > > >
> > > > > Maybe even make it a separate patch to pull apart some of these cleanups
> > > > > from fixes. I was also wondering if the whole locking change is the
> > > > > moral equivalent of locking the sb across the growfs trans (i.e.
> > > > > trans_getsb() + trans_bhold()), at which point maybe that would be a
> > > > > reasonable incremental patch too.
> > > > >
> > > > > > > > > Either way,
> > > > > > > > > I see aborts and splats all over the place, which implies at minimum
> > > > > > > > > this isn't the only issue here.
> > > > > > > >
> > > > > > > > Ugh. I've recently noticed the long soak logrecovery test vm have seen
> > > > > > > > a slight tick up in failure rates -- random blocks that have clearly had
> > > > > > > > garbage written to them such that recovery tries to read the block to
> > > > > > > > recover a buffer log item and kaboom. At this point it's unclear if
> > > > > > > > that's a problem with xfs or somewhere else. :(
> > > > > > > >
> > > > > > > > > So given that 1. growfs recovery seems pretty much broken, 2. this
> > > > > > > > > particular patch has no straightforward way to test that it fixes
> > > > > > > > > something and at the same time doesn't break anything else, and 3. we do
> > > > > > > >
> > > > > > > > I'm curious, what might break? Was that merely a general comment, or do
> > > > > > > > you have something specific in mind? (iows: do you see more string to
> > > > > > > > pull? :))
> > > > > > > >
> > > > > > >
> > > > > > > Just a general comment..
> > > > > > >
> > > > > > > Something related that isn't totally clear to me is what about the
> > > > > > > inverse shrink situation where dblocks is reduced. I.e., is there some
> > > > > > > similar scenario where for example instead of the sb buffer being
> > > > > > > relogged past some other buffer update that depends on it, some other
> > > > > > > change is relogged past a sb update that invalidates/removes blocks
> > > > > > > referenced by the relogged buffer..? If so, does that imply a shrink
> > > > > > > should flush the log before the shrink transaction commits to ensure it
> > > > > > > lands in a new checkpoint (as opposed to ensuring followon updates land
> > > > > > > in a new checkpoint)..?
> > > > > >
> > > > > > I think so. Might we want to do that before and after to be careful?
> > > > > >
> > > > >
> > > > > Yeah maybe. I'm not quite sure if even that's enough. I.e. assuming we
> > > > > had a log preflush to flush out already committed changes before the
> > > > > grow, I don't think anything really prevents another "problematic"
> > > > > transaction from committing after that preflush.
> > > >
> > > > Yeah, I guess you'd have to hold the AGF while forcing the log, wouldn't
> > > > you?
> > > >
> > >
> > > I guess it depends on how far into the weeds we want to get. I'm not
> > > necessarily sure that anything exists today that is definitely
> > > problematic wrt shrink. That would probably warrant an audit of
> > > transactions or some other high level analysis to disprove. More thought
> > > needed.
> >
> > <nod> I think there isn't a problem with shrink because the shrink
> > transaction itself must be able to find the space, which means that
> > there cannot be any files or unfinished deferred ops pointing to that
> > space.
> >
>
> Ok, that makes sense to me. On poking through the code, one thing it
> looks like it misses is the case where the allocation fails purely
> due to extents being busy (i.e. even if the blocks are free).
>
> That doesn't seem harmful, just perhaps a little odd that a shrink might
> fail and then succeed some time later, once the log tail pushes, with
> no other visible changes. Unless I'm missing something, it might be
> worth adding the log force there (that I think you suggested earlier).
Any program that relies on a particular piece of space being in a
particular state can fail due to other active threads, so I'm not
worried about that.
> All that said, I'm not so sure this will all apply the same if shrink
> grows to a more active implementation. For example, I suspect an
> implementation that learns to quiesce/truncate full AGs isn't going to
> necessarily need to allocate all of the blocks out of the AG free space
> trees. But of course, this is all vaporware anyways. :)
<nod> That's a burden for whoever ends up working on AG removal. ;)
> > > Short of the latter, I'm more thinking about the question "is there some
> > > new thing we could add years down the line that 1. adds something to the
> > > log that could conflict and 2. could be reordered past a shrink
> > > transaction in a problematic way?" If the answer to that is open ended
> > > and some such thing does come along, I think it's highly likely this
> > > would just break growfs logging again until somebody trips over it in
> > > the field.
> >
> > Good thing we have a couple of tests now? :)
> >
>
> Not with good shrink support, unfortunately. :/ I tried adding basic
> shrink calls into the proposed tests, but I wasn't able to see them
> actually do anything, presumably because of the stress workload and
> smallish filesystem keeping those blocks in use. It might require a more
> active shrink implementation before we could support this kind of test,
> or otherwise maybe something more limited/targeted than fsstress.
I suspect you're right.
--D
> Hmm.. a random, off the top of my head idea might be a debug knob that
> artificially restricts all allocations beyond a certain disk offset...
>
> Brian
>
> > > > > I dunno.. on one hand it does seem like an unlikely thing due to the
> > > > > nature of needing space to be free in order to shrink in the first
> > > > > place, but OTOH if you have something like grow that is rare, not
> > > > > performance sensitive, has a history of not being well tested, and has
> > > > > these subtle ordering requirements that might change indirectly to other
> > > > > transactions, ISTM it could be a wise engineering decision to simplify
> > > > > to the degree possible and find the most basic model that enforces
> > > > > predictable ordering.
> > > > >
> > > > > So for a hacky thought/example, suppose we defined a transaction mode
> > > > > that basically implemented an exclusive checkpoint requirement (i.e.
> > > > > this transaction owns an entire checkpoint, nothing else is allowed in
> > > > > the CIL concurrently). Presumably that would ensure everything before
> > > > > the grow would flush out to disk in one checkpoint, everything
> > > > > concurrent would block on synchronous commit of the grow trans (before
> > > > > new geometry is exposed), and then after that point everything pending
> > > > > would drain into another checkpoint.
> > > > >
> > > > > It kind of sounds like overkill, but really if it could be implemented
> > > > > simply enough then we wouldn't have to think too hard about auditing all
> > > > > other relog scenarios. I'd probably want to see at least some reproducer
> > > > > for this sort of problem to prove the theory though too, even if it
> > > > > required debug instrumentation or something. Hm?
> > > >
> > > > What if we redefined the input requirements to shrink? Lets say we
> > > > require that the fd argument to a shrink ioctl is actually an unlinkable
> > > > O_TMPFILE regular file with the EOFS blocks mapped to it. Then we can
> > > > force the log without holding any locks, and the shrink transaction can
> > > > remove the bmap and rmap records at the same time that it updates the sb
> > > > geometry. The otherwise inaccessible file means that nobody can reuse
> > > > that space between the log force and the sb update.
> > > >
> > >
> > > Interesting thought. It kind of sounds like how shrink already works to
> > > some degree, right? I.e. the kernel side allocs the blocks out of the
> > > btrees and tosses them, just no inode in the mix?
> >
> > Right.
> >
> > > Honestly I'd probably need to stare at this code and think about it and
> > > work through some scenarios to quantify how much of a concern this
> > > really is, and I don't really have the bandwidth for that just now. I
> > > mainly wanted to raise the notion that if we're assessing high level log
> > > ordering requirements for growfs, we should consider the shrink case as
> > > well.
> >
> > <nod>
> >
> > > > > > > Anyways, my point is just that if it were me I wouldn't get too deep
> > > > > > > into this until some of the reproducible growfs recovery issues are at
> > > > > > > least characterized and testing is more sorted out.
> > > > > > >
> > > > > > > The context for testing is here [1]. The TLDR is basically that
> > > > > > > Christoph has a targeted test that reproduces the initial mount failure
> > > > > > > and I hacked up a more general test that also reproduces it and
> > > > > > > additional growfs recovery problems. This test does seem to confirm that
> > > > > > > the previous patches address the mount failure issue, but this patch
> > > > > > > doesn't seem to prevent any of the other problems produced by the
> > > > > > > generic test. That might just mean the test doesn't reproduce what this
> > > > > > > fixes, but it's kind of hard to at least regression test something like
> > > > > > > this when basic growfs crash-recovery seems pretty much broken.
> > > > > >
> > > > > > Hmm, if you make a variant of that test which formats with an rt device
> > > > > > and -d rtinherit=1 and then runs xfs_growfs -R instead of -D, do you see
> > > > > > similar blowups? Let's see what happens if I do that...
> > > > > >
> > > > >
> > > > > Heh, sounds like so from your followup. Fun times.
> > > > >
> > > > > I guess that test should probably work its way upstream. I made some
> > > > > tweaks locally since last posted to try and make it a little more
> > > > > aggressive, but it didn't repro anything new so not sure how much
> > > > > difference it makes really. Do we want a separate version like yours for
> > > > > the rt case or would you expect to cover both cases in a single test?
> > > >
> > > > This probably should be different tests, because rt is its own very
> > > > weird animal.
> > > >
> > >
> > > Posted a couple tests the other day, JFYI.
> > >
> > > Brian
> > >
> > > > --D
> > > >
> > > > > Brian
> > > > >
> > > > > > --D
> > > > > >
> > > > > > > Brian
> > > > > > >
> > > > > > > [1] https://lore.kernel.org/fstests/ZwVdtXUSwEXRpcuQ@bfoster/
> > > > > > >
> > > > > > > > > have at least one fairly straightforward growfs/recovery test in the
> > > > > > > > > works that reliably explodes, personally I'd suggest to split this work
> > > > > > > > > off into separate series.
> > > > > > > > >
> > > > > > > > > It seems reasonable enough to me to get patches 1-5 in asap once they're
> > > > > > > > > fully cleaned up, and then leave the next two as part of a followon
> > > > > > > > > series pending further investigation into these other issues. As part of
> > > > > > > > > that I'd like to know whether the recovery test reproduces (or can be
> > > > > > > > > made to reproduce) the issue this patch presumably fixes, but I'd also
> > > > > > > > > settle for "the grow recovery test now passes reliably and this doesn't
> > > > > > > > > regress it." But once again, just my .02.
> > > > > > > >
> > > > > > > > Yeah, it's too bad there's no good way to test recovery with older
> > > > > > > > kernels either. :(
> > > > > > > >
> > > > > > > > --D
> > > > > > > >
> > > > > > > > > Brian
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2024-10-24 18:02 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-30 16:41 fix recovery of allocator ops after a growfs Christoph Hellwig
2024-09-30 16:41 ` [PATCH 1/7] xfs: pass the exact range to initialize to xfs_initialize_perag Christoph Hellwig
2024-10-10 14:02 ` Brian Foster
2024-10-11 7:53 ` Christoph Hellwig
2024-10-11 14:01 ` Brian Foster
2024-09-30 16:41 ` [PATCH 2/7] xfs: merge the perag freeing helpers Christoph Hellwig
2024-10-10 14:02 ` Brian Foster
2024-09-30 16:41 ` [PATCH 3/7] xfs: update the file system geometry after recoverying superblock buffers Christoph Hellwig
2024-09-30 16:50 ` Darrick J. Wong
2024-10-01 8:49 ` Christoph Hellwig
2024-10-10 16:02 ` Darrick J. Wong
2024-10-10 14:03 ` Brian Foster
2024-09-30 16:41 ` [PATCH 4/7] xfs: error out when a superblock buffer updates reduces the agcount Christoph Hellwig
2024-09-30 16:51 ` Darrick J. Wong
2024-10-01 8:47 ` Christoph Hellwig
2024-10-10 14:04 ` Brian Foster
2024-09-30 16:41 ` [PATCH 5/7] xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag Christoph Hellwig
2024-10-10 14:04 ` Brian Foster
2024-09-30 16:41 ` [PATCH 6/7] xfs: don't update file system geometry through transaction deltas Christoph Hellwig
2024-10-10 14:05 ` Brian Foster
2024-10-11 7:57 ` Christoph Hellwig
2024-10-11 14:02 ` Brian Foster
2024-10-11 17:13 ` Darrick J. Wong
2024-10-11 18:41 ` Brian Foster
2024-10-11 23:12 ` Darrick J. Wong
2024-10-11 23:29 ` Darrick J. Wong
2024-10-14 5:58 ` Christoph Hellwig
2024-10-14 15:30 ` Darrick J. Wong
2024-10-14 18:50 ` Brian Foster
2024-10-15 16:42 ` Darrick J. Wong
2024-10-18 12:27 ` Brian Foster
2024-10-21 16:59 ` Darrick J. Wong
2024-10-23 14:45 ` Brian Foster
2024-10-24 18:02 ` Darrick J. Wong
2024-10-21 13:38 ` Dave Chinner
2024-10-23 15:06 ` Brian Foster
2024-10-10 19:01 ` Darrick J. Wong
2024-10-11 7:59 ` Christoph Hellwig
2024-10-11 16:44 ` Darrick J. Wong
2024-09-30 16:41 ` [PATCH 7/7] xfs: split xfs_trans_mod_sb Christoph Hellwig
2024-10-10 14:06 ` Brian Foster
2024-10-11 7:54 ` Christoph Hellwig
2024-10-11 14:05 ` Brian Foster
2024-10-11 16:50 ` Darrick J. Wong