[PATCHBOMB] xfsprogs: ports and new code for 6.16

linux-xfs.vger.kernel.org archive mirror

* [PATCHBOMB] xfsprogs: ports and new code for 6.16
@ 2025-07-01 18:03 Darrick J. Wong
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:03 UTC (permalink / raw)
  To: Andrey Albershteyn; +Cc: linux-xfs, John Garry, Catherine Hoang

Hi everyone,

This is a collection of all the new code that I have for xfsprogs 6.16.
First are the libxfs ports from Linux 6.16; after that come the
userspace changes to report hardware atomic write capabilities and
format a filesystem that will play nicely with the storage; and finally
a patch to remove EXPERIMENTAL from xfs_scrub.

--D


* [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16
  2025-07-01 18:03 [PATCHBOMB] xfsprogs: ports and new code for 6.16 Darrick J. Wong
@ 2025-07-01 18:04 ` Darrick J. Wong
  2025-07-01 18:05   ` [PATCH 1/6] xfs: add helpers to compute transaction reservation for finishing intent items Darrick J. Wong
                     ` (5 more replies)
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
  2025-07-01 18:05 ` [PATCHSET 3/3] xfs_scrub: drop EXPERIMENTAL warning Darrick J. Wong
  2 siblings, 6 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:04 UTC (permalink / raw)
  To: djwong, aalbersh
  Cc: hch, john.g.garry, catherine.hoang, john.g.garry, linux-xfs

Hi all,

Port kernel libxfs code to userspace.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=libxfs-sync-6.16
---
Commits in this patchset:
 * xfs: add helpers to compute transaction reservation for finishing intent items
 * xfs: allow block allocator to take an alignment hint
 * xfs: commit CoW-based atomic writes atomically
 * libxfs: add helpers to compute log item overhead
 * xfs: add xfs_calc_atomic_write_unit_max()
 * xfs: allow sysadmins to specify a maximum atomic write limit at mount time
---
 include/platform_defs.h |   14 ++
 include/xfs_trace.h     |    3 
 libxfs/defer_item.h     |   14 ++
 libxfs/xfs_bmap.h       |    6 +
 libxfs/xfs_trans_resv.h |   25 +++
 libxfs/defer_item.c     |   51 +++++++
 libxfs/xfs_bmap.c       |    5 +
 libxfs/xfs_log_rlimit.c |    4 +
 libxfs/xfs_trans_resv.c |  339 +++++++++++++++++++++++++++++++++++++++++++----
 9 files changed, 429 insertions(+), 32 deletions(-)



* [PATCH 1/6] xfs: add helpers to compute transaction reservation for finishing intent items
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
@ 2025-07-01 18:05   ` Darrick J. Wong
  2025-07-01 18:05   ` [PATCH 2/6] xfs: allow block allocator to take an alignment hint Darrick J. Wong
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:05 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: john.g.garry, catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Source kernel commit: 805f89881252a9aee30799b8a395deec79c13414

In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions.  These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 libxfs/xfs_trans_resv.h |   18 +++++
 libxfs/xfs_trans_resv.c |  165 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 152 insertions(+), 31 deletions(-)
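
A minimal sketch, not part of the commit, of the sector-buffer term in
xfs_calc_finish_efi_reservation().  It can help when checking the
converted call sites in the diff against the constants they used to
open-code.  The names efi_sector_bufs() and show_efi_sector_bufs() are
hypothetical and exist only for this illustration; the block-sized btree
term is left out.

```c
#include <stdio.h>

/*
 * Illustration only: efi_sector_bufs() is a hypothetical name, not a
 * libxfs function.  It models the sector-sized buffers counted by
 * xfs_calc_finish_efi_reservation(): one AGF and one AGFL per freed
 * extent, plus the superblock.
 */
unsigned int efi_sector_bufs(unsigned int nr)
{
	return (2 * nr) + 1;
}

/* Print the helper's count beside the constant a caller used to open-code. */
void show_efi_sector_bufs(const char *caller, unsigned int nr,
			  unsigned int old_count)
{
	printf("%-10s nr=%u old=%u helper=%u\n", caller, nr, old_count,
	       efi_sector_bufs(nr));
}
```

For nr = 4, 3 and 1 this matches the old constants 9, 7 and 3 in the
truncate, rename and link paths.  For nr = 2 it gives 5, whereas the
remove path previously open-coded 4, so that reservation appears to grow
by one sector buffer.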


diff --git a/libxfs/xfs_trans_resv.h b/libxfs/xfs_trans_resv.h
index 0554b9d775d269..d9d0032cbbc5d4 100644
--- a/libxfs/xfs_trans_resv.h
+++ b/libxfs/xfs_trans_resv.h
@@ -98,6 +98,24 @@ struct xfs_trans_resv {
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
 uint xfs_allocfree_block_count(struct xfs_mount *mp, uint num_ops);
 
+unsigned int xfs_calc_finish_bui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_efi_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_efi_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_rui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_rui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_cui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_cui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
 unsigned int xfs_calc_itruncate_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
diff --git a/libxfs/xfs_trans_resv.c b/libxfs/xfs_trans_resv.c
index 4034c52790de8b..0a843deb50a118 100644
--- a/libxfs/xfs_trans_resv.c
+++ b/libxfs/xfs_trans_resv.c
@@ -260,6 +260,42 @@ xfs_rtalloc_block_count(
  * register overflow from temporaries in the calculations.
  */
 
+/*
+ * Finishing a data device refcount update (t1):
+ *    the agfs of the ags containing the blocks: nr_ops * sector size
+ *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_cui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr_ops)
+{
+	if (!xfs_has_reflink(mp))
+		return 0;
+
+	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Realtime refcount updates (t2):
+ *    the rt refcount inode
+ *    the rtrefcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_rt_cui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr_ops)
+{
+	if (!xfs_has_rtreflink(mp))
+		return 0;
+
+	return xfs_calc_inode_res(mp, 1) +
+	       xfs_calc_buf_res(xfs_rtrefcountbt_block_count(mp, nr_ops),
+				     mp->m_sb.sb_blocksize);
+}
+
 /*
  * Compute the log reservation required to handle the refcount update
  * transaction.  Refcount updates are always done via deferred log items.
@@ -277,19 +313,10 @@ xfs_calc_refcountbt_reservation(
 	struct xfs_mount	*mp,
 	unsigned int		nr_ops)
 {
-	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
-	unsigned int		t1, t2 = 0;
+	unsigned int		t1, t2;
 
-	if (!xfs_has_reflink(mp))
-		return 0;
-
-	t1 = xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops), blksz);
-
-	if (xfs_has_realtime(mp))
-		t2 = xfs_calc_inode_res(mp, 1) +
-		     xfs_calc_buf_res(xfs_rtrefcountbt_block_count(mp, nr_ops),
-				     blksz);
+	t1 = xfs_calc_finish_cui_reservation(mp, nr_ops);
+	t2 = xfs_calc_finish_rt_cui_reservation(mp, nr_ops);
 
 	return max(t1, t2);
 }
@@ -376,6 +403,96 @@ xfs_calc_write_reservation_minlogsize(
 	return xfs_calc_write_reservation(mp, true);
 }
 
+/*
+ * Finishing an EFI can free the blocks and bmap blocks (t2):
+ *    the agf for each of the ags: nr * sector size
+ *    the agfl for each of the ags: nr * sector size
+ *    the super block to reflect the freed blocks: sector size
+ *    worst case split in allocation btrees per extent assuming nr extents:
+ *		nr exts * 2 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_efi_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	return xfs_calc_buf_res((2 * nr) + 1, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_allocfree_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Or, if it's a realtime file (t3):
+ *    the agf for each of the ags: 2 * sector size
+ *    the agfl for each of the ags: 2 * sector size
+ *    the super block to reflect the freed blocks: sector size
+ *    the realtime bitmap:
+ *		2 exts * ((XFS_BMBT_MAX_EXTLEN / rtextsize) / NBBY) bytes
+ *    the realtime summary: 2 exts * 1 block
+ *    worst case split in allocation btrees per extent assuming 2 extents:
+ *		2 exts * 2 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_rt_efi_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_realtime(mp))
+		return 0;
+
+	return xfs_calc_buf_res((2 * nr) + 1, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_rtalloc_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize) +
+	       xfs_calc_buf_res(xfs_allocfree_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Finishing an RUI is the same as an EFI.  We can split the rmap btree twice
+ * on each end of the record, and that can cause the AGFL to be refilled or
+ * emptied out.
+ */
+inline unsigned int
+xfs_calc_finish_rui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_rmapbt(mp))
+		return 0;
+	return xfs_calc_finish_efi_reservation(mp, nr);
+}
+
+/*
+ * Finishing an RUI is the same as an EFI.  We can split the rmap btree twice
+ * on each end of the record, and that can cause the AGFL to be refilled or
+ * emptied out.
+ */
+inline unsigned int
+xfs_calc_finish_rt_rui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_rtrmapbt(mp))
+		return 0;
+	return xfs_calc_finish_rt_efi_reservation(mp, nr);
+}
+
+/*
+ * In finishing a BUI, we can modify:
+ *    the inode being truncated: inode size
+ *    dquots
+ *    the inode's bmap btree: (max depth + 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_bui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	return xfs_calc_inode_res(mp, 1) + XFS_DQUOT_LOGRES +
+	       xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
+			       mp->m_sb.sb_blocksize);
+}
+
 /*
  * In truncating a file we free up to two extents at once.  We can modify (t1):
  *    the inode being truncated: inode size
@@ -408,16 +525,8 @@ xfs_calc_itruncate_reservation(
 	t1 = xfs_calc_inode_res(mp, 1) +
 	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
 
-	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 4), blksz);
-
-	if (xfs_has_realtime(mp)) {
-		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_rtalloc_block_count(mp, 2), blksz) +
-		     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 2), blksz);
-	} else {
-		t3 = 0;
-	}
+	t2 = xfs_calc_finish_efi_reservation(mp, 4);
+	t3 = xfs_calc_finish_rt_efi_reservation(mp, 2);
 
 	/*
 	 * In the early days of reflink, we included enough reservation to log
@@ -498,9 +607,7 @@ xfs_calc_rename_reservation(
 	     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
 			XFS_FSB_TO_B(mp, 1));
 
-	t2 = xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 3),
-			XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 3);
 
 	if (xfs_has_parent(mp)) {
 		unsigned int	rename_overhead, exchange_overhead;
@@ -608,9 +715,7 @@ xfs_calc_link_reservation(
 	overhead += xfs_calc_iunlink_remove_reservation(mp);
 	t1 = xfs_calc_inode_res(mp, 2) +
 	     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
-	t2 = xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 1),
-			      XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 1);
 
 	if (xfs_has_parent(mp)) {
 		t3 = resp->tr_attrsetm.tr_logres;
@@ -673,9 +778,7 @@ xfs_calc_remove_reservation(
 
 	t1 = xfs_calc_inode_res(mp, 2) +
 	     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
-	t2 = xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 2),
-			      XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 2);
 
 	if (xfs_has_parent(mp)) {
 		t3 = resp->tr_attrrm.tr_logres;



* [PATCH 2/6] xfs: allow block allocator to take an alignment hint
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
  2025-07-01 18:05   ` [PATCH 1/6] xfs: add helpers to compute transaction reservation for finishing intent items Darrick J. Wong
@ 2025-07-01 18:05   ` Darrick J. Wong
  2025-07-01 18:06   ` [PATCH 3/6] xfs: commit CoW-based atomic writes atomically Darrick J. Wong
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:05 UTC (permalink / raw)
  To: djwong, aalbersh
  Cc: hch, john.g.garry, catherine.hoang, john.g.garry, linux-xfs

From: John Garry <john.g.garry@oracle.com>

Source kernel commit: 6baf4cc47a741024d37e6149d5d035d3fc9ed1fe

Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not being
allocated extents which are not suitable (for atomic writes).

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 libxfs/xfs_bmap.h |    6 +++++-
 libxfs/xfs_bmap.c |    5 +++++
 2 files changed, 10 insertions(+), 1 deletion(-)
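
A minimal sketch, not part of the commit, of the decision the new flag
adds to xfs_bmap_compute_alignments().  pick_alignment() is a
hypothetical helper written for this note; only the flag value is taken
from the patch.

```c
/* Flag value copied from the xfs_bmap.h hunk in this patch. */
#define XFS_BMAPI_EXTSZALIGN	(1u << 11)

/*
 * Illustration only: pick_alignment() is a hypothetical helper, not libxfs
 * code.  It returns the start-block alignment that would be requested: the
 * extent size hint when the caller passed XFS_BMAPI_EXTSZALIGN and the hint
 * is larger than one block, otherwise the alignment already chosen.
 */
unsigned int pick_alignment(unsigned int flags, unsigned int extsz_hint,
			    unsigned int cur_alignment)
{
	if (extsz_hint > 1 && (flags & XFS_BMAPI_EXTSZALIGN))
		return extsz_hint;
	return cur_alignment;
}
```

Without the flag, or with a hint of 0 or 1, the existing alignment is
left alone, which mirrors the "align > 1" test in the hunk.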


diff --git a/libxfs/xfs_bmap.h b/libxfs/xfs_bmap.h
index b4d9c6e0f3f9b6..d5f2729305fada 100644
--- a/libxfs/xfs_bmap.h
+++ b/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	(1u << 10)
 
+/* Try to align allocations to the extent size hint */
+#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
+	{ XFS_BMAPI_EXTSZALIGN,	"EXTSZALIGN" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/libxfs/xfs_bmap.c b/libxfs/xfs_bmap.c
index 3cb47c3c8707db..99f5e6f9d545a4 100644
--- a/libxfs/xfs_bmap.c
+++ b/libxfs/xfs_bmap.c
@@ -3306,6 +3306,11 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	/* Try to align start block to any minimum allocation alignment */
+	if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
+		args->alignment = align;
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,



* [PATCH 3/6] xfs: commit CoW-based atomic writes atomically
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
  2025-07-01 18:05   ` [PATCH 1/6] xfs: add helpers to compute transaction reservation for finishing intent items Darrick J. Wong
  2025-07-01 18:05   ` [PATCH 2/6] xfs: allow block allocator to take an alignment hint Darrick J. Wong
@ 2025-07-01 18:06   ` Darrick J. Wong
  2025-07-01 18:06   ` [PATCH 4/6] libxfs: add helpers to compute log item overhead Darrick J. Wong
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:06 UTC (permalink / raw)
  To: djwong, aalbersh
  Cc: hch, john.g.garry, catherine.hoang, john.g.garry, linux-xfs

From: John Garry <john.g.garry@oracle.com>

Source kernel commit: b1e09178b73adf10dc87fba9aee7787a7ad26874

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is ignored for now since the
count of items for a typical atomic write would be much lower than that
limit. A typical atomic write would be expected to be 64KB or less,
which (with 4KB blocks) means at most 16 extent unmaps, which is quite
small.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 libxfs/xfs_trans_resv.h |    1 +
 libxfs/xfs_log_rlimit.c |    4 ++++
 libxfs/xfs_trans_resv.c |   15 +++++++++++++++
 3 files changed, 20 insertions(+)
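
A minimal sketch, not part of the commit, of the arithmetic behind the
"16 extent unmaps" figure in the commit message.  It assumes a 4KB
filesystem block size and the worst case where every block of the write
sits in its own extent.  worst_case_unmaps() is a hypothetical name.

```c
/*
 * Illustration only: worst_case_unmaps() is a hypothetical helper.  If every
 * block of the write landed in its own extent, completing the write would
 * need one unmap per filesystem block.
 */
unsigned int worst_case_unmaps(unsigned int write_bytes,
			       unsigned int block_bytes)
{
	if (block_bytes == 0)
		return 0;
	/* Round up so a partial trailing block still counts. */
	return (write_bytes + block_bytes - 1) / block_bytes;
}
```

With 1KB blocks the same 64KB write could need up to 64 unmaps, so the
estimate depends on the block size.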


diff --git a/libxfs/xfs_trans_resv.h b/libxfs/xfs_trans_resv.h
index d9d0032cbbc5d4..670045d417a65f 100644
--- a/libxfs/xfs_trans_resv.h
+++ b/libxfs/xfs_trans_resv.h
@@ -48,6 +48,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_dqalloc;	/* allocate quota on disk */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_atomic_ioend; /* untorn write completion */
 };
 
 /* shorthand way of accessing reservation structure */
diff --git a/libxfs/xfs_log_rlimit.c b/libxfs/xfs_log_rlimit.c
index 246d5f486d024a..2b9047f5ffb58b 100644
--- a/libxfs/xfs_log_rlimit.c
+++ b/libxfs/xfs_log_rlimit.c
@@ -91,6 +91,7 @@ xfs_log_calc_trans_resv_for_minlogblocks(
 	 */
 	if (xfs_want_minlogsize_fixes(&mp->m_sb)) {
 		xfs_trans_resv_calc(mp, resv);
+		resv->tr_atomic_ioend = M_RES(mp)->tr_atomic_ioend;
 		return;
 	}
 
@@ -107,6 +108,9 @@ xfs_log_calc_trans_resv_for_minlogblocks(
 
 	xfs_trans_resv_calc(mp, resv);
 
+	/* Copy the dynamic transaction reservation types from the running fs */
+	resv->tr_atomic_ioend = M_RES(mp)->tr_atomic_ioend;
+
 	if (xfs_has_reflink(mp)) {
 		/*
 		 * In the early days of reflink, typical log operation counts
diff --git a/libxfs/xfs_trans_resv.c b/libxfs/xfs_trans_resv.c
index 0a843deb50a118..cf735f946c8ac7 100644
--- a/libxfs/xfs_trans_resv.c
+++ b/libxfs/xfs_trans_resv.c
@@ -1281,6 +1281,15 @@ xfs_calc_namespace_reservations(
 	resp->tr_mkdir.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 }
 
+STATIC void
+xfs_calc_default_atomic_ioend_reservation(
+	struct xfs_mount	*mp,
+	struct xfs_trans_resv	*resp)
+{
+	/* Pick a default that will scale reasonably for the log size. */
+	resp->tr_atomic_ioend = resp->tr_itruncate;
+}
+
 void
 xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
@@ -1375,4 +1384,10 @@ xfs_trans_resv_calc(
 	resp->tr_itruncate.tr_logcount += logcount_adj;
 	resp->tr_write.tr_logcount += logcount_adj;
 	resp->tr_qm_dqalloc.tr_logcount += logcount_adj;
+
+	/*
+	 * Now that we've finished computing the static reservations, we can
+	 * compute the dynamic reservation for atomic writes.
+	 */
+	xfs_calc_default_atomic_ioend_reservation(mp, resp);
 }



* [PATCH 4/6] libxfs: add helpers to compute log item overhead
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-01 18:06   ` [PATCH 3/6] xfs: commit CoW-based atomic writes atomically Darrick J. Wong
@ 2025-07-01 18:06   ` Darrick J. Wong
  2025-07-01 18:06   ` [PATCH 5/6] xfs: add xfs_calc_atomic_write_unit_max() Darrick J. Wong
  2025-07-01 18:06   ` [PATCH 6/6] xfs: allow sysadmins to specify a maximum atomic write limit at mount time Darrick J. Wong
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:06 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log.  These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/defer_item.h |   14 ++++++++++++++
 libxfs/defer_item.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)
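
A minimal userspace sketch, not part of the commit, of the rounding done
by xlog_item_space().  OP_HEADER_BYTES is an assumed stand-in for
sizeof(struct xlog_op_header), which this patch does not show, and
item_space() and round_up_8() are hypothetical names.

```c
#include <stdint.h>

/*
 * Assumption: 12 bytes stands in for sizeof(struct xlog_op_header).  The
 * real structure is not part of this patch, so treat the value as
 * illustrative.
 */
#define OP_HEADER_BYTES	12u

/* Round x up to the next multiple of 8. */
unsigned int round_up_8(unsigned int x)
{
	return (x + 7u) & ~7u;
}

/*
 * Hypothetical model of xlog_item_space(): each iovec costs one 64-bit
 * word plus one op header on top of the payload, and the total is rounded
 * up to a 64-bit boundary.
 */
unsigned int item_space(unsigned int niovecs, unsigned int nbytes)
{
	nbytes += niovecs * ((unsigned int)sizeof(uint64_t) + OP_HEADER_BYTES);
	return round_up_8(nbytes);
}
```

Under that assumption a single-iovec item with a 30-byte payload would
take 30 + 20 = 50 bytes, rounded up to 56.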


diff --git a/libxfs/defer_item.h b/libxfs/defer_item.h
index 93cf1eed58a382..325a6f7b2dcbce 100644
--- a/libxfs/defer_item.h
+++ b/libxfs/defer_item.h
@@ -39,4 +39,18 @@ struct xfs_refcount_intent;
 void xfs_refcount_defer_add(struct xfs_trans *tp,
 		struct xfs_refcount_intent *ri);
 
+/* log intent size calculations */
+
+unsigned int xfs_efi_log_space(unsigned int nr);
+unsigned int xfs_efd_log_space(unsigned int nr);
+
+unsigned int xfs_rui_log_space(unsigned int nr);
+unsigned int xfs_rud_log_space(void);
+
+unsigned int xfs_bui_log_space(unsigned int nr);
+unsigned int xfs_bud_log_space(void);
+
+unsigned int xfs_cui_log_space(unsigned int nr);
+unsigned int xfs_cud_log_space(void);
+
 #endif /* __LIBXFS_DEFER_ITEM_H_ */
diff --git a/libxfs/defer_item.c b/libxfs/defer_item.c
index 6beefa6a439980..4530583ddabae1 100644
--- a/libxfs/defer_item.c
+++ b/libxfs/defer_item.c
@@ -942,3 +942,54 @@ const struct xfs_defer_op_type xfs_exchmaps_defer_type = {
 	.finish_item	= xfs_exchmaps_finish_item,
 	.cancel_item	= xfs_exchmaps_cancel_item,
 };
+
+/* log intent size calculations */
+
+static inline unsigned int
+xlog_item_space(
+	unsigned int	niovecs,
+	unsigned int	nbytes)
+{
+	nbytes += niovecs * (sizeof(uint64_t) + sizeof(struct xlog_op_header));
+	return round_up(nbytes, sizeof(uint64_t));
+}
+
+unsigned int xfs_efi_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_efi_log_format_sizeof(nr));
+}
+
+unsigned int xfs_efd_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_efd_log_format_sizeof(nr));
+}
+
+unsigned int xfs_rui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_rui_log_format_sizeof(nr));
+}
+
+unsigned int xfs_rud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_rud_log_format));
+}
+
+unsigned int xfs_bui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_bui_log_format_sizeof(nr));
+}
+
+unsigned int xfs_bud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_bud_log_format));
+}
+
+unsigned int xfs_cui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_cui_log_format_sizeof(nr));
+}
+
+unsigned int xfs_cud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_cud_log_format));
+}



* [PATCH 5/6] xfs: add xfs_calc_atomic_write_unit_max()
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-01 18:06   ` [PATCH 4/6] libxfs: add helpers to compute log item overhead Darrick J. Wong
@ 2025-07-01 18:06   ` Darrick J. Wong
  2025-07-01 18:06   ` [PATCH 6/6] xfs: allow sysadmins to specify a maximum atomic write limit at mount time Darrick J. Wong
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:06 UTC (permalink / raw)
  To: djwong, aalbersh
  Cc: hch, john.g.garry, catherine.hoang, john.g.garry, linux-xfs

From: John Garry <john.g.garry@oracle.com>

Source kernel commit: 0c438dcc31504bf4f50b20dc52f8f5ca7fab53e2

Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_ioend is a clone of tr_itruncate, which provides a
reasonable level of parallelism.  In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.

[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/xfs_trace.h     |    2 +
 libxfs/xfs_trans_resv.h |    2 +
 libxfs/xfs_trans_resv.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)
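
A minimal sketch, not part of the commit, of two calculations described
above.  max_pow2_factor() illustrates the "greatest power-of-two factor
of the agsize" rule from the commit message, whose code is not visible in
this diff.  max_atomic_write_blocks() models the division in
xfs_calc_max_atomic_write_fsblocks().  Both names are hypothetical, and
the per_intent == 0 guard is an addition for the sketch.

```c
/* Illustration only: hypothetical helpers, not libxfs functions. */

/* Largest power of two that divides x; returns 0 when x is 0. */
unsigned int max_pow2_factor(unsigned int x)
{
	/* x & -x isolates the lowest set bit; written without unary minus. */
	return x & (~x + 1u);
}

/*
 * Blocks that fit in a reservation of logres bytes: set aside step_size
 * bytes to finish one intent item, then divide what is left by the
 * per-extent log overhead.  The per_intent == 0 check is not in the
 * patch; it only keeps this sketch from dividing by zero.
 */
unsigned int max_atomic_write_blocks(unsigned int logres,
				     unsigned int step_size,
				     unsigned int per_intent)
{
	if (logres == 0 || per_intent == 0 || logres < step_size)
		return 0;
	return (logres - step_size) / per_intent;
}
```

For example, an AG size of 1000 blocks has a greatest power-of-two factor
of 8, and made-up values of logres = 10000, step_size = 4000 and
per_intent = 500 would allow 12 blocks.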


diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 30166c11dd597b..965c109cc1941a 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -64,6 +64,8 @@
 #define trace_xfs_attr_rmtval_alloc(...)	((void) 0)
 #define trace_xfs_attr_rmtval_remove_return(...) ((void) 0)
 
+#define trace_xfs_calc_max_atomic_write_fsblocks(...) ((void) 0)
+
 #define trace_xfs_log_recover_item_add_cont(a,b,c,d)	((void) 0)
 #define trace_xfs_log_recover_item_add(a,b,c,d)	((void) 0)
 
diff --git a/libxfs/xfs_trans_resv.h b/libxfs/xfs_trans_resv.h
index 670045d417a65f..a6d303b836883f 100644
--- a/libxfs/xfs_trans_resv.h
+++ b/libxfs/xfs_trans_resv.h
@@ -121,4 +121,6 @@ unsigned int xfs_calc_itruncate_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
 
+xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
+
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/libxfs/xfs_trans_resv.c b/libxfs/xfs_trans_resv.c
index cf735f946c8ac7..ec61ddfba44601 100644
--- a/libxfs/xfs_trans_resv.c
+++ b/libxfs/xfs_trans_resv.c
@@ -19,6 +19,8 @@
 #include "xfs_trans_space.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtbitmap.h"
+#include "xfs_trace.h"
+#include "defer_item.h"
 
 #define _ALLOC	true
 #define _FREE	false
@@ -1391,3 +1393,91 @@ xfs_trans_resv_calc(
 	 */
 	xfs_calc_default_atomic_ioend_reservation(mp, resp);
 }
+
+/*
+ * Return the per-extent and fixed transaction reservation sizes needed to
+ * complete an atomic write.
+ */
+STATIC unsigned int
+xfs_calc_atomic_write_ioend_geometry(
+	struct xfs_mount	*mp,
+	unsigned int		*step_size)
+{
+	const unsigned int	efi = xfs_efi_log_space(1);
+	const unsigned int	efd = xfs_efd_log_space(1);
+	const unsigned int	rui = xfs_rui_log_space(1);
+	const unsigned int	rud = xfs_rud_log_space();
+	const unsigned int	cui = xfs_cui_log_space(1);
+	const unsigned int	cud = xfs_cud_log_space();
+	const unsigned int	bui = xfs_bui_log_space(1);
+	const unsigned int	bud = xfs_bud_log_space();
+
+	/*
+	 * Maximum overhead to complete an atomic write ioend in software:
+	 * remove data fork extent + remove cow fork extent + map extent into
+	 * data fork.
+	 *
+	 * tx0: Creates a BUI and a CUI and that's all it needs.
+	 *
+	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
+	 * enough space to relog the CUI (== CUI + CUD).
+	 *
+	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
+	 * to relog the CUI.
+	 *
+	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
+	 *
+	 * tx4: Roll again, need space for an EFD.
+	 *
+	 * If the extent referenced by the pair of BUI/CUI items is not the one
+	 * being currently processed, then we need to reserve space to relog
+	 * both items.
+	 */
+	const unsigned int	tx0 = bui + cui;
+	const unsigned int	tx1 = bud + rui + cui + cud;
+	const unsigned int	tx2 = rud + cui + cud;
+	const unsigned int	tx3 = cud + efi;
+	const unsigned int	tx4 = efd;
+	const unsigned int	relog = bui + bud + cui + cud;
+
+	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
+						 max3(tx3, tx4, relog));
+
+	/* Overhead to finish one step of each intent item type */
+	const unsigned int	f1 = xfs_calc_finish_efi_reservation(mp, 1);
+	const unsigned int	f2 = xfs_calc_finish_rui_reservation(mp, 1);
+	const unsigned int	f3 = xfs_calc_finish_cui_reservation(mp, 1);
+	const unsigned int	f4 = xfs_calc_finish_bui_reservation(mp, 1);
+
+	/* We only finish one item per transaction in a chain */
+	*step_size = max(f4, max3(f1, f2, f3));
+
+	return per_intent;
+}
+
+/*
+ * Compute the maximum size (in fsblocks) of atomic writes that we can complete
+ * given the existing log reservations.
+ */
+xfs_extlen_t
+xfs_calc_max_atomic_write_fsblocks(
+	struct xfs_mount		*mp)
+{
+	const struct xfs_trans_res	*resv = &M_RES(mp)->tr_atomic_ioend;
+	unsigned int			per_intent = 0;
+	unsigned int			step_size = 0;
+	unsigned int			ret = 0;
+
+	if (resv->tr_logres > 0) {
+		per_intent = xfs_calc_atomic_write_ioend_geometry(mp,
+				&step_size);
+
+		if (resv->tr_logres >= step_size)
+			ret = (resv->tr_logres - step_size) / per_intent;
+	}
+
+	trace_xfs_calc_max_atomic_write_fsblocks(mp, per_intent, step_size,
+			resv->tr_logres, ret);
+
+	return ret;
+}



* [PATCH 6/6] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-01 18:06   ` [PATCH 5/6] xfs: add xfs_calc_atomic_write_unit_max() Darrick J. Wong
@ 2025-07-01 18:06   ` Darrick J. Wong
  5 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:06 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: john.g.garry, catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Source kernel commit: 4528b9052731f14c1a9be16b98e33c9401e6d1bc

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_ioend transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has been for the past few commits.  Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 include/platform_defs.h |   14 ++++++++++
 include/xfs_trace.h     |    1 +
 libxfs/xfs_trans_resv.h |    4 +++
 libxfs/xfs_trans_resv.c |   69 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 88 insertions(+)
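
A minimal sketch, not part of the commit, of how an overflow-checked
multiply like the new check_mul_overflow() macro might be used.  It
assumes a GCC or Clang compiler, since it calls the
__builtin_mul_overflow() builtin that the macro wraps.
blocks_to_bytes() is a hypothetical caller written for this note.

```c
#include <stdbool.h>

/*
 * Illustration only: blocks_to_bytes() is a hypothetical caller, not libxfs
 * code.  It returns true and stores the product in *bytes when the
 * multiplication fits, and false when it wrapped.
 */
bool blocks_to_bytes(unsigned int blockcount, unsigned int blocksize,
		     unsigned int *bytes)
{
	/* The builtin returns true if the product wrapped. */
	if (__builtin_mul_overflow(blockcount, blocksize, bytes))
		return false;
	return true;
}
```

As the kerneldoc in the hunk notes, the destination is written even when
the multiplication wraps, so callers should check the return value before
using the result.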


diff --git a/include/platform_defs.h b/include/platform_defs.h
index 9af7b4318f8917..74a00583ebd6cb 100644
--- a/include/platform_defs.h
+++ b/include/platform_defs.h
@@ -261,6 +261,20 @@ static inline bool __must_check __must_check_overflow(bool overflow)
 	__builtin_add_overflow(__a, __b, __d);	\
 }))
 
+/**
+ * check_mul_overflow() - Calculate multiplication with overflow checking
+ * @a: first factor
+ * @b: second factor
+ * @d: pointer to store product
+ *
+ * Returns true on wrap-around, false otherwise.
+ *
+ * *@d holds the results of the attempted multiplication, regardless of whether
+ * wrap-around occurred.
+ */
+#define check_mul_overflow(a, b, d)	\
+	__must_check_overflow(__builtin_mul_overflow(a, b, d))
+
 /**
  * abs_diff - return absolute value of the difference between the arguments
  * @a: the first argument
diff --git a/include/xfs_trace.h b/include/xfs_trace.h
index 965c109cc1941a..6bbf4007cbbbd9 100644
--- a/include/xfs_trace.h
+++ b/include/xfs_trace.h
@@ -65,6 +65,7 @@
 #define trace_xfs_attr_rmtval_remove_return(...) ((void) 0)
 
 #define trace_xfs_calc_max_atomic_write_fsblocks(...) ((void) 0)
+#define trace_xfs_calc_max_atomic_write_log_geometry(...) ((void) 0)
 
 #define trace_xfs_log_recover_item_add_cont(a,b,c,d)	((void) 0)
 #define trace_xfs_log_recover_item_add(a,b,c,d)	((void) 0)
diff --git a/libxfs/xfs_trans_resv.h b/libxfs/xfs_trans_resv.h
index a6d303b836883f..336279e0fc6137 100644
--- a/libxfs/xfs_trans_resv.h
+++ b/libxfs/xfs_trans_resv.h
@@ -122,5 +122,9 @@ unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
+xfs_extlen_t xfs_calc_atomic_write_log_geometry(struct xfs_mount *mp,
+		xfs_extlen_t blockcount, unsigned int *new_logres);
+int xfs_calc_atomic_write_reservation(struct xfs_mount *mp,
+		xfs_extlen_t blockcount);
 
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/libxfs/xfs_trans_resv.c b/libxfs/xfs_trans_resv.c
index ec61ddfba44601..b3b9d22b54515d 100644
--- a/libxfs/xfs_trans_resv.c
+++ b/libxfs/xfs_trans_resv.c
@@ -1481,3 +1481,72 @@ xfs_calc_max_atomic_write_fsblocks(
 
 	return ret;
 }
+
+/*
+ * Compute the log blocks and transaction reservation needed to complete an
+ * atomic write of a given number of blocks.  Worst case, each block requires
+ * separate handling.  A return value of 0 means something went wrong.
+ */
+xfs_extlen_t
+xfs_calc_atomic_write_log_geometry(
+	struct xfs_mount	*mp,
+	xfs_extlen_t		blockcount,
+	unsigned int		*new_logres)
+{
+	struct xfs_trans_res	*curr_res = &M_RES(mp)->tr_atomic_ioend;
+	uint			old_logres = curr_res->tr_logres;
+	unsigned int		per_intent, step_size;
+	unsigned int		logres;
+	xfs_extlen_t		min_logblocks;
+
+	ASSERT(blockcount > 0);
+
+	xfs_calc_default_atomic_ioend_reservation(mp, M_RES(mp));
+
+	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
+
+	/* Check for overflows */
+	if (check_mul_overflow(blockcount, per_intent, &logres) ||
+	    check_add_overflow(logres, step_size, &logres))
+		return 0;
+
+	curr_res->tr_logres = logres;
+	min_logblocks = xfs_log_calc_minimum_size(mp);
+	curr_res->tr_logres = old_logres;
+
+	trace_xfs_calc_max_atomic_write_log_geometry(mp, per_intent, step_size,
+			blockcount, min_logblocks, logres);
+
+	*new_logres = logres;
+	return min_logblocks;
+}
+
+/*
+ * Compute the transaction reservation needed to complete an out of place
+ * atomic write of a given number of blocks.
+ */
+int
+xfs_calc_atomic_write_reservation(
+	struct xfs_mount	*mp,
+	xfs_extlen_t		blockcount)
+{
+	unsigned int		new_logres;
+	xfs_extlen_t		min_logblocks;
+
+	/*
+	 * If the caller doesn't ask for a specific atomic write size, then
+	 * use the defaults.
+	 */
+	if (blockcount == 0) {
+		xfs_calc_default_atomic_ioend_reservation(mp, M_RES(mp));
+		return 0;
+	}
+
+	min_logblocks = xfs_calc_atomic_write_log_geometry(mp, blockcount,
+			&new_logres);
+	if (!min_logblocks || min_logblocks > mp->m_sb.sb_logblocks)
+		return -EINVAL;
+
+	M_RES(mp)->tr_atomic_ioend.tr_logres = new_logres;
+	return 0;
+}
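
For readers following the arithmetic: the reservation that
xfs_calc_atomic_write_log_geometry() computes is
blockcount * per_intent + step_size, with both steps checked for
wrap-around.  Below is a minimal standalone sketch of that calculation.
It assumes the GCC/Clang overflow builtins; calc_logres() and its bool
return convention are hypothetical and are not part of libxfs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not libxfs API): compute
 * blockcount * per_intent + step_size.  Returns false if either step
 * wraps around, loosely mirroring the "0 means something went wrong"
 * convention used by the patch.
 */
static bool
calc_logres(
	unsigned int	blockcount,
	unsigned int	per_intent,
	unsigned int	step_size,
	unsigned int	*logres)
{
	unsigned int	tmp;

	assert(blockcount > 0);

	/* Both builtins return true when the result does not fit. */
	if (__builtin_mul_overflow(blockcount, per_intent, &tmp) ||
	    __builtin_add_overflow(tmp, step_size, &tmp))
		return false;

	*logres = tmp;
	return true;
}
```

In the patch itself the candidate value is then written into
tr_atomic_ioend only temporarily so that xfs_log_calc_minimum_size()
can evaluate it, and the old reservation is restored afterwards.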


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCHSET 2/3] xfsprogs: atomic writes
  2025-07-01 18:03 [PATCHBOMB] xfsprogs: ports and new code for 6.16 Darrick J. Wong
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
@ 2025-07-01 18:05 ` Darrick J. Wong
  2025-07-01 18:07   ` [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/ Darrick J. Wong
                     ` (6 more replies)
  2025-07-01 18:05 ` [PATCHSET 3/3] xfs_scrub: drop EXPERIMENTAL warning Darrick J. Wong
  2 siblings, 7 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:05 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: linux-xfs, catherine.hoang, john.g.garry, linux-xfs

Hi all,

Utility updates for dumping atomic writes configuration and formatting
filesystems to take advantage of the new functionality.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-writes-6.16

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-writes-6.16
---
Commits in this patchset:
 * libfrog: move statx.h from io/ to libfrog/
 * xfs_db: create an untorn_max subcommand
 * xfs_io: dump new atomic_write_unit_max_opt statx field
 * mkfs: don't complain about overly large auto-detected log stripe units
 * mkfs: autodetect log stripe unit for external log devices
 * mkfs: try to align AG size based on atomic write capabilities
 * mkfs: allow users to configure the desired maximum atomic write size
---
 include/bitops.h         |   12 ++
 include/libxfs.h         |    1 
 libfrog/statx.h          |   23 ++++
 libxfs/libxfs_api_defs.h |    5 +
 libxfs/topology.h        |    6 +
 db/logformat.c           |  129 ++++++++++++++++++++++++
 io/stat.c                |   21 +---
 libfrog/Makefile         |    1 
 libxfs/topology.c        |   36 +++++++
 m4/package_libcdev.m4    |    2 
 man/man8/mkfs.xfs.8.in   |    7 +
 man/man8/xfs_db.8        |   10 ++
 mkfs/xfs_mkfs.c          |  248 +++++++++++++++++++++++++++++++++++++++++++++-
 13 files changed, 472 insertions(+), 29 deletions(-)
 rename io/statx.h => libfrog/statx.h (94%)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
@ 2025-07-01 18:07   ` Darrick J. Wong
  2025-07-02  9:13     ` John Garry
  2025-07-01 18:07   ` [PATCH 2/7] xfs_db: create an untorn_max subcommand Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:07 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Move this header file so we can use it elsewhere.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libfrog/statx.h  |   17 +++++++++++++++++
 io/stat.c        |   20 ++------------------
 libfrog/Makefile |    1 +
 3 files changed, 20 insertions(+), 18 deletions(-)
 rename io/statx.h => libfrog/statx.h (96%)


diff --git a/io/statx.h b/libfrog/statx.h
similarity index 96%
rename from io/statx.h
rename to libfrog/statx.h
index f7ef1d2784a2a9..b76dfae21e7092 100644
--- a/io/statx.h
+++ b/libfrog/statx.h
@@ -146,7 +146,24 @@ struct statx {
 	__u64	__spare3[9];	/* Spare space for future expansion */
 	/* 0x100 */
 };
+
+static inline ssize_t
+statx(
+	int		dfd,
+	const char	*filename,
+	unsigned int	flags,
+	unsigned int	mask,
+	struct statx	*buffer)
+{
+#ifdef __NR_statx
+	return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
+#else
+	errno = ENOSYS;
+	return -1;
 #endif
+}
+
+#endif /* OVERRIDE_SYSTEM_STATX */
 
 #ifndef STATX_TYPE
 /*
diff --git a/io/stat.c b/io/stat.c
index c3a4bb15229ee5..46475df343470c 100644
--- a/io/stat.c
+++ b/io/stat.c
@@ -14,7 +14,7 @@
 #include "input.h"
 #include "init.h"
 #include "io.h"
-#include "statx.h"
+#include "libfrog/statx.h"
 #include "libxfs.h"
 #include "libfrog/logging.h"
 #include "libfrog/fsgeom.h"
@@ -305,22 +305,6 @@ statfs_f(
 	return 0;
 }
 
-static ssize_t
-_statx(
-	int		dfd,
-	const char	*filename,
-	unsigned int	flags,
-	unsigned int	mask,
-	struct statx	*buffer)
-{
-#ifdef __NR_statx
-	return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
-#else
-	errno = ENOSYS;
-	return -1;
-#endif
-}
-
 struct statx_masks {
 	const char	*name;
 	unsigned int	mask;
@@ -525,7 +509,7 @@ statx_f(
 		return command_usage(&statx_cmd);
 
 	memset(&stx, 0xbf, sizeof(stx));
-	if (_statx(file->fd, "", atflag | AT_EMPTY_PATH, mask, &stx) < 0) {
+	if (statx(file->fd, "", atflag | AT_EMPTY_PATH, mask, &stx) < 0) {
 		perror("statx");
 		exitcode = 1;
 		return 0;
diff --git a/libfrog/Makefile b/libfrog/Makefile
index b64ca4597f4ea9..560bad417ee434 100644
--- a/libfrog/Makefile
+++ b/libfrog/Makefile
@@ -62,6 +62,7 @@ ptvar.h \
 radix-tree.h \
 randbytes.h \
 scrub.h \
+statx.h \
 workqueue.h
 
 GETTEXT_PY = \


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/
  2025-07-01 18:07   ` [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/ Darrick J. Wong
@ 2025-07-02  9:13     ` John Garry
  0 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2025-07-02  9:13 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

On 01/07/2025 19:07, Darrick J. Wong wrote:

nit: statx() is being moved as well (as statx.h), if we want to be more 
specific

> From: Darrick J. Wong <djwong@kernel.org>
> 
> Move this header file so we can use it elsewhere.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

Regardless of my nitpick, FWIW:
Reviewed-by: John Garry <john.g.garry@oracle.com>




^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 2/7] xfs_db: create an untorn_max subcommand
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
  2025-07-01 18:07   ` [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/ Darrick J. Wong
@ 2025-07-01 18:07   ` Darrick J. Wong
  2025-07-09 15:39     ` John Garry
  2025-07-01 18:07   ` [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:07 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a debugger command to compute either the logres needed to
perform an untorn cow write completion for a given number of blocks; or
the number of blocks that can be completed given a log reservation.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/libxfs.h         |    1 
 libxfs/libxfs_api_defs.h |    4 +
 db/logformat.c           |  129 ++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/xfs_db.8        |   10 ++++
 4 files changed, 144 insertions(+)


diff --git a/include/libxfs.h b/include/libxfs.h
index b968a2b88da372..1e0d1a48fbb698 100644
--- a/include/libxfs.h
+++ b/include/libxfs.h
@@ -102,6 +102,7 @@ struct iomap;
 #include "xfs_rtbitmap.h"
 #include "xfs_rtrmap_btree.h"
 #include "xfs_ag_resv.h"
+#include "defer_item.h"
 
 #ifndef ARRAY_SIZE
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index c5fcb5e3229ae4..4bd02c57b496e6 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -108,6 +108,10 @@
 #define xfs_bunmapi			libxfs_bunmapi
 #define xfs_bwrite			libxfs_bwrite
 #define xfs_calc_dquots_per_chunk	libxfs_calc_dquots_per_chunk
+#define xfs_calc_finish_bui_reservation	libxfs_calc_finish_bui_reservation
+#define xfs_calc_finish_cui_reservation	libxfs_calc_finish_cui_reservation
+#define xfs_calc_finish_efi_reservation	libxfs_calc_finish_efi_reservation
+#define xfs_calc_finish_rui_reservation	libxfs_calc_finish_rui_reservation
 #define xfs_cntbt_init_cursor		libxfs_cntbt_init_cursor
 #define xfs_compute_rextslog		libxfs_compute_rextslog
 #define xfs_compute_rgblklog		libxfs_compute_rgblklog
diff --git a/db/logformat.c b/db/logformat.c
index 5edaa5494637c8..454ea20c0c7d5c 100644
--- a/db/logformat.c
+++ b/db/logformat.c
@@ -192,8 +192,137 @@ static const struct cmdinfo logres_cmd = {
 	.help =		logres_help,
 };
 
+STATIC void
+untorn_cow_limits(
+	struct xfs_mount	*mp,
+	unsigned int		logres,
+	unsigned int		desired_max)
+{
+	const unsigned int	efi = xfs_efi_log_space(1);
+	const unsigned int	efd = xfs_efd_log_space(1);
+	const unsigned int	rui = xfs_rui_log_space(1);
+	const unsigned int	rud = xfs_rud_log_space();
+	const unsigned int	cui = xfs_cui_log_space(1);
+	const unsigned int	cud = xfs_cud_log_space();
+	const unsigned int	bui = xfs_bui_log_space(1);
+	const unsigned int	bud = xfs_bud_log_space();
+
+	/*
+	 * Maximum overhead to complete an untorn write ioend in software:
+	 * remove data fork extent + remove cow fork extent + map extent into
+	 * data fork.
+	 *
+	 * tx0: Creates a BUI and a CUI and that's all it needs.
+	 *
+	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
+	 * enough space to relog the CUI (== CUI + CUD).
+	 *
+	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
+	 * to relog the CUI.
+	 *
+	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
+	 *
+	 * tx4: Roll again, need space for an EFD.
+	 *
+	 * If the extent referenced by the pair of BUI/CUI items is not the one
+	 * being currently processed, then we need to reserve space to relog
+	 * both items.
+	 */
+	const unsigned int	tx0 = bui + cui;
+	const unsigned int	tx1 = bud + rui + cui + cud;
+	const unsigned int	tx2 = rud + cui + cud;
+	const unsigned int	tx3 = cud + efi;
+	const unsigned int	tx4 = efd;
+	const unsigned int	relog = bui + bud + cui + cud;
+
+	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
+						 max3(tx3, tx4, relog));
+
+	/* Overhead to finish one step of each intent item type */
+	const unsigned int	f1 = libxfs_calc_finish_efi_reservation(mp, 1);
+	const unsigned int	f2 = libxfs_calc_finish_rui_reservation(mp, 1);
+	const unsigned int	f3 = libxfs_calc_finish_cui_reservation(mp, 1);
+	const unsigned int	f4 = libxfs_calc_finish_bui_reservation(mp, 1);
+
+	/* We only finish one item per transaction in a chain */
+	const unsigned int	step_size = max(f4, max3(f1, f2, f3));
+
+	if (desired_max) {
+		dbprintf(
+ "desired_max: %u\nstep_size: %u\nper_intent: %u\nlogres: %u\n",
+				desired_max, step_size, per_intent,
+				(desired_max * per_intent) + step_size);
+	} else if (logres) {
+		dbprintf(
+ "logres: %u\nstep_size: %u\nper_intent: %u\nmax_awu: %u\n",
+				logres, step_size, per_intent,
+				logres >= step_size ? (logres - step_size) / per_intent : 0);
+	}
+}
+
+static void
+untorn_max_help(void)
+{
+	dbprintf(_(
+"\n"
+" The 'untorn_max' command computes either the log reservation needed to\n"
+" complete an untorn write of a given block count; or the maximum number of\n"
+" blocks that can be completed given a specific log reservation.\n"
+"\n"
+	));
+}
+
+static int
+untorn_max_f(
+	int		argc,
+	char		**argv)
+{
+	unsigned int	logres = 0;
+	unsigned int	desired_max = 0;
+	int		c;
+
+	while ((c = getopt(argc, argv, "l:b:")) != EOF) {
+		switch (c) {
+		case 'l':
+			logres = atoi(optarg);
+			break;
+		case 'b':
+			desired_max = atoi(optarg);
+			break;
+		default:
+			untorn_max_help();
+			return 0;
+		}
+	}
+
+	if (!logres && !desired_max) {
+		dbprintf("untorn_max needs -l or -b option\n");
+		return 0;
+	}
+
+	if (xfs_has_reflink(mp))
+		untorn_cow_limits(mp, logres, desired_max);
+	else
+		dbprintf("untorn write emulation not supported\n");
+
+	return 0;
+}
+
+static const struct cmdinfo untorn_max_cmd = {
+	.name =		"untorn_max",
+	.altname =	NULL,
+	.cfunc =	untorn_max_f,
+	.argmin =	0,
+	.argmax =	-1,
+	.canpush =	0,
+	.args =		NULL,
+	.oneline =	N_("compute untorn write max"),
+	.help =		logres_help,
+};
+
 void
 logres_init(void)
 {
 	add_command(&logres_cmd);
+	add_command(&untorn_max_cmd);
 }
diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
index 2a9322560584b0..d4531fc0e380a3 100644
--- a/man/man8/xfs_db.8
+++ b/man/man8/xfs_db.8
@@ -1366,6 +1366,16 @@ .SH COMMANDS
 .IR name .
 The file being targetted will not be put on the iunlink list.
 .TP
+.BI "untorn_max [\-b " blockcount "|\-l " logres "]"
+If
+.B -l
+is specified, compute the maximum (in fsblocks) untorn write that we can
+emulate with copy on write given a log reservation size (in bytes).
+If
+.B -b
+is specified, compute the log reservation size that would be needed to
+emulate an untorn write of the given number of fsblocks.
+.TP
 .BI "uuid [" uuid " | " generate " | " rewrite " | " restore ]
 Set the filesystem universally unique identifier (UUID).
 The filesystem UUID can be used by
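
The two modes of the command are inverses of each other, up to integer
rounding.  The sketch below restates the formulas from
untorn_cow_limits() as standalone functions.  The struct item_sizes
type, the function names, and the sample numbers in the usage note are
all hypothetical; real sizes come from the xfs_*_log_space() helpers.

```c
#include <assert.h>

/* Hypothetical container for the log space used by each item type. */
struct item_sizes {
	unsigned int	efi, efd, rui, rud, cui, cud, bui, bud;
};

static unsigned int
max3u(unsigned int a, unsigned int b, unsigned int c)
{
	unsigned int	m = a > b ? a : b;

	return m > c ? m : c;
}

/* Worst-case log space for any one step of the tx0..tx4 chain. */
static unsigned int
per_intent_size(const struct item_sizes *s)
{
	unsigned int	tx0 = s->bui + s->cui;
	unsigned int	tx1 = s->bud + s->rui + s->cui + s->cud;
	unsigned int	tx2 = s->rud + s->cui + s->cud;
	unsigned int	tx3 = s->cud + s->efi;
	unsigned int	tx4 = s->efd;
	unsigned int	relog = s->bui + s->bud + s->cui + s->cud;
	unsigned int	lo = max3u(tx0, tx1, tx2);
	unsigned int	hi = max3u(tx3, tx4, relog);

	return lo > hi ? lo : hi;
}

/* -b mode: log reservation needed for a given block count. */
static unsigned int
blocks_to_logres(unsigned int blocks, unsigned int per_intent,
		 unsigned int step_size)
{
	/* Like the patch, this does not guard against wrap-around. */
	return blocks * per_intent + step_size;
}

/* -l mode: largest block count that fits in a given reservation. */
static unsigned int
logres_to_blocks(unsigned int logres, unsigned int per_intent,
		 unsigned int step_size)
{
	assert(per_intent > 0);
	return logres >= step_size ? (logres - step_size) / per_intent : 0;
}
```

With every item size set to 10, per_intent is 40 (tx1 and relog tie).
A step_size of 1000 then gives a logres of 1640 for 16 blocks, and
feeding 1640 back through the -l direction yields 16 blocks again.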


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] xfs_db: create an untorn_max subcommand
  2025-07-01 18:07   ` [PATCH 2/7] xfs_db: create an untorn_max subcommand Darrick J. Wong
@ 2025-07-09 15:39     ` John Garry
  2025-07-09 16:35       ` Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: John Garry @ 2025-07-09 15:39 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

>   };

Generally it looks ok, just some small comments.

If you are not too concerned with any comment, then feel free to add the 
following:

Reviewed-by: John Garry <john.g.garry@oracle.com>

>   
> +STATIC void
> +untorn_cow_limits(
> +	struct xfs_mount	*mp,
> +	unsigned int		logres,
> +	unsigned int		desired_max)
> +{
> +	const unsigned int	efi = xfs_efi_log_space(1);
> +	const unsigned int	efd = xfs_efd_log_space(1);
> +	const unsigned int	rui = xfs_rui_log_space(1);
> +	const unsigned int	rud = xfs_rud_log_space();
> +	const unsigned int	cui = xfs_cui_log_space(1);
> +	const unsigned int	cud = xfs_cud_log_space();
> +	const unsigned int	bui = xfs_bui_log_space(1);
> +	const unsigned int	bud = xfs_bud_log_space();
> +
> +	/*
> +	 * Maximum overhead to complete an untorn write ioend in software:
> +	 * remove data fork extent + remove cow fork extent + map extent into
> +	 * data fork.
> +	 *
> +	 * tx0: Creates a BUI and a CUI and that's all it needs.
> +	 *
> +	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
> +	 * enough space to relog the CUI (== CUI + CUD).
> +	 *
> +	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
> +	 * to relog the CUI.
> +	 *
> +	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
> +	 *
> +	 * tx4: Roll again, need space for an EFD.
> +	 *
> +	 * If the extent referenced by the pair of BUI/CUI items is not the one
> +	 * being currently processed, then we need to reserve space to relog
> +	 * both items.
> +	 */
> +	const unsigned int	tx0 = bui + cui;
> +	const unsigned int	tx1 = bud + rui + cui + cud;
> +	const unsigned int	tx2 = rud + cui + cud;
> +	const unsigned int	tx3 = cud + efi;
> +	const unsigned int	tx4 = efd;
> +	const unsigned int	relog = bui + bud + cui + cud;
> +
> +	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
> +						 max3(tx3, tx4, relog));
> +
> +	/* Overhead to finish one step of each intent item type */
> +	const unsigned int	f1 = libxfs_calc_finish_efi_reservation(mp, 1);
> +	const unsigned int	f2 = libxfs_calc_finish_rui_reservation(mp, 1);
> +	const unsigned int	f3 = libxfs_calc_finish_cui_reservation(mp, 1);
> +	const unsigned int	f4 = libxfs_calc_finish_bui_reservation(mp, 1);
> +
> +	/* We only finish one item per transaction in a chain */
> +	const unsigned int	step_size = max(f4, max3(f1, f2, f3));

This all looks to match xfs_calc_atomic_write_ioend_geometry(). I assume 
that there is a good reason why that code cannot be reused.

> +
> +	if (desired_max) {
> +		dbprintf(
> + "desired_max: %u\nstep_size: %u\nper_intent: %u\nlogres: %u\n",
> +				desired_max, step_size, per_intent,
> +				(desired_max * per_intent) + step_size);
> +	} else if (logres) {
> +		dbprintf(
> + "logres: %u\nstep_size: %u\nper_intent: %u\nmax_awu: %u\n",
> +				logres, step_size, per_intent,
> +				logres >= step_size ? (logres - step_size) / per_intent : 0);
> +	}
> +}
> +
> +static void
> +untorn_max_help(void)
> +{
> +	dbprintf(_(
> +"\n"
> +" The 'untorn_max' command computes either the log reservation needed to\n"
> +" complete an untorn write of a given block count; or the maximum number of\n"
> +" blocks that can be completed given a specific log reservation.\n"
> +"\n"
> +	));
> +}
> +
> +static int
> +untorn_max_f(
> +	int		argc,
> +	char		**argv)
> +{
> +	unsigned int	logres = 0;
> +	unsigned int	desired_max = 0;
> +	int		c;
> +
> +	while ((c = getopt(argc, argv, "l:b:")) != EOF) {
> +		switch (c) {
> +		case 'l':
> +			logres = atoi(optarg);
> +			break;
> +		case 'b':
> +			desired_max = atoi(optarg);
> +			break;
> +		default:
> +			untorn_max_help();
> +			return 0;
> +		}
> +	}

From untorn_cow_limits(), it seems that it's best not to give both 'l' and
'b', as we only ever print one value. As such, would it be better to set
argmax = 1 (or whatever is needed to accept only 'l' or 'b')?

> +
> +	if (!logres && !desired_max) {
> +		dbprintf("untorn_max needs -l or -b option\n");
> +		return 0;

similar db command handlers use -1, but I guess that it's not important 
here since you just rely on the print message output always

> +	}
> +
> +	if (xfs_has_reflink(mp))

this check could be put earlier

> +		untorn_cow_limits(mp, logres, desired_max);
> +	else
> +		dbprintf("untorn write emulation not supported\n");
> +
> +	return 0;
> +}
> +
> +static const struct cmdinfo untorn_max_cmd = {

it would be nice to use untorn_write_max_cmd

> +	.name =		"untorn_max",
> +	.altname =	NULL,
> +	.cfunc =	untorn_max_f,
> +	.argmin =	0,
> +	.argmax =	-1,
> +	.canpush =	0,
> +	.args =		NULL,
> +	.oneline =	N_("compute untorn write max"),
> +	.help =		logres_help,
> +};
> +
>   void
>   logres_init(void)
>   {
>   	add_command(&logres_cmd);
> +	add_command(&untorn_max_cmd);
>   }
> diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
> index 2a9322560584b0..d4531fc0e380a3 100644
> --- a/man/man8/xfs_db.8
> +++ b/man/man8/xfs_db.8
> @@ -1366,6 +1366,16 @@ .SH COMMANDS
>   .IR name .
>   The file being targetted will not be put on the iunlink list.
>   .TP
> +.BI "untorn_max [\-b " blockcount "|\-l " logres "]"
> +If
> +.B -l
> +is specified, compute the maximum (in fsblocks) untorn write that we can
> +emulate with copy on write given a log reservation size (in bytes).
> +If
> +.B -b
> +is specified, compute the log reservation size that would be needed to
> +emulate an untorn write of the given number of fsblocks.
> +.TP
>   .BI "uuid [" uuid " | " generate " | " rewrite " | " restore ]
>   Set the filesystem universally unique identifier (UUID).
>   The filesystem UUID can be used by
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/7] xfs_db: create an untorn_max subcommand
  2025-07-09 15:39     ` John Garry
@ 2025-07-09 16:35       ` Darrick J. Wong
  0 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-09 16:35 UTC (permalink / raw)
  To: John Garry; +Cc: aalbersh, catherine.hoang, linux-xfs

On Wed, Jul 09, 2025 at 04:39:12PM +0100, John Garry wrote:
> >   };
> 
> Generally it looks ok, just some small comments.
> 
> If you are not too concerned with any comment, then feel free to add the
> following:
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>

Thanks!  Replies below.

> > +STATIC void
> > +untorn_cow_limits(
> > +	struct xfs_mount	*mp,
> > +	unsigned int		logres,
> > +	unsigned int		desired_max)
> > +{
> > +	const unsigned int	efi = xfs_efi_log_space(1);
> > +	const unsigned int	efd = xfs_efd_log_space(1);
> > +	const unsigned int	rui = xfs_rui_log_space(1);
> > +	const unsigned int	rud = xfs_rud_log_space();
> > +	const unsigned int	cui = xfs_cui_log_space(1);
> > +	const unsigned int	cud = xfs_cud_log_space();
> > +	const unsigned int	bui = xfs_bui_log_space(1);
> > +	const unsigned int	bud = xfs_bud_log_space();
> > +
> > +	/*
> > +	 * Maximum overhead to complete an untorn write ioend in software:
> > +	 * remove data fork extent + remove cow fork extent + map extent into
> > +	 * data fork.
> > +	 *
> > +	 * tx0: Creates a BUI and a CUI and that's all it needs.
> > +	 *
> > +	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
> > +	 * enough space to relog the CUI (== CUI + CUD).
> > +	 *
> > +	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
> > +	 * to relog the CUI.
> > +	 *
> > +	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
> > +	 *
> > +	 * tx4: Roll again, need space for an EFD.
> > +	 *
> > +	 * If the extent referenced by the pair of BUI/CUI items is not the one
> > +	 * being currently processed, then we need to reserve space to relog
> > +	 * both items.
> > +	 */
> > +	const unsigned int	tx0 = bui + cui;
> > +	const unsigned int	tx1 = bud + rui + cui + cud;
> > +	const unsigned int	tx2 = rud + cui + cud;
> > +	const unsigned int	tx3 = cud + efi;
> > +	const unsigned int	tx4 = efd;
> > +	const unsigned int	relog = bui + bud + cui + cud;
> > +
> > +	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
> > +						 max3(tx3, tx4, relog));
> > +
> > +	/* Overhead to finish one step of each intent item type */
> > +	const unsigned int	f1 = libxfs_calc_finish_efi_reservation(mp, 1);
> > +	const unsigned int	f2 = libxfs_calc_finish_rui_reservation(mp, 1);
> > +	const unsigned int	f3 = libxfs_calc_finish_cui_reservation(mp, 1);
> > +	const unsigned int	f4 = libxfs_calc_finish_bui_reservation(mp, 1);
> > +
> > +	/* We only finish one item per transaction in a chain */
> > +	const unsigned int	step_size = max(f4, max3(f1, f2, f3));
> 
> This all looks to match xfs_calc_atomic_write_ioend_geometry(). I assume
> that there is a good reason why that code cannot be reused.

Hrmm, that /would/ be a good refactoring opportunity.  Oh, ugh:

STATIC unsigned int
xfs_calc_atomic_write_ioend_geometry(
	struct xfs_mount	*mp,
	unsigned int		*step_size)

Ok, I'd have to get that into 6.17... :(

> > +
> > +	if (desired_max) {
> > +		dbprintf(
> > + "desired_max: %u\nstep_size: %u\nper_intent: %u\nlogres: %u\n",
> > +				desired_max, step_size, per_intent,
> > +				(desired_max * per_intent) + step_size);
> > +	} else if (logres) {
> > +		dbprintf(
> > + "logres: %u\nstep_size: %u\nper_intent: %u\nmax_awu: %u\n",
> > +				logres, step_size, per_intent,
> > +				logres >= step_size ? (logres - step_size) / per_intent : 0);
> > +	}
> > +}
> > +
> > +static void
> > +untorn_max_help(void)
> > +{
> > +	dbprintf(_(
> > +"\n"
> > +" The 'untorn_max' command computes either the log reservation needed to\n"
> > +" complete an untorn write of a given block count; or the maximum number of\n"
> > +" blocks that can be completed given a specific log reservation.\n"
> > +"\n"
> > +	));
> > +}
> > +
> > +static int
> > +untorn_max_f(
> > +	int		argc,
> > +	char		**argv)
> > +{
> > +	unsigned int	logres = 0;
> > +	unsigned int	desired_max = 0;
> > +	int		c;
> > +
> > +	while ((c = getopt(argc, argv, "l:b:")) != EOF) {
> > +		switch (c) {
> > +		case 'l':
> > +			logres = atoi(optarg);
> > +			break;
> > +		case 'b':
> > +			desired_max = atoi(optarg);
> > +			break;
> > +		default:
> > +			untorn_max_help();
> > +			return 0;
> > +		}
> > +	}
> 
> From untorn_cow_limits(), it seems that it's best not to give both 'l' and 'b',
> as we only ever print one value. As such, would it be better to set argmax = 1
> (or whatever is needed to accept only 'l' or 'b')?
> 
> > +
> > +	if (!logres && !desired_max) {
> > +		dbprintf("untorn_max needs -l or -b option\n");
> > +		return 0;
> 
> similar db command handlers use -1, but I guess that it's not important here
> since you just rely on the print message output always

I think you'd have to set argmax = 2 to pick up the parameter, right?
And then you'd still allow "untorn_max -l -b" which would immediately
fail, obviously.  But this works just as well.

> > +	}
> > +
> > +	if (xfs_has_reflink(mp))
> 
> this check could be put earlier

Sure, but what would be gained?  All we've done so far is parsed the CLI
options.

> > +		untorn_cow_limits(mp, logres, desired_max);
> > +	else
> > +		dbprintf("untorn write emulation not supported\n");
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct cmdinfo untorn_max_cmd = {
> 
> it would be nice to use untorn_write_max_cmd

<nod> I'll s/untorn_max/untorn_write_max/ in this file.

--D

> > +	.name =		"untorn_max",
> > +	.altname =	NULL,
> > +	.cfunc =	untorn_max_f,
> > +	.argmin =	0,
> > +	.argmax =	-1,
> > +	.canpush =	0,
> > +	.args =		NULL,
> > +	.oneline =	N_("compute untorn write max"),
> > +	.help =		logres_help,
> > +};
> > +
> >   void
> >   logres_init(void)
> >   {
> >   	add_command(&logres_cmd);
> > +	add_command(&untorn_max_cmd);
> >   }
> > diff --git a/man/man8/xfs_db.8 b/man/man8/xfs_db.8
> > index 2a9322560584b0..d4531fc0e380a3 100644
> > --- a/man/man8/xfs_db.8
> > +++ b/man/man8/xfs_db.8
> > @@ -1366,6 +1366,16 @@ .SH COMMANDS
> >   .IR name .
> >   The file being targetted will not be put on the iunlink list.
> >   .TP
> > +.BI "untorn_max [\-b " blockcount "|\-l " logres "]"
> > +If
> > +.B -l
> > +is specified, compute the maximum (in fsblocks) untorn write that we can
> > +emulate with copy on write given a log reservation size (in bytes).
> > +If
> > +.B -b
> > +is specified, compute the log reservation size that would be needed to
> > +emulate an untorn write of the given number of fsblocks.
> > +.TP
> >   .BI "uuid [" uuid " | " generate " | " rewrite " | " restore ]
> >   Set the filesystem universally unique identifier (UUID).
> >   The filesystem UUID can be used by
> > 
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
  2025-07-01 18:07   ` [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/ Darrick J. Wong
  2025-07-01 18:07   ` [PATCH 2/7] xfs_db: create an untorn_max subcommand Darrick J. Wong
@ 2025-07-01 18:07   ` Darrick J. Wong
  2025-07-02  8:23     ` John Garry
  2025-07-01 18:07   ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:07 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Dump the new atomic writes statx field that's being submitted for 6.16.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libfrog/statx.h       |    6 +++++-
 io/stat.c             |    1 +
 m4/package_libcdev.m4 |    2 +-
 3 files changed, 7 insertions(+), 2 deletions(-)


diff --git a/libfrog/statx.h b/libfrog/statx.h
index b76dfae21e7092..e11e2d8f49fa5f 100644
--- a/libfrog/statx.h
+++ b/libfrog/statx.h
@@ -143,7 +143,11 @@ struct statx {
 	__u32	stx_dio_read_offset_align;
 
 	/* 0xb8 */
-	__u64	__spare3[9];	/* Spare space for future expansion */
+	/* Optimised max atomic write unit in bytes */
+	__u32	stx_atomic_write_unit_max_opt;
+	__u32	__spare2[1];
+	/* 0xc0 */
+	__u64	__spare3[8];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
diff --git a/io/stat.c b/io/stat.c
index 46475df343470c..c257037aa8eec3 100644
--- a/io/stat.c
+++ b/io/stat.c
@@ -396,6 +396,7 @@ dump_raw_statx(struct statx *stx)
 	printf("stat.atomic_write_unit_max = %u\n", stx->stx_atomic_write_unit_max);
 	printf("stat.atomic_write_segments_max = %u\n", stx->stx_atomic_write_segments_max);
 	printf("stat.dio_read_offset_align = %u\n", stx->stx_dio_read_offset_align);
+	printf("stat.atomic_write_unit_max_opt = %u\n", stx->stx_atomic_write_unit_max_opt);
 	return 0;
 }
 
diff --git a/m4/package_libcdev.m4 b/m4/package_libcdev.m4
index 61353d0aa9d536..b77ac1a7580a80 100644
--- a/m4/package_libcdev.m4
+++ b/m4/package_libcdev.m4
@@ -126,7 +126,7 @@ AC_DEFUN([AC_NEED_INTERNAL_FSCRYPT_POLICY_V2],
 AC_DEFUN([AC_NEED_INTERNAL_STATX],
   [ AC_CHECK_TYPE(struct statx,
       [
-        AC_CHECK_MEMBER(struct statx.stx_dio_read_offset_align,
+        AC_CHECK_MEMBER(struct statx.stx_atomic_write_unit_max_opt,
           ,
           need_internal_statx=yes,
           [#include <linux/stat.h>]


^ permalink raw reply related	[flat|nested] 33+ messages in thread
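
A quick way to sanity-check the libfrog/statx.h hunk in the patch above: the
new 32-bit field plus one 32-bit pad take the place of the first 64-bit spare,
so the remaining spare array should start at 0xc0 and the structure should
still end at 0x100.  The sketch below is not part of the patch;
statx_tail_old and statx_tail_new are hypothetical stand-ins that model only
the part of struct statx starting at offset 0xb8.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Offset of the changed region, taken from the comment in the hunk. */
#define STATX_TAIL_START	0xb8

/* Hypothetical stand-in: tail of struct statx before the patch. */
struct statx_tail_old {
	uint64_t	spare3[9];
};

/* Hypothetical stand-in: tail of struct statx after the patch. */
struct statx_tail_new {
	uint32_t	atomic_write_unit_max_opt;
	uint32_t	spare2[1];
	uint64_t	spare3[8];
};

/* The new field must be carved out of spare space, not appended. */
static_assert(sizeof(struct statx_tail_old) == sizeof(struct statx_tail_new),
	      "tail size must not change");

/* Offset of the remaining spare array within the full structure. */
static inline size_t new_spare3_offset(void)
{
	return STATX_TAIL_START + offsetof(struct statx_tail_new, spare3);
}

/* Offset one past the end of the full structure. */
static inline size_t new_struct_end(void)
{
	return STATX_TAIL_START + sizeof(struct statx_tail_new);
}
```

With these stand-ins, new_spare3_offset() and new_struct_end() should match
the 0xc0 and 0x100 comments in the hunk.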

* Re: [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field
  2025-07-01 18:07   ` [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field Darrick J. Wong
@ 2025-07-02  8:23     ` John Garry
  0 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2025-07-02  8:23 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

On 01/07/2025 19:07, Darrick J. Wong wrote:
> From: Darrick J. Wong<djwong@kernel.org>
> 
> Dump the new atomic writes statx field that's being submitted for 6.16.
> 
> Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>

thanks, FWIW:

Reviewed-by: John Garry <john.g.garry@oracle.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
                     ` (2 preceding siblings ...)
  2025-07-01 18:07   ` [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field Darrick J. Wong
@ 2025-07-01 18:07   ` Darrick J. Wong
  2025-07-08 14:38     ` John Garry
  2025-07-01 18:08   ` [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:07 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: linux-xfs, catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller.  It should
not do that when the lsunit was autodetected from the block devices.

The cli parameters are zero-initialized in main and always have been.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 812241c49a5494..8b946f3ef817da 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3629,7 +3629,7 @@ _("log stripe unit (%d) must be a multiple of the block size (%d)\n"),
 	if (cfg->sb_feat.log_version == 2 &&
 	    cfg->lsunit * cfg->blocksize > 256 * 1024) {
 		/* Warn only if specified on commandline */
-		if (cli->lsu || cli->lsunit != -1) {
+		if (cli->lsu || cli->lsunit) {
 			fprintf(stderr,
 _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
   "log stripe unit adjusted to 32KiB\n"),


^ permalink raw reply related	[flat|nested] 33+ messages in thread
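
To illustrate why the one-line change in the patch above matters, here is a
minimal sketch of the two conditions.  It assumes, as the commit message
says, that the CLI parameters start out zeroed; struct cli_lsunit_args is a
hypothetical, trimmed-down stand-in for the real cli_params, and the field
types are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two log stripe fields of cli_params. */
struct cli_lsunit_args {
	char	*lsu;		/* -l su=..., NULL if not given */
	int	lsunit;		/* -l sunit=..., 0 if not given */
};

/* Old test: a zeroed lsunit is never -1, so this is true even with no option. */
static inline bool warn_old(const struct cli_lsunit_args *cli)
{
	assert(cli != NULL);
	return cli->lsu || cli->lsunit != -1;
}

/* New test: true only when the user actually passed su= or sunit=. */
static inline bool warn_new(const struct cli_lsunit_args *cli)
{
	assert(cli != NULL);
	return cli->lsu || cli->lsunit;
}
```

For a zero-initialized structure warn_old() returns true while warn_new()
returns false, which is the autodetected case that should stay quiet.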

* Re: [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units
  2025-07-01 18:07   ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
@ 2025-07-08 14:38     ` John Garry
  2025-07-08 15:05       ` Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: John Garry @ 2025-07-08 14:38 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: linux-xfs, catherine.hoang

On 01/07/2025 19:07, Darrick J. Wong wrote:
> From: Darrick J. Wong<djwong@kernel.org>
> 
> If mkfs declines to apply what it thinks is an overly large data device
> stripe unit to the log device, it should only log a message about that
> if the lsunit parameter was actually supplied by the caller.  It should
> not do that when the lsunit was autodetected from the block devices.
> 
> The cli parameters are zero-initialized in main and always have been.
> 
> Cc:<linux-xfs@vger.kernel.org> # v4.15.0
> Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
> Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>

Makes sense, so FWIW:

John Garry <john.g.garry@oracle.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units
  2025-07-08 14:38     ` John Garry
@ 2025-07-08 15:05       ` Darrick J. Wong
  2025-07-08 15:07         ` John Garry
  0 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-08 15:05 UTC (permalink / raw)
  To: John Garry; +Cc: aalbersh, linux-xfs, catherine.hoang

On Tue, Jul 08, 2025 at 03:38:10PM +0100, John Garry wrote:
> On 01/07/2025 19:07, Darrick J. Wong wrote:
> > From: Darrick J. Wong<djwong@kernel.org>
> > 
> > If mkfs declines to apply what it thinks is an overly large data device
> > stripe unit to the log device, it should only log a message about that
> > if the lsunit parameter was actually supplied by the caller.  It should
> > not do that when the lsunit was autodetected from the block devices.
> > 
> > The cli parameters are zero-initialized in main and always have been.
> > 
> > Cc:<linux-xfs@vger.kernel.org> # v4.15.0
> > Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
> > Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
> 
> Makes sense, so FWIW:
> 
> John Garry <john.g.garry@oracle.com>

Um.... is this a Reviewed-by: ?

--D

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units
  2025-07-08 15:05       ` Darrick J. Wong
@ 2025-07-08 15:07         ` John Garry
  0 siblings, 0 replies; 33+ messages in thread
From: John Garry @ 2025-07-08 15:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs, catherine.hoang

On 08/07/2025 16:05, Darrick J. Wong wrote:
> On Tue, Jul 08, 2025 at 03:38:10PM +0100, John Garry wrote:
>> On 01/07/2025 19:07, Darrick J. Wong wrote:
>>> From: Darrick J. Wong<djwong@kernel.org>
>>>
>>> If mkfs declines to apply what it thinks is an overly large data device
>>> stripe unit to the log device, it should only log a message about that
>>> if the lsunit parameter was actually supplied by the caller.  It should
>>> not do that when the lsunit was autodetected from the block devices.
>>>
>>> The cli parameters are zero-initialized in main and always have been.
>>>
>>> Cc:<linux-xfs@vger.kernel.org> # v4.15.0
>>> Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
>>> Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
>>
>> Makes sense, so FWIW:
>>
>> John Garry <john.g.garry@oracle.com>
> 
> Um.... is this a Reviewed-by: ?

oops, yes:

Reviewed-by: John Garry <john.g.garry@oracle.com>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
                     ` (3 preceding siblings ...)
  2025-07-01 18:07   ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
@ 2025-07-01 18:08   ` Darrick J. Wong
  2025-07-09 15:57     ` John Garry
  2025-07-01 18:08   ` [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities Darrick J. Wong
  2025-07-01 18:08   ` [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size Darrick J. Wong
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:08 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we're using an external log device and the caller doesn't give us a
lsunit, use the block device geometry (if present) to set it.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mkfs/xfs_mkfs.c |    4 ++++
 1 file changed, 4 insertions(+)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 8b946f3ef817da..6c8cc715d3476b 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3624,6 +3624,10 @@ _("log stripe unit (%d) must be a multiple of the block size (%d)\n"),
 		   cfg->loginternal && cfg->dsunit) {
 		/* lsunit and dsunit now in fs blocks */
 		cfg->lsunit = cfg->dsunit;
+	} else if (cfg->sb_feat.log_version == 2 &&
+		   !cfg->loginternal) {
+		/* use the external log device properties */
+		cfg->lsunit = DTOBT(ft->log.sunit, cfg->blocklog);
 	}
 
 	if (cfg->sb_feat.log_version == 2 &&


^ permalink raw reply related	[flat|nested] 33+ messages in thread
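
The unit conversion in the patch above can be sketched as follows.  This
assumes that the topology code reports the stripe unit in 512-byte basic
blocks and that DTOBT() is a plain right shift by (blocklog - 9);
bb_to_fsb() and lsunit_too_large() are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BBSHIFT	9	/* a basic block is 512 bytes */

/*
 * Convert a count of 512-byte basic blocks to filesystem blocks, rounding
 * down.  Assumed to mirror what DTOBT() does.
 */
static inline uint64_t bb_to_fsb(uint64_t bbcount, unsigned int blocklog)
{
	assert(blocklog >= BBSHIFT);
	return bbcount >> (blocklog - BBSHIFT);
}

/*
 * Mirrors the size test visible in the hunk context: lsunit is in fs blocks
 * and anything above 256KiB is considered too large.
 */
static inline int lsunit_too_large(uint64_t lsunit_fsb, unsigned int blocksize)
{
	return lsunit_fsb * blocksize > 256 * 1024;
}
```

For example, an external log device reporting a 64KiB stripe unit (128
basic blocks) on a 4KiB-block filesystem would give an lsunit of 16 fs
blocks, which is under the 256KiB limit.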

* Re: [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices
  2025-07-01 18:08   ` [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices Darrick J. Wong
@ 2025-07-09 15:57     ` John Garry
  2025-07-09 16:45       ` Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: John Garry @ 2025-07-09 15:57 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

On 01/07/2025 19:08, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> If we're using an external log device and the caller doesn't give us a
> lsunit, use the block device geometry (if present) to set it.

this seems fine, but I am not intimately familiar with the code. So, FWIW:

Reviewed-by: John Garry <john.g.garry@oracle.com>

There is a small question below, though.

> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>   mkfs/xfs_mkfs.c |    4 ++++
>   1 file changed, 4 insertions(+)
> 
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 8b946f3ef817da..6c8cc715d3476b 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c

nit: maybe the comment on method of calculation (not shown, but begins 
"check the log sunit is modulo ...") could be updated

> @@ -3624,6 +3624,10 @@ _("log stripe unit (%d) must be a multiple of the block size (%d)\n"),
>   		   cfg->loginternal && cfg->dsunit) {
>   		/* lsunit and dsunit now in fs blocks */
>   		cfg->lsunit = cfg->dsunit;
> +	} else if (cfg->sb_feat.log_version == 2 &&
> +		   !cfg->loginternal) {
> +		/* use the external log device properties */
> +		cfg->lsunit = DTOBT(ft->log.sunit, cfg->blocklog);
>   	}
>   
>   	if (cfg->sb_feat.log_version == 2 &&
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices
  2025-07-09 15:57     ` John Garry
@ 2025-07-09 16:45       ` Darrick J. Wong
  0 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-09 16:45 UTC (permalink / raw)
  To: John Garry; +Cc: aalbersh, catherine.hoang, linux-xfs

On Wed, Jul 09, 2025 at 04:57:49PM +0100, John Garry wrote:
> On 01/07/2025 19:08, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > If we're using an external log device and the caller doesn't give us a
> > lsunit, use the block device geometry (if present) to set it.
> 
> this seems fine, but I am not intimately familiar with the code. So, FWIW:
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
> There is a small question below, though.
> 
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >   mkfs/xfs_mkfs.c |    4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > 
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index 8b946f3ef817da..6c8cc715d3476b 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> 
> nit: maybe the comment on method of calculation (not shown, but begins
> "check the log sunit is modulo ...") could be updated

Ok.

	/*
	 * check that log sunit is modulo fsblksize; default it to dsunit for
	 * an internal log; or the log device stripe unit if it's external.
	 */

Thanks for the review.

--D

> > @@ -3624,6 +3624,10 @@ _("log stripe unit (%d) must be a multiple of the block size (%d)\n"),
> >   		   cfg->loginternal && cfg->dsunit) {
> >   		/* lsunit and dsunit now in fs blocks */
> >   		cfg->lsunit = cfg->dsunit;
> > +	} else if (cfg->sb_feat.log_version == 2 &&
> > +		   !cfg->loginternal) {
> > +		/* use the external log device properties */
> > +		cfg->lsunit = DTOBT(ft->log.sunit, cfg->blocklog);
> >   	}
> >   	if (cfg->sb_feat.log_version == 2 &&
> > 
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
                     ` (4 preceding siblings ...)
  2025-07-01 18:08   ` [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices Darrick J. Wong
@ 2025-07-01 18:08   ` Darrick J. Wong
  2025-07-02  9:03     ` John Garry
  2025-07-01 18:08   ` [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size Darrick J. Wong
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:08 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Try to align the AG size to the maximum hardware atomic write unit so
that we can give users maximum flexibility in choosing an RWF_ATOMIC
write size.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 libxfs/topology.h |    6 ++++--
 libxfs/topology.c |   36 ++++++++++++++++++++++++++++++++++++
 mkfs/xfs_mkfs.c   |   48 +++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 83 insertions(+), 7 deletions(-)


diff --git a/libxfs/topology.h b/libxfs/topology.h
index 207a8a7f150556..f0ca65f3576e92 100644
--- a/libxfs/topology.h
+++ b/libxfs/topology.h
@@ -13,8 +13,10 @@
 struct device_topology {
 	int	logical_sector_size;	/* logical sector size */
 	int	physical_sector_size;	/* physical sector size */
-	int	sunit;		/* stripe unit */
-	int	swidth;		/* stripe width  */
+	int	sunit;			/* stripe unit */
+	int	swidth;			/* stripe width  */
+	int	awu_min;		/* min atomic write unit in bbcounts */
+	int	awu_max;		/* max atomic write unit in bbcounts */
 };
 
 struct fs_topology {
diff --git a/libxfs/topology.c b/libxfs/topology.c
index 96ee74b61b30f5..7764687beac000 100644
--- a/libxfs/topology.c
+++ b/libxfs/topology.c
@@ -4,11 +4,18 @@
  * All Rights Reserved.
  */
 
+#ifdef OVERRIDE_SYSTEM_STATX
+#define statx sys_statx
+#endif
+#include <fcntl.h>
+#include <sys/stat.h>
+
 #include "libxfs_priv.h"
 #include "libxcmd.h"
 #include <blkid/blkid.h>
 #include "xfs_multidisk.h"
 #include "libfrog/platform.h"
+#include "libfrog/statx.h"
 
 #define TERABYTES(count, blog)	((uint64_t)(count) << (40 - (blog)))
 #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
@@ -278,6 +285,34 @@ blkid_get_topology(
 		device);
 }
 
+static void
+get_hw_atomic_writes_topology(
+	struct libxfs_dev	*dev,
+	struct device_topology	*dt)
+{
+	struct statx		sx;
+	int			fd;
+	int			ret;
+
+	fd = open(dev->name, O_RDONLY);
+	if (fd < 0)
+		return;
+
+	ret = statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &sx);
+	if (ret)
+		goto out_close;
+
+	if (!(sx.stx_mask & STATX_WRITE_ATOMIC))
+		goto out_close;
+
+	dt->awu_min = sx.stx_atomic_write_unit_min >> 9;
+	dt->awu_max = max(sx.stx_atomic_write_unit_max_opt,
+			  sx.stx_atomic_write_unit_max) >> 9;
+
+out_close:
+	close(fd);
+}
+
 static void
 get_device_topology(
 	struct libxfs_dev	*dev,
@@ -316,6 +351,7 @@ get_device_topology(
 		}
 	} else {
 		blkid_get_topology(dev->name, dt, force_overwrite);
+		get_hw_atomic_writes_topology(dev, dt);
 	}
 
 	ASSERT(dt->logical_sector_size);
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 6c8cc715d3476b..7d3e9dd567b7b2 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3379,6 +3379,32 @@ _("illegal CoW extent size hint %lld, must be less than %u and a multiple of %u.
 	}
 }
 
+static void
+validate_device_awu(
+	struct mkfs_params	*cfg,
+	struct device_topology	*dt)
+{
+	/* Ignore hw atomic write capability if it can't do even 1 fsblock */
+	if (BBTOB(dt->awu_min) > cfg->blocksize ||
+	    BBTOB(dt->awu_max) < cfg->blocksize) {
+		dt->awu_min = 0;
+		dt->awu_max = 0;
+	}
+}
+
+static void
+validate_hw_atomic_writes(
+	struct mkfs_params	*cfg,
+	struct cli_params	*cli,
+	struct fs_topology	*ft)
+{
+	validate_device_awu(cfg, &ft->data);
+	if (cli->xi->log.name)
+		validate_device_awu(cfg, &ft->log);
+	if (cli->xi->rt.name)
+		validate_device_awu(cfg, &ft->rt);
+}
+
 /* Complain if this filesystem is not a supported configuration. */
 static void
 validate_supported(
@@ -4051,10 +4077,20 @@ _("agsize (%s) not a multiple of fs blk size (%d)\n"),
  */
 static void
 align_ag_geometry(
-	struct mkfs_params	*cfg)
+	struct mkfs_params	*cfg,
+	struct fs_topology	*ft)
 {
-	uint64_t	tmp_agsize;
-	int		dsunit = cfg->dsunit;
+	uint64_t		tmp_agsize;
+	int			dsunit = cfg->dsunit;
+
+	/*
+	 * We've already validated (or discarded) the hardware atomic write
+	 * geometry.  Try to align the agsize to the maximum atomic write unit
+	 * to give users maximum flexibility in choosing atomic write sizes.
+	 */
+	if (ft->data.awu_max > 0)
+		dsunit = max(DTOBT(ft->data.awu_max, cfg->blocklog),
+				dsunit);
 
 	if (!dsunit)
 		goto validate;
@@ -4110,7 +4146,8 @@ _("agsize rounded to %lld, sunit = %d\n"),
 				(long long)cfg->agsize, dsunit);
 	}
 
-	if ((cfg->agsize % cfg->dswidth) == 0 &&
+	if (cfg->dswidth > 0 &&
+	    (cfg->agsize % cfg->dswidth) == 0 &&
 	    cfg->dswidth != cfg->dsunit &&
 	    cfg->agcount > 1) {
 
@@ -5874,6 +5911,7 @@ main(
 	cfg.rtblocks = calc_dev_size(cli.rtsize, &cfg, &ropts, R_SIZE, "rt");
 
 	validate_rtextsize(&cfg, &cli, &ft);
+	validate_hw_atomic_writes(&cfg, &cli, &ft);
 
 	/*
 	 * Open and validate the device configurations
@@ -5892,7 +5930,7 @@ main(
 	 * aligns to device geometry correctly.
 	 */
 	calculate_initial_ag_geometry(&cfg, &cli, &xi);
-	align_ag_geometry(&cfg);
+	align_ag_geometry(&cfg, &ft);
 	if (cfg.sb_feat.zoned)
 		calculate_zone_geometry(&cfg, &cli, &xi, &zt);
 	else


^ permalink raw reply related	[flat|nested] 33+ messages in thread
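
The three steps in the patch above (record the statx byte values as basic
block counts, discard them if one fs block does not fit, then fold awu_max
into the AG alignment unit) can be sketched in isolation.  This is a
simplified illustration, not the mkfs code: struct awu_limits and the
function names are hypothetical, and the shift-based conversions are assumed
to match BBTOB() and DTOBT().

```c
#include <assert.h>
#include <stdint.h>

#define BBSHIFT	9	/* a basic block is 512 bytes */

/* Hypothetical subset of struct device_topology. */
struct awu_limits {
	int	awu_min;	/* min atomic write unit, in basic blocks */
	int	awu_max;	/* max atomic write unit, in basic blocks */
};

/* Record statx byte values as basic block counts (the ">> 9" in the patch). */
static inline void awu_from_bytes(struct awu_limits *dt, uint32_t min_bytes,
				  uint32_t max_bytes, uint32_t max_opt_bytes)
{
	uint32_t max = max_opt_bytes > max_bytes ? max_opt_bytes : max_bytes;

	dt->awu_min = min_bytes >> BBSHIFT;
	dt->awu_max = max >> BBSHIFT;
}

/* Discard the limits unless one fs block lies within [awu_min, awu_max]. */
static inline void awu_validate(struct awu_limits *dt, unsigned int blocksize)
{
	if (((uint64_t)dt->awu_min << BBSHIFT) > blocksize ||
	    ((uint64_t)dt->awu_max << BBSHIFT) < blocksize) {
		dt->awu_min = 0;
		dt->awu_max = 0;
	}
}

/* Alignment unit in fs blocks: the larger of dsunit and awu_max. */
static inline int ag_align_unit(const struct awu_limits *dt, int dsunit,
				unsigned int blocklog)
{
	int awu_fsb;

	assert(blocklog >= BBSHIFT);
	awu_fsb = dt->awu_max >> (blocklog - BBSHIFT);
	return awu_fsb > dsunit ? awu_fsb : dsunit;
}
```

With a 4KiB block size, a device advertising 4KiB to 64KiB atomic writes
would yield an alignment unit of 16 fs blocks when no stripe unit is set,
while a device whose maximum is below 4KiB would have both limits zeroed and
leave dsunit unchanged.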

* Re: [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities
  2025-07-01 18:08   ` [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities Darrick J. Wong
@ 2025-07-02  9:03     ` John Garry
  2025-07-02 19:00       ` Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: John Garry @ 2025-07-02  9:03 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

On 01/07/2025 19:08, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Try to align the AG size to the maximum hardware atomic write unit so
> that we can give users maximum flexibility in choosing an RWF_ATOMIC
> write size.


Regardless of comments below, FWIW:

Reviewed-by: John Garry <john.g.garry@oracle.com>


> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>   libxfs/topology.h |    6 ++++--
>   libxfs/topology.c |   36 ++++++++++++++++++++++++++++++++++++
>   mkfs/xfs_mkfs.c   |   48 +++++++++++++++++++++++++++++++++++++++++++-----
>   3 files changed, 83 insertions(+), 7 deletions(-)
> 
> 
> diff --git a/libxfs/topology.h b/libxfs/topology.h
> index 207a8a7f150556..f0ca65f3576e92 100644
> --- a/libxfs/topology.h
> +++ b/libxfs/topology.h
> @@ -13,8 +13,10 @@
>   struct device_topology {
>   	int	logical_sector_size;	/* logical sector size */
>   	int	physical_sector_size;	/* physical sector size */
> -	int	sunit;		/* stripe unit */
> -	int	swidth;		/* stripe width  */
> +	int	sunit;			/* stripe unit */
> +	int	swidth;			/* stripe width  */
> +	int	awu_min;		/* min atomic write unit in bbcounts */

awu_min is not really used, but, like the kernel code does, I suppose 
useful to store it

> +	int	awu_max;		/* max atomic write unit in bbcounts */
>   };
>   
>   struct fs_topology {
> diff --git a/libxfs/topology.c b/libxfs/topology.c
> index 96ee74b61b30f5..7764687beac000 100644
> --- a/libxfs/topology.c
> +++ b/libxfs/topology.c
> @@ -4,11 +4,18 @@
>    * All Rights Reserved.
>    */
>   
> +#ifdef OVERRIDE_SYSTEM_STATX
> +#define statx sys_statx
> +#endif
> +#include <fcntl.h>
> +#include <sys/stat.h>
> +
>   #include "libxfs_priv.h"
>   #include "libxcmd.h"
>   #include <blkid/blkid.h>
>   #include "xfs_multidisk.h"
>   #include "libfrog/platform.h"
> +#include "libfrog/statx.h"
>   
>   #define TERABYTES(count, blog)	((uint64_t)(count) << (40 - (blog)))
>   #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
> @@ -278,6 +285,34 @@ blkid_get_topology(
>   		device);
>   }
>   
> +static void
> +get_hw_atomic_writes_topology(
> +	struct libxfs_dev	*dev,
> +	struct device_topology	*dt)
> +{
> +	struct statx		sx;
> +	int			fd;
> +	int			ret;
> +
> +	fd = open(dev->name, O_RDONLY);
> +	if (fd < 0)
> +		return;
> +
> +	ret = statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &sx);
> +	if (ret)
> +		goto out_close;
> +
> +	if (!(sx.stx_mask & STATX_WRITE_ATOMIC))
> +		goto out_close;
> +
> +	dt->awu_min = sx.stx_atomic_write_unit_min >> 9;
> +	dt->awu_max = max(sx.stx_atomic_write_unit_max_opt,
> +			  sx.stx_atomic_write_unit_max) >> 9;

for a bdev, stx_atomic_write_unit_max_opt should be zero

However, I suppose some bdev could have hybrid atomic write support, 
just like xfs, so what you are doing looks good



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities
  2025-07-02  9:03     ` John Garry
@ 2025-07-02 19:00       ` Darrick J. Wong
  0 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-02 19:00 UTC (permalink / raw)
  To: John Garry; +Cc: aalbersh, catherine.hoang, linux-xfs

On Wed, Jul 02, 2025 at 10:03:54AM +0100, John Garry wrote:
> On 01/07/2025 19:08, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Try to align the AG size to the maximum hardware atomic write unit so
> > that we can give users maximum flexibility in choosing an RWF_ATOMIC
> > write size.
> 
> 
> Regardless of comments below, FWIW:
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
> 
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> > ---
> >   libxfs/topology.h |    6 ++++--
> >   libxfs/topology.c |   36 ++++++++++++++++++++++++++++++++++++
> >   mkfs/xfs_mkfs.c   |   48 +++++++++++++++++++++++++++++++++++++++++++-----
> >   3 files changed, 83 insertions(+), 7 deletions(-)
> > 
> > 
> > diff --git a/libxfs/topology.h b/libxfs/topology.h
> > index 207a8a7f150556..f0ca65f3576e92 100644
> > --- a/libxfs/topology.h
> > +++ b/libxfs/topology.h
> > @@ -13,8 +13,10 @@
> >   struct device_topology {
> >   	int	logical_sector_size;	/* logical sector size */
> >   	int	physical_sector_size;	/* physical sector size */
> > -	int	sunit;		/* stripe unit */
> > -	int	swidth;		/* stripe width  */
> > +	int	sunit;			/* stripe unit */
> > +	int	swidth;			/* stripe width  */
> > +	int	awu_min;		/* min atomic write unit in bbcounts */
> 
> awu_min is not really used, but, like the kernel code does, I suppose useful
> to store it
> 
> > +	int	awu_max;		/* max atomic write unit in bbcounts */
> >   };
> >   struct fs_topology {
> > diff --git a/libxfs/topology.c b/libxfs/topology.c
> > index 96ee74b61b30f5..7764687beac000 100644
> > --- a/libxfs/topology.c
> > +++ b/libxfs/topology.c
> > @@ -4,11 +4,18 @@
> >    * All Rights Reserved.
> >    */
> > +#ifdef OVERRIDE_SYSTEM_STATX
> > +#define statx sys_statx
> > +#endif
> > +#include <fcntl.h>
> > +#include <sys/stat.h>
> > +
> >   #include "libxfs_priv.h"
> >   #include "libxcmd.h"
> >   #include <blkid/blkid.h>
> >   #include "xfs_multidisk.h"
> >   #include "libfrog/platform.h"
> > +#include "libfrog/statx.h"
> >   #define TERABYTES(count, blog)	((uint64_t)(count) << (40 - (blog)))
> >   #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
> > @@ -278,6 +285,34 @@ blkid_get_topology(
> >   		device);
> >   }
> > +static void
> > +get_hw_atomic_writes_topology(
> > +	struct libxfs_dev	*dev,
> > +	struct device_topology	*dt)
> > +{
> > +	struct statx		sx;
> > +	int			fd;
> > +	int			ret;
> > +
> > +	fd = open(dev->name, O_RDONLY);
> > +	if (fd < 0)
> > +		return;
> > +
> > +	ret = statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &sx);
> > +	if (ret)
> > +		goto out_close;
> > +
> > +	if (!(sx.stx_mask & STATX_WRITE_ATOMIC))
> > +		goto out_close;
> > +
> > +	dt->awu_min = sx.stx_atomic_write_unit_min >> 9;
> > +	dt->awu_max = max(sx.stx_atomic_write_unit_max_opt,
> > +			  sx.stx_atomic_write_unit_max) >> 9;
> 
> for a bdev, stx_atomic_write_unit_max_opt should be zero
> 
> However, I suppose some bdev could have hybrid atomic write support, just
> like xfs, so what you are doing looks good

Yeah, it's unlikely ever to happen but if (say) you had a loop device
backed by an xfs file then maybe it'd be useful to pass through both atomic
write maxima.

Thanks for the review.

--D

^ permalink raw reply	[flat|nested] 33+ messages in thread
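
The next patch in the series (7/7) relies on two power-of-two helpers:
rounding a value down to a power of two, and finding the largest power of two
that divides a value.  The sketch below shows portable equivalents for
illustration only; rounddown_pow2() and pow2_factor() are hypothetical names,
and they are written with a loop and a mask rather than the fls_long() and
ffs() calls the patch uses.

```c
#include <assert.h>

/* Largest power of two that is <= n; n must be nonzero. */
static inline unsigned long rounddown_pow2(unsigned long n)
{
	unsigned long p = 1;

	assert(n != 0);
	/* Shift n down one bit at a time; p tracks its highest set bit. */
	while (n >>= 1)
		p <<= 1;
	return p;
}

/* Largest power of two that divides nr; nr must be nonzero. */
static inline unsigned int pow2_factor(unsigned int nr)
{
	assert(nr != 0);
	/* Isolate the lowest set bit of nr. */
	return nr & (~nr + 1);
}
```

As an example of the second helper, an allocation group of 1000 blocks has a
greatest power-of-two factor of 8, so under the rule described in the patch
comments a hardware atomic write on such a filesystem would be limited to 8
blocks.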

* [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
                     ` (5 preceding siblings ...)
  2025-07-01 18:08   ` [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities Darrick J. Wong
@ 2025-07-01 18:08   ` Darrick J. Wong
  2025-07-02  8:50     ` John Garry
  6 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:08 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Allow callers of mkfs.xfs to specify a desired maximum atomic write
size.  This value will cause the log size to be adjusted to support
software atomic writes, and the AG size to be aligned to support
hardware atomic writes.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 include/bitops.h         |   12 +++
 libxfs/libxfs_api_defs.h |    1 
 man/man8/mkfs.xfs.8.in   |    7 ++
 mkfs/xfs_mkfs.c          |  194 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 213 insertions(+), 1 deletion(-)


diff --git a/include/bitops.h b/include/bitops.h
index 1f1adceccf5d2b..d0c55827044e54 100644
--- a/include/bitops.h
+++ b/include/bitops.h
@@ -113,4 +113,16 @@ static inline int lowbit64(uint64_t v)
 	return n - 1;
 }
 
+/**
+ * __rounddown_pow_of_two() - round down to nearest power of two
+ * @n: value to round down
+ */
+static inline __attribute__((const))
+unsigned long __rounddown_pow_of_two(unsigned long n)
+{
+	return 1UL << (fls_long(n) - 1);
+}
+
+#define rounddown_pow_of_two(n) __rounddown_pow_of_two(n)
+
 #endif
diff --git a/libxfs/libxfs_api_defs.h b/libxfs/libxfs_api_defs.h
index 4bd02c57b496e6..fe00e19bada9d8 100644
--- a/libxfs/libxfs_api_defs.h
+++ b/libxfs/libxfs_api_defs.h
@@ -107,6 +107,7 @@
 #define xfs_buftarg_drain		libxfs_buftarg_drain
 #define xfs_bunmapi			libxfs_bunmapi
 #define xfs_bwrite			libxfs_bwrite
+#define xfs_calc_atomic_write_log_geometry	libxfs_calc_atomic_write_log_geometry
 #define xfs_calc_dquots_per_chunk	libxfs_calc_dquots_per_chunk
 #define xfs_calc_finish_bui_reservation	libxfs_calc_finish_bui_reservation
 #define xfs_calc_finish_cui_reservation	libxfs_calc_finish_cui_reservation
diff --git a/man/man8/mkfs.xfs.8.in b/man/man8/mkfs.xfs.8.in
index bc80493187f6f9..5f59d4b2da6e02 100644
--- a/man/man8/mkfs.xfs.8.in
+++ b/man/man8/mkfs.xfs.8.in
@@ -742,6 +742,13 @@ .SH OPTIONS
 directories, symbolic links, and realtime metadata files.
 This feature is disabled by default.
 This feature is only available for filesystems formatted with -m crc=1.
+.TP
+.BI max_atomic_write[= value]
+When enabled, application programs can use the RWF_ATOMIC write flag to
+persist changes of up to this size without tearing.
+The default is chosen to allow a reasonable amount of scalability.
+This value must also be passed via mount option.
+This feature is only available for filesystems formatted with reflink.
 .RE
 .PP
 .PD 0
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 7d3e9dd567b7b2..fc769df22357a6 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -94,6 +94,7 @@ enum {
 	I_SPINODES,
 	I_NREXT64,
 	I_EXCHANGE,
+	I_MAX_ATOMIC_WRITE,
 	I_MAX_OPTS,
 };
 
@@ -489,6 +490,7 @@ static struct opt_params iopts = {
 		[I_SPINODES] = "sparse",
 		[I_NREXT64] = "nrext64",
 		[I_EXCHANGE] = "exchange",
+		[I_MAX_ATOMIC_WRITE] = "max_atomic_write",
 		[I_MAX_OPTS] = NULL,
 	},
 	.subopt_params = {
@@ -550,6 +552,13 @@ static struct opt_params iopts = {
 		  .maxval = 1,
 		  .defaultval = 1,
 		},
+		{ .index = I_MAX_ATOMIC_WRITE,
+		  .conflicts = { { NULL, LAST_CONFLICT } },
+		  .convert = true,
+		  .minval = 1,
+		  .maxval = 1ULL << 30, /* 1GiB */
+		  .defaultval = SUBOPT_NEEDS_VAL,
+		},
 	},
 };
 
@@ -1069,6 +1078,7 @@ struct cli_params {
 	char	*rtsize;
 	char	*rtstart;
 	uint64_t rtreserved;
+	char	*max_atomic_write;
 
 	/* parameters where 0 is a valid CLI value */
 	int	dsunit;
@@ -1157,6 +1167,8 @@ struct mkfs_params {
 	struct sb_feat_args	sb_feat;
 	uint64_t	rtstart;
 	uint64_t	rtreserved;
+
+	uint64_t	max_atomic_write;
 };
 
 /*
@@ -1197,7 +1209,7 @@ usage( void )
 /* force overwrite */	[-f]\n\
 /* inode size */	[-i perblock=n|size=num,maxpct=n,attr=0|1|2,\n\
 			    projid32bit=0|1,sparse=0|1,nrext64=0|1,\n\
-			    exchange=0|1]\n\
+			    exchange=0|1,max_atomic_write=n]\n\
 /* no discard */	[-K]\n\
 /* log subvol */	[-l agnum=n,internal,size=num,logdev=xxx,version=n\n\
 			    sunit=value|su=num,sectsize=num,lazy-count=0|1,\n\
@@ -1927,6 +1939,9 @@ inode_opts_parser(
 	case I_EXCHANGE:
 		cli->sb_feat.exchrange = getnum(value, opts, subopt);
 		break;
+	case I_MAX_ATOMIC_WRITE:
+		cli->max_atomic_write = getstr(value, opts, subopt);
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -4092,6 +4107,18 @@ align_ag_geometry(
 		dsunit = max(DTOBT(ft->data.awu_max, cfg->blocklog),
 				dsunit);
 
+	/*
+	 * If the user gave us a maximum atomic write size that is less than
+	 * a whole AG, try to align the AG size to that value.
+	 */
+	if (cfg->max_atomic_write > 0) {
+		xfs_extlen_t	max_atomic_fsbs =
+			cfg->max_atomic_write >> cfg->blocklog;
+
+		if (max_atomic_fsbs < cfg->agsize)
+			dsunit = max(dsunit, max_atomic_fsbs);
+	}
+
 	if (!dsunit)
 		goto validate;
 
@@ -4971,6 +4998,140 @@ calc_concurrency_logblocks(
 	return logblocks;
 }
 
+#define MAX_RW_COUNT (INT_MAX & ~(getpagesize() - 1))
+
+/* Maximum atomic write IO size that the kernel allows. */
+static inline xfs_extlen_t calc_atomic_write_max(struct mkfs_params *cfg)
+{
+	return rounddown_pow_of_two(MAX_RW_COUNT >> cfg->blocklog);
+}
+
+static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
+{
+	return 1 << (ffs(nr) - 1);
+}
+
+/*
+ * If the data device advertises atomic write support, limit the size of data
+ * device atomic writes to the greatest power-of-two factor of the AG size so
+ * that every atomic write unit aligns with the start of every AG.  This is
+ * required so that the per-AG allocations for an atomic write will always be
+ * aligned compatibly with the alignment requirements of the storage.
+ *
+ * If the data device doesn't advertise atomic writes, then there are no
+ * alignment restrictions and the largest out-of-place write we can do
+ * ourselves is the number of blocks that user files can allocate from any AG.
+ */
+static inline xfs_extlen_t
+calc_perag_awu_max(
+	struct mkfs_params	*cfg,
+	struct fs_topology	*ft)
+{
+	if (ft->data.awu_min > 0)
+		return max_pow_of_two_factor(cfg->agsize);
+	return cfg->agsize;
+}
+
+/*
+ * Reflink on the realtime device requires rtgroups, and atomic writes require
+ * reflink.
+ *
+ * If the realtime device advertises atomic write support, limit the size of
+ * data device atomic writes to the greatest power-of-two factor of the rtgroup
+ * size so that every atomic write unit aligns with the start of every rtgroup.
+ * This is required so that the per-rtgroup allocations for an atomic write
+ * will always be aligned compatibly with the alignment requirements of the
+ * storage.
+ *
+ * If the rt device doesn't advertise atomic writes, then there are no
+ * alignment restrictions and the largest out-of-place write we can do
+ * ourselves is the number of blocks that user files can allocate from any
+ * rtgroup.
+ */
+static inline xfs_extlen_t
+calc_rtgroup_awu_max(
+	struct mkfs_params	*cfg,
+	struct fs_topology	*ft)
+{
+	if (ft->rt.awu_min > 0)
+		return max_pow_of_two_factor(cfg->rgsize);
+	return cfg->rgsize;
+}
+
+/*
+ * Validate the maximum atomic out of place write size passed in by the user.
+ */
+static void
+validate_max_atomic_write(
+	struct mkfs_params	*cfg,
+	struct cli_params	*cli,
+	struct fs_topology	*ft,
+	struct xfs_mount	*mp)
+{
+	const xfs_extlen_t	max_write = calc_atomic_write_max(cfg);
+	xfs_filblks_t		max_atomic_fsbcount;
+
+	cfg->max_atomic_write = getnum(cli->max_atomic_write, &iopts,
+			I_MAX_ATOMIC_WRITE);
+	max_atomic_fsbcount = cfg->max_atomic_write >> cfg->blocklog;
+
+	/* generic_atomic_write_valid enforces power of two length */
+	if (!is_power_of_2(cfg->max_atomic_write)) {
+		fprintf(stderr,
+ _("Max atomic write size of %llu bytes is not a power of 2\n"),
+			(unsigned long long)cfg->max_atomic_write);
+		exit(1);
+	}
+
+	if (cfg->max_atomic_write % cfg->blocksize) {
+		fprintf(stderr,
+ _("Max atomic write size of %llu bytes not aligned with fsblock.\n"),
+			(unsigned long long)cfg->max_atomic_write);
+		exit(1);
+	}
+
+	if (max_atomic_fsbcount > max_write) {
+		fprintf(stderr,
+ _("Max atomic write size of %lluk cannot be larger than max write size %lluk.\n"),
+			(unsigned long long)cfg->max_atomic_write >> 10,
+			(unsigned long long)max_write << (cfg->blocklog - 10));
+		exit(1);
+	}
+}
+
+/*
+ * Validate the maximum atomic out of place write size passed in by the user
+ * actually works with the allocation groups sizes.
+ */
+static void
+validate_max_atomic_write_ags(
+	struct mkfs_params	*cfg,
+	struct fs_topology	*ft,
+	struct xfs_mount	*mp)
+{
+	const xfs_extlen_t	max_group = max(cfg->agsize, cfg->rgsize);
+	const xfs_extlen_t	max_group_write =
+		max(calc_perag_awu_max(cfg, ft), calc_rtgroup_awu_max(cfg, ft));
+	xfs_filblks_t		max_atomic_fsbcount =
+		XFS_B_TO_FSBT(mp, cfg->max_atomic_write);
+
+	if (max_atomic_fsbcount > max_group) {
+		fprintf(stderr,
+ _("Max atomic write size of %lluk cannot be larger than allocation group size %lluk.\n"),
+			(unsigned long long)cfg->max_atomic_write >> 10,
+			(unsigned long long)XFS_FSB_TO_B(mp, max_group) >> 10);
+		exit(1);
+	}
+
+	if (max_atomic_fsbcount > max_group_write) {
+		fprintf(stderr,
+ _("Max atomic write size of %lluk cannot be larger than max allocation group write size %lluk.\n"),
+			(unsigned long long)cfg->max_atomic_write >> 10,
+			(unsigned long long)XFS_FSB_TO_B(mp, max_group_write) >> 10);
+		exit(1);
+	}
+}
+
 static void
 calculate_log_size(
 	struct mkfs_params	*cfg,
@@ -4996,6 +5157,22 @@ calculate_log_size(
 		libxfs_log_get_max_trans_res(&mount, &res);
 		max_tx_bytes = res.tr_logres * res.tr_logcount;
 	}
+	if (cfg->max_atomic_write > 0) {
+		unsigned int	dontcare;
+		xfs_extlen_t	atomic_min_logblocks =
+			libxfs_calc_atomic_write_log_geometry(&mount,
+					cfg->max_atomic_write >> cfg->blocklog,
+					&dontcare);
+
+		if (!atomic_min_logblocks) {
+			fprintf(stderr,
+ _("atomic write size %lluk is too big for the log to handle.\n"),
+				(unsigned long long)cfg->max_atomic_write >> 10);
+			exit(1);
+		}
+
+		min_logblocks = max(min_logblocks, atomic_min_logblocks);
+	}
 	libxfs_umount(&mount);
 
 	ASSERT(min_logblocks);
@@ -5923,6 +6100,13 @@ main(
 	validate_rtdev(&cfg, &cli, &zt);
 	calc_stripe_factors(&cfg, &cli, &ft);
 
+	/*
+	 * Now that we have basic geometry set up, we can validate the CLI
+	 * max atomic write parameter.
+	 */
+	if (cli.max_atomic_write)
+		validate_max_atomic_write(&cfg, &cli, &ft, mp);
+
 	/*
 	 * At this point when know exactly what size all the devices are,
 	 * so we can start validating and calculating layout options that are
@@ -5946,6 +6130,14 @@ main(
 	start_superblock_setup(&cfg, mp, sbp);
 	initialise_mount(mp, sbp);
 
+	/*
+	 * Now that we have computed the allocation group geometry, we can
+	 * continue validating the maximum software atomic write parameter, if
+	 * one was given.
+	 */
+	if (cfg.max_atomic_write)
+		validate_max_atomic_write_ags(&cfg, &ft, mp);
+
 	/*
 	 * With the mount set up, we can finally calculate the log size
 	 * constraints and do default size calculations and final validation


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size
  2025-07-01 18:08   ` [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size Darrick J. Wong
@ 2025-07-02  8:50     ` John Garry
  2025-07-02 19:01       ` Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: John Garry @ 2025-07-02  8:50 UTC (permalink / raw)
  To: Darrick J. Wong, aalbersh; +Cc: catherine.hoang, linux-xfs

On 01/07/2025 19:08, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Allow callers of mkfs.xfs to specify a desired maximum atomic write
> size.  This value will cause the log size to be adjusted to support
> software atomic writes, and the AG size to be aligned to support
> hardware atomic writes.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

thanks, regardless of comments below, FWIW:

Reviewed-by: John Garry <john.g.garry@oracle.com>

>   		goto validate;
>   
> @@ -4971,6 +4998,140 @@ calc_concurrency_logblocks(
>   	return logblocks;
>   }
>   
> +#define MAX_RW_COUNT (INT_MAX & ~(getpagesize() - 1))
> +
> +/* Maximum atomic write IO size that the kernel allows. */

FWIW, statx atomic write unit max is a 32b value, so we get a 2GB limit 
just from that factor

> +static inline xfs_extlen_t calc_atomic_write_max(struct mkfs_params *cfg)
> +{
> +	return rounddown_pow_of_two(MAX_RW_COUNT >> cfg->blocklog);
> +}
> +
> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> +{
> +	return 1 << (ffs(nr) - 1);
> +}
> +
> +/*
> + * If the data device advertises atomic write support, limit the size of data
> + * device atomic writes to the greatest power-of-two factor of the AG size so
> + * that every atomic write unit aligns with the start of every AG.  This is
> + * required so that the per-AG allocations for an atomic write will always be
> + * aligned compatibly with the alignment requirements of the storage.
> + *
> + * If the data device doesn't advertise atomic writes, then there are no
> + * alignment restrictions and the largest out-of-place write we can do
> + * ourselves is the number of blocks that user files can allocate from any AG.
> + */
> +static inline xfs_extlen_t
> +calc_perag_awu_max(
> +	struct mkfs_params	*cfg,
> +	struct fs_topology	*ft)
> +{
> +	if (ft->data.awu_min > 0)
> +		return max_pow_of_two_factor(cfg->agsize);
> +	return cfg->agsize;

out of curiosity, for out-of-place atomic writes, is there anything to 
stop the blocks being allocated across multiple AGs?



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size
  2025-07-02  8:50     ` John Garry
@ 2025-07-02 19:01       ` Darrick J. Wong
  0 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-02 19:01 UTC (permalink / raw)
  To: John Garry; +Cc: aalbersh, catherine.hoang, linux-xfs

On Wed, Jul 02, 2025 at 09:50:04AM +0100, John Garry wrote:
> On 01/07/2025 19:08, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Allow callers of mkfs.xfs to specify a desired maximum atomic write
> > size.  This value will cause the log size to be adjusted to support
> > software atomic writes, and the AG size to be aligned to support
> > hardware atomic writes.
> > 
> > Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> thanks, regardless of comments below, FWIW:
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
> >   		goto validate;
> > @@ -4971,6 +4998,140 @@ calc_concurrency_logblocks(
> >   	return logblocks;
> >   }
> > +#define MAX_RW_COUNT (INT_MAX & ~(getpagesize() - 1))
> > +
> > +/* Maximum atomic write IO size that the kernel allows. */
> 
> FWIW, statx atomic write unit max is a 32b value, so we get a 2GB limit just
> from that factor

<nod> But we might as well mirror the kernel's calculations...

> > +static inline xfs_extlen_t calc_atomic_write_max(struct mkfs_params *cfg)
> > +{
> > +	return rounddown_pow_of_two(MAX_RW_COUNT >> cfg->blocklog);
> > +}
> > +
> > +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> > +{
> > +	return 1 << (ffs(nr) - 1);
> > +}
> > +
> > +/*
> > + * If the data device advertises atomic write support, limit the size of data
> > + * device atomic writes to the greatest power-of-two factor of the AG size so
> > + * that every atomic write unit aligns with the start of every AG.  This is
> > + * required so that the per-AG allocations for an atomic write will always be
> > + * aligned compatibly with the alignment requirements of the storage.
> > + *
> > + * If the data device doesn't advertise atomic writes, then there are no
> > + * alignment restrictions and the largest out-of-place write we can do
> > + * ourselves is the number of blocks that user files can allocate from any AG.
> > + */
> > +static inline xfs_extlen_t
> > +calc_perag_awu_max(
> > +	struct mkfs_params	*cfg,
> > +	struct fs_topology	*ft)
> > +{
> > +	if (ft->data.awu_min > 0)
> > +		return max_pow_of_two_factor(cfg->agsize);
> > +	return cfg->agsize;
> 
> out of curiosity, for out-of-place atomic writes, is there anything to stop
> the blocks being allocated across multiple AGs?

Nope.  But they'll at least get the software fallback, same as if they
were writing to a severely fragmented filesystem.

--D


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCHSET 3/3] xfs_scrub: drop EXPERIMENTAL warning
  2025-07-01 18:03 [PATCHBOMB] xfsprogs: ports and new code for 6.16 Darrick J. Wong
  2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
  2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
@ 2025-07-01 18:05 ` Darrick J. Wong
  2025-07-01 18:09   ` [PATCH 1/1] xfs_scrub: remove EXPERIMENTAL warnings Darrick J. Wong
  2 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:05 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: linux-xfs

Hi all,

Drop the EXPERIMENTAL warning on xfs_scrub now that we've done so for the
kernel.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-6.16
---
Commits in this patchset:
 * xfs_scrub: remove EXPERIMENTAL warnings
---
 man/man8/xfs_scrub.8 |    6 ------
 scrub/xfs_scrub.c    |    3 ---
 2 files changed, 9 deletions(-)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/1] xfs_scrub: remove EXPERIMENTAL warnings
  2025-07-01 18:05 ` [PATCHSET 3/3] xfs_scrub: drop EXPERIMENTAL warning Darrick J. Wong
@ 2025-07-01 18:09   ` Darrick J. Wong
  2025-07-03 13:45     ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-01 18:09 UTC (permalink / raw)
  To: djwong, aalbersh; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The kernel code for online fsck has been stable for a year, and there
haven't been any major changes to the program in quite some time, so
let's drop the experimental warnings.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 man/man8/xfs_scrub.8 |    6 ------
 scrub/xfs_scrub.c    |    3 ---
 2 files changed, 9 deletions(-)


diff --git a/man/man8/xfs_scrub.8 b/man/man8/xfs_scrub.8
index 1ed4b176b6a35a..8b2a990831a2cc 100644
--- a/man/man8/xfs_scrub.8
+++ b/man/man8/xfs_scrub.8
@@ -13,12 +13,6 @@ .SH DESCRIPTION
 .B xfs_scrub
 attempts to check and repair all metadata in a mounted XFS filesystem.
 .PP
-.B WARNING!
-This program is
-.BR EXPERIMENTAL ","
-which means that its behavior and interface
-could change at any time!
-.PP
 .B xfs_scrub
 asks the kernel to scrub all metadata objects in the filesystem.
 Metadata records are scanned for obviously bad values and then
diff --git a/scrub/xfs_scrub.c b/scrub/xfs_scrub.c
index 90897cc26cd71d..3dba972a7e8d2a 100644
--- a/scrub/xfs_scrub.c
+++ b/scrub/xfs_scrub.c
@@ -730,9 +730,6 @@ main(
 	hist_init(&ctx.datadev_hist);
 	hist_init(&ctx.rtdev_hist);
 
-	fprintf(stdout, "EXPERIMENTAL xfs_scrub program in use! Use at your own risk!\n");
-	fflush(stdout);
-
 	progname = basename(argv[0]);
 	setlocale(LC_ALL, "");
 	bindtextdomain(PACKAGE, LOCALEDIR);


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/1] xfs_scrub: remove EXPERIMENTAL warnings
  2025-07-01 18:09   ` [PATCH 1/1] xfs_scrub: remove EXPERIMENTAL warnings Darrick J. Wong
@ 2025-07-03 13:45     ` Christoph Hellwig
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2025-07-03 13:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: aalbersh, linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCHSET 2/3] xfsprogs: atomic writes
@ 2025-07-15  5:16 Darrick J. Wong
  2025-07-15  5:18 ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
  0 siblings, 1 reply; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:16 UTC (permalink / raw)
  To: aalbersh, djwong
  Cc: john.g.garry, linux-xfs, catherine.hoang, john.g.garry, linux-xfs

Hi all,

Utility updates for dumping atomic writes configuration and formatting
filesystems to take advantage of the new functionality.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-writes-6.16

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-writes-6.16
---
Commits in this patchset:
 * libfrog: move statx.h from io/ to libfrog/
 * xfs_db: create an untorn_max subcommand
 * xfs_io: dump new atomic_write_unit_max_opt statx field
 * mkfs: don't complain about overly large auto-detected log stripe units
 * mkfs: autodetect log stripe unit for external log devices
 * mkfs: try to align AG size based on atomic write capabilities
 * mkfs: allow users to configure the desired maximum atomic write size
---
 include/bitops.h         |   12 ++
 include/libxfs.h         |    1 
 libfrog/statx.h          |   23 ++++
 libxfs/libxfs_api_defs.h |    5 +
 libxfs/topology.h        |    6 +
 db/logformat.c           |  129 ++++++++++++++++++++++++
 io/stat.c                |   21 +---
 libfrog/Makefile         |    1 
 libxfs/topology.c        |   36 +++++++
 m4/package_libcdev.m4    |    2 
 man/man8/mkfs.xfs.8.in   |    7 +
 man/man8/xfs_db.8        |   10 ++
 mkfs/xfs_mkfs.c          |  251 +++++++++++++++++++++++++++++++++++++++++++++-
 13 files changed, 474 insertions(+), 30 deletions(-)
 rename io/statx.h => libfrog/statx.h (94%)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units
  2025-07-15  5:16 [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
@ 2025-07-15  5:18 ` Darrick J. Wong
  0 siblings, 0 replies; 33+ messages in thread
From: Darrick J. Wong @ 2025-07-15  5:18 UTC (permalink / raw)
  To: aalbersh, djwong
  Cc: john.g.garry, linux-xfs, catherine.hoang, john.g.garry, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If mkfs declines to apply what it thinks is an overly large data device
stripe unit to the log device, it should only log a message about that
if the lsunit parameter was actually supplied by the caller.  It should
not do that when the lsunit was autodetected from the block devices.

The cli parameters are zero-initialized in main and always have been.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
---
 mkfs/xfs_mkfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 812241c49a5494..8b946f3ef817da 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3629,7 +3629,7 @@ _("log stripe unit (%d) must be a multiple of the block size (%d)\n"),
 	if (cfg->sb_feat.log_version == 2 &&
 	    cfg->lsunit * cfg->blocksize > 256 * 1024) {
 		/* Warn only if specified on commandline */
-		if (cli->lsu || cli->lsunit != -1) {
+		if (cli->lsu || cli->lsunit) {
 			fprintf(stderr,
 _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
   "log stripe unit adjusted to 32KiB\n"),


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-07-15  5:18 UTC | newest]

Thread overview: 33+ messages
2025-07-01 18:03 [PATCHBOMB] xfsprogs: ports and new code for 6.16 Darrick J. Wong
2025-07-01 18:04 ` [PATCHSET 1/3] xfsprogs: new libxfs code from kernel 6.16 Darrick J. Wong
2025-07-01 18:05   ` [PATCH 1/6] xfs: add helpers to compute transaction reservation for finishing intent items Darrick J. Wong
2025-07-01 18:05   ` [PATCH 2/6] xfs: allow block allocator to take an alignment hint Darrick J. Wong
2025-07-01 18:06   ` [PATCH 3/6] xfs: commit CoW-based atomic writes atomically Darrick J. Wong
2025-07-01 18:06   ` [PATCH 4/6] libxfs: add helpers to compute log item overhead Darrick J. Wong
2025-07-01 18:06   ` [PATCH 5/6] xfs: add xfs_calc_atomic_write_unit_max() Darrick J. Wong
2025-07-01 18:06   ` [PATCH 6/6] xfs: allow sysadmins to specify a maximum atomic write limit at mount time Darrick J. Wong
2025-07-01 18:05 ` [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
2025-07-01 18:07   ` [PATCH 1/7] libfrog: move statx.h from io/ to libfrog/ Darrick J. Wong
2025-07-02  9:13     ` John Garry
2025-07-01 18:07   ` [PATCH 2/7] xfs_db: create an untorn_max subcommand Darrick J. Wong
2025-07-09 15:39     ` John Garry
2025-07-09 16:35       ` Darrick J. Wong
2025-07-01 18:07   ` [PATCH 3/7] xfs_io: dump new atomic_write_unit_max_opt statx field Darrick J. Wong
2025-07-02  8:23     ` John Garry
2025-07-01 18:07   ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
2025-07-08 14:38     ` John Garry
2025-07-08 15:05       ` Darrick J. Wong
2025-07-08 15:07         ` John Garry
2025-07-01 18:08   ` [PATCH 5/7] mkfs: autodetect log stripe unit for external log devices Darrick J. Wong
2025-07-09 15:57     ` John Garry
2025-07-09 16:45       ` Darrick J. Wong
2025-07-01 18:08   ` [PATCH 6/7] mkfs: try to align AG size based on atomic write capabilities Darrick J. Wong
2025-07-02  9:03     ` John Garry
2025-07-02 19:00       ` Darrick J. Wong
2025-07-01 18:08   ` [PATCH 7/7] mkfs: allow users to configure the desired maximum atomic write size Darrick J. Wong
2025-07-02  8:50     ` John Garry
2025-07-02 19:01       ` Darrick J. Wong
2025-07-01 18:05 ` [PATCHSET 3/3] xfs_scrub: drop EXPERIMENTAL warning Darrick J. Wong
2025-07-01 18:09   ` [PATCH 1/1] xfs_scrub: remove EXPERIMENTAL warnings Darrick J. Wong
2025-07-03 13:45     ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2025-07-15  5:16 [PATCHSET 2/3] xfsprogs: atomic writes Darrick J. Wong
2025-07-15  5:18 ` [PATCH 4/7] mkfs: don't complain about overly large auto-detected log stripe units Darrick J. Wong
