* [PATCH v7 00/14] large atomic writes for xfs
@ 2025-04-15 12:14 John Garry
  2025-04-15 12:14 ` [PATCH v7 01/14] fs: add atomic write unit max opt to statx John Garry
                   ` (13 more replies)
  0 siblings, 14 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Currently atomic write support for xfs is limited to writing a single
block, as we have no way to guarantee that the write is aligned and that
it covers a single extent.

This series introduces a software-based method for issuing atomic
writes.

The software-based method is used as a fallback when an atomic write is
attempted over misaligned or multiple extents.

For xfs, this support is based on reflink CoW support.

The basic idea of this CoW method is to allocate a range in the CoW fork,
write the data, and atomically update the mapping.

Initial MySQL performance testing has shown this method to perform ok.
However, that testing only used 16K atomic writes (with a 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.
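
For reference, here is a minimal sketch (illustration only, not part of
this series) of how such a 16K atomic write could be issued from
userspace. It assumes an O_DIRECT file descriptor on xfs, a buffer
aligned for direct I/O, and headers new enough to define RWF_ATOMIC:

  #define _GNU_SOURCE
  #include <sys/uio.h>

  /*
   * Sketch: write 16K atomically with RWF_ATOMIC.  The length must lie
   * within the statx-reported atomic write unit min/max for the file,
   * and buf must satisfy O_DIRECT alignment.
   */
  static ssize_t atomic_write_16k(int fd, void *buf, off_t offset)
  {
          struct iovec iov = { .iov_base = buf, .iov_len = 16384 };

          return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
  }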

For other FSes which want large atomic writes and don't support CoW, I
think that they can follow the example in [0].

Catherine is currently working on further xfstests for this feature,
which we hope to share soon.

Based on 8ffd015db85f (tag: v6.15-rc2, xfs/xfs-6.16-merge,
xfs/xfs-6.15-fixes, xfs/for-next) Linux 6.15-rc2

[0] https://lore.kernel.org/linux-xfs/20250310183946.932054-1-john.g.garry@oracle.com/

Differences to v6:
- log item size updates (Darrick)
- rtvol support (Darrick)
- mount option for atomic writes (Darrick)
- Add RB tags from Darrick and Christoph (Thanks!)

Differences to v5:
- Add statx unit_max_opt (Christoph, me)
- Add xfs_atomic_write_cow_iomap_begin() (Christoph)
- drop old mechanical changes
- limit atomic write max according to CoW-based atomic write max (Christoph)
- Add xfs_compute_atomic_write_unit_max()
- this contains changes for limiting awu max according to max
  transaction log items (Darrick)
- use -ENOPROTOOPT for fallback (Christoph)
- rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite()
- rework various code comments (Christoph)
- limit CoW-based atomic write to log size and add helpers (Darrick)
- drop IOMAP_DIO_FORCE_WAIT usage in xfs_file_dio_write_atomic()
- Add RB tags from Christoph (thanks!)

Darrick J. Wong (3):
  xfs: add helpers to compute log item overhead
  xfs: add helpers to compute transaction reservation for finishing
    intent items
  xfs: allow sysadmins to specify a maximum atomic write limit at mount
    time

John Garry (11):
  fs: add atomic write unit max opt to statx
  xfs: rename xfs_inode_can_atomicwrite() ->
    xfs_inode_can_hw_atomicwrite()
  xfs: allow block allocator to take an alignment hint
  xfs: refactor xfs_reflink_end_cow_extent()
  xfs: refine atomic write size check in xfs_file_write_iter()
  xfs: add xfs_atomic_write_cow_iomap_begin()
  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  xfs: commit CoW-based atomic writes atomically
  xfs: add xfs_file_dio_write_atomic()
  xfs: add xfs_compute_atomic_write_unit_max()
  xfs: update atomic write limits

 Documentation/admin-guide/xfs.rst |   8 +
 block/bdev.c                      |   3 +-
 fs/ext4/inode.c                   |   2 +-
 fs/stat.c                         |   6 +-
 fs/xfs/libxfs/xfs_bmap.c          |   5 +
 fs/xfs/libxfs/xfs_bmap.h          |   6 +-
 fs/xfs/libxfs/xfs_trans_resv.c    | 315 +++++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_trans_resv.h    |  22 +++
 fs/xfs/xfs_bmap_item.c            |  10 +
 fs/xfs/xfs_bmap_item.h            |   3 +
 fs/xfs/xfs_buf_item.c             |  19 ++
 fs/xfs/xfs_buf_item.h             |   3 +
 fs/xfs/xfs_extfree_item.c         |  10 +
 fs/xfs/xfs_extfree_item.h         |   3 +
 fs/xfs/xfs_file.c                 |  87 ++++++++-
 fs/xfs/xfs_inode.h                |   2 +-
 fs/xfs/xfs_iomap.c                | 191 +++++++++++++++++-
 fs/xfs/xfs_iomap.h                |   1 +
 fs/xfs/xfs_iops.c                 |  77 +++++++-
 fs/xfs/xfs_iops.h                 |   3 +
 fs/xfs/xfs_log_cil.c              |   4 +-
 fs/xfs/xfs_log_priv.h             |  13 ++
 fs/xfs/xfs_mount.c                |  86 ++++++++
 fs/xfs/xfs_mount.h                |  11 ++
 fs/xfs/xfs_refcount_item.c        |  10 +
 fs/xfs/xfs_refcount_item.h        |   3 +
 fs/xfs/xfs_reflink.c              | 143 +++++++++++---
 fs/xfs/xfs_reflink.h              |   6 +
 fs/xfs/xfs_rmap_item.c            |  10 +
 fs/xfs/xfs_rmap_item.h            |   3 +
 fs/xfs/xfs_super.c                |  28 ++-
 fs/xfs/xfs_trace.h                | 115 +++++++++++
 include/linux/fs.h                |   3 +-
 include/linux/stat.h              |   1 +
 include/uapi/linux/stat.h         |   8 +-
 35 files changed, 1130 insertions(+), 90 deletions(-)

-- 
2.31.1



* [PATCH v7 01/14] fs: add atomic write unit max opt to statx
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 02/14] xfs: add helpers to compute log item overhead John Garry
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

XFS will be able to support large atomic writes (atomic write > 1x block)
in the future. This will be achieved by using different operating methods,
depending on the size of the write.

Specifically, a new method of operation based on FS atomic extent
remapping will be supported in addition to the current HW offload-based
method.

The FS method will generally perform appreciably slower than the
HW-offload method. However, the FS method will typically be able to
support a larger atomic write unit max limit.

XFS will support a hybrid mode, where the HW offload method is used when
possible, i.e. HW offload is used when the length of the write is
supported, and otherwise FS-based atomic writes are used.

As such, there is an atomic write length at which the user may experience
appreciably slower performance.

Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.

When zero, it means that there is no such performance boundary.

Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This
is ok for older kernels which don't support this new field, as they
already report 0 in this field (from the zeroing in cp_statx()).
Furthermore, those older kernels don't support large atomic writes -
apart from block fops, but performance there is consistent for atomic
writes in the range [unit min, unit max].
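
As an illustration (not part of this patch), userspace could read the
new field roughly as below, assuming a libc statx() wrapper and uapi
headers new enough to carry stx_atomic_write_unit_max_opt:

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <fcntl.h>
  #include <sys/stat.h>

  /* Sketch: report the atomic write limits, including the new
   * optimised max, via statx(2). */
  static void report_atomic_limits(const char *path)
  {
          struct statx stx;

          if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) == 0 &&
              (stx.stx_mask & STATX_WRITE_ATOMIC))
                  printf("awu min %u max %u max_opt %u\n",
                         stx.stx_atomic_write_unit_min,
                         stx.stx_atomic_write_unit_max,
                         stx.stx_atomic_write_unit_max_opt);
  }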

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bdev.c              | 3 ++-
 fs/ext4/inode.c           | 2 +-
 fs/stat.c                 | 6 +++++-
 fs/xfs/xfs_iops.c         | 2 +-
 include/linux/fs.h        | 3 ++-
 include/linux/stat.h      | 1 +
 include/uapi/linux/stat.h | 8 ++++++--
 7 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 4844d1e27b6f..b4afc1763e8e 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1301,7 +1301,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
 
 		generic_fill_statx_atomic_writes(stat,
 			queue_atomic_write_unit_min_bytes(bd_queue),
-			queue_atomic_write_unit_max_bytes(bd_queue));
+			queue_atomic_write_unit_max_bytes(bd_queue),
+			0);
 	}
 
 	stat->blksize = bdev_io_min(bdev);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 94c7d2d828a6..cdf01e60fa6d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5692,7 +5692,7 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 			awu_max = sbi->s_awu_max;
 		}
 
-		generic_fill_statx_atomic_writes(stat, awu_min, awu_max);
+		generic_fill_statx_atomic_writes(stat, awu_min, awu_max, 0);
 	}
 
 	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
diff --git a/fs/stat.c b/fs/stat.c
index f13308bfdc98..c41855f62d22 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -136,13 +136,15 @@ EXPORT_SYMBOL(generic_fill_statx_attr);
  * @stat:	Where to fill in the attribute flags
  * @unit_min:	Minimum supported atomic write length in bytes
  * @unit_max:	Maximum supported atomic write length in bytes
+ * @unit_max_opt: Optimised maximum supported atomic write length in bytes
  *
  * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from
  * atomic write unit_min and unit_max values.
  */
 void generic_fill_statx_atomic_writes(struct kstat *stat,
 				      unsigned int unit_min,
-				      unsigned int unit_max)
+				      unsigned int unit_max,
+				      unsigned int unit_max_opt)
 {
 	/* Confirm that the request type is known */
 	stat->result_mask |= STATX_WRITE_ATOMIC;
@@ -153,6 +155,7 @@ void generic_fill_statx_atomic_writes(struct kstat *stat,
 	if (unit_min) {
 		stat->atomic_write_unit_min = unit_min;
 		stat->atomic_write_unit_max = unit_max;
+		stat->atomic_write_unit_max_opt = unit_max_opt;
 		/* Initially only allow 1x segment */
 		stat->atomic_write_segments_max = 1;
 
@@ -732,6 +735,7 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
 	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 	tmp.stx_atomic_write_segments_max = stat->atomic_write_segments_max;
+	tmp.stx_atomic_write_unit_max_opt = stat->atomic_write_unit_max_opt;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 756bd3ca8e00..f0e5d83195df 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -610,7 +610,7 @@ xfs_report_atomic_write(
 
 	if (xfs_inode_can_atomicwrite(ip))
 		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
-	generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
+	generic_fill_statx_atomic_writes(stat, unit_min, unit_max, 0);
 }
 
 STATIC int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 016b0fe1536e..7b19d8f99aff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3475,7 +3475,8 @@ void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 void generic_fill_statx_atomic_writes(struct kstat *stat,
 				      unsigned int unit_min,
-				      unsigned int unit_max);
+				      unsigned int unit_max,
+				      unsigned int unit_max_opt);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
 void __inode_add_bytes(struct inode *inode, loff_t bytes);
diff --git a/include/linux/stat.h b/include/linux/stat.h
index be7496a6a0dd..e3d00e7bb26d 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -57,6 +57,7 @@ struct kstat {
 	u32		dio_read_offset_align;
 	u32		atomic_write_unit_min;
 	u32		atomic_write_unit_max;
+	u32		atomic_write_unit_max_opt;
 	u32		atomic_write_segments_max;
 };
 
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index f78ee3670dd5..1686861aae20 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -182,8 +182,12 @@ struct statx {
 	/* File offset alignment for direct I/O reads */
 	__u32	stx_dio_read_offset_align;
 
-	/* 0xb8 */
-	__u64	__spare3[9];	/* Spare space for future expansion */
+	/* Optimised max atomic write unit in bytes */
+	__u32	stx_atomic_write_unit_max_opt;
+	__u32	__spare2[1];
+
+	/* 0xc0 */
+	__u64	__spare3[8];	/* Spare space for future expansion */
 
 	/* 0x100 */
 };
-- 
2.31.1



* [PATCH v7 02/14] xfs: add helpers to compute log item overhead
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
  2025-04-15 12:14 ` [PATCH v7 01/14] fs: add atomic write unit max opt to statx John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 03/14] xfs: add helpers to compute transaction reservation for finishing intent items John Garry
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log.  These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
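
As a rough illustration of how these helpers might be combined (a sketch
with a made-up helper name, not code from this patch), a caller could
compute a per-extent worst-case overhead by summing the intent/done
pairs plus an invalidated single-mapping buffer:

  /* Sketch: worst-case log space to finish one extent's worth of
   * deferred work items. */
  static unsigned int per_extent_log_overhead(struct xfs_mount *mp)
  {
          return xfs_bui_log_space(1) + xfs_bud_log_space() +
                 xfs_cui_log_space(1) + xfs_cud_log_space() +
                 xfs_rui_log_space(1) + xfs_rud_log_space() +
                 xfs_efi_log_space(1) + xfs_efd_log_space(1) +
                 xfs_buf_inval_log_space(1, mp->m_sb.sb_blocksize);
  }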

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_bmap_item.c     | 10 ++++++++++
 fs/xfs/xfs_bmap_item.h     |  3 +++
 fs/xfs/xfs_buf_item.c      | 19 +++++++++++++++++++
 fs/xfs/xfs_buf_item.h      |  3 +++
 fs/xfs/xfs_extfree_item.c  | 10 ++++++++++
 fs/xfs/xfs_extfree_item.h  |  3 +++
 fs/xfs/xfs_log_cil.c       |  4 +---
 fs/xfs/xfs_log_priv.h      | 13 +++++++++++++
 fs/xfs/xfs_refcount_item.c | 10 ++++++++++
 fs/xfs/xfs_refcount_item.h |  3 +++
 fs/xfs/xfs_rmap_item.c     | 10 ++++++++++
 fs/xfs/xfs_rmap_item.h     |  3 +++
 12 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 3d52e9d7ad57..646c515ee355 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -77,6 +77,11 @@ xfs_bui_item_size(
 	*nbytes += xfs_bui_log_format_sizeof(buip->bui_format.bui_nextents);
 }
 
+unsigned int xfs_bui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_bui_log_format_sizeof(nr));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given bui log item. We use only 1 iovec, and we point that
@@ -168,6 +173,11 @@ xfs_bud_item_size(
 	*nbytes += sizeof(struct xfs_bud_log_format);
 }
 
+unsigned int xfs_bud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_bud_log_format));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given bud log item. We use only 1 iovec, and we point that
diff --git a/fs/xfs/xfs_bmap_item.h b/fs/xfs/xfs_bmap_item.h
index 6fee6a508343..b42fee06899d 100644
--- a/fs/xfs/xfs_bmap_item.h
+++ b/fs/xfs/xfs_bmap_item.h
@@ -72,4 +72,7 @@ struct xfs_bmap_intent;
 
 void xfs_bmap_defer_add(struct xfs_trans *tp, struct xfs_bmap_intent *bi);
 
+unsigned int xfs_bui_log_space(unsigned int nr);
+unsigned int xfs_bud_log_space(void);
+
 #endif	/* __XFS_BMAP_ITEM_H__ */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 19eb0b7a3e58..90139e0f3271 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -103,6 +103,25 @@ xfs_buf_item_size_segment(
 	return;
 }
 
+/*
+ * Compute the worst case log item overhead for an invalidated buffer with the
+ * given map count and block size.
+ */
+unsigned int
+xfs_buf_inval_log_space(
+	unsigned int	map_count,
+	unsigned int	blocksize)
+{
+	unsigned int	chunks = DIV_ROUND_UP(blocksize, XFS_BLF_CHUNK);
+	unsigned int	bitmap_size = DIV_ROUND_UP(chunks, NBWORD);
+	unsigned int	ret =
+		offsetof(struct xfs_buf_log_format, blf_data_map) +
+			(bitmap_size * sizeof_field(struct xfs_buf_log_format,
+						    blf_data_map[0]));
+
+	return ret * map_count;
+}
+
 /*
  * Return the number of log iovecs and space needed to log the given buf log
  * item.
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 8cde85259a58..e10e324cd245 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -64,6 +64,9 @@ static inline void xfs_buf_dquot_iodone(struct xfs_buf *bp)
 void	xfs_buf_iodone(struct xfs_buf *);
 bool	xfs_buf_log_check_iovec(struct xfs_log_iovec *iovec);
 
+unsigned int xfs_buf_inval_log_space(unsigned int map_count,
+		unsigned int blocksize);
+
 extern struct kmem_cache	*xfs_buf_item_cache;
 
 #endif	/* __XFS_BUF_ITEM_H__ */
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 777438b853da..d574f5f639fa 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -83,6 +83,11 @@ xfs_efi_item_size(
 	*nbytes += xfs_efi_log_format_sizeof(efip->efi_format.efi_nextents);
 }
 
+unsigned int xfs_efi_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_efi_log_format_sizeof(nr));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given efi log item. We use only 1 iovec, and we point that
@@ -254,6 +259,11 @@ xfs_efd_item_size(
 	*nbytes += xfs_efd_log_format_sizeof(efdp->efd_format.efd_nextents);
 }
 
+unsigned int xfs_efd_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_efd_log_format_sizeof(nr));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given efd log item. We use only 1 iovec, and we point that
diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
index 41b7c4306079..c8402040410b 100644
--- a/fs/xfs/xfs_extfree_item.h
+++ b/fs/xfs/xfs_extfree_item.h
@@ -94,4 +94,7 @@ void xfs_extent_free_defer_add(struct xfs_trans *tp,
 		struct xfs_extent_free_item *xefi,
 		struct xfs_defer_pending **dfpp);
 
+unsigned int xfs_efi_log_space(unsigned int nr);
+unsigned int xfs_efd_log_space(unsigned int nr);
+
 #endif	/* __XFS_EXTFREE_ITEM_H__ */
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1ca406ec1b40..f66d2d430e4f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -309,9 +309,7 @@ xlog_cil_alloc_shadow_bufs(
 		 * Then round nbytes up to 64-bit alignment so that the initial
 		 * buffer alignment is easy to calculate and verify.
 		 */
-		nbytes += niovecs *
-			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
-		nbytes = round_up(nbytes, sizeof(uint64_t));
+		nbytes = xlog_item_space(niovecs, nbytes);
 
 		/*
 		 * The data buffer needs to start 64-bit aligned, so round up
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index f3d78869e5e5..39a102cc1b43 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -698,4 +698,17 @@ xlog_kvmalloc(
 	return p;
 }
 
+/*
+ * Given a count of iovecs and space for a log item, compute the space we need
+ * in the log to store that data plus the log headers.
+ */
+static inline unsigned int
+xlog_item_space(
+	unsigned int	niovecs,
+	unsigned int	nbytes)
+{
+	nbytes += niovecs * (sizeof(uint64_t) + sizeof(struct xlog_op_header));
+	return round_up(nbytes, sizeof(uint64_t));
+}
+
 #endif	/* __XFS_LOG_PRIV_H__ */
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index fe2d7aab8554..076501123d89 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -78,6 +78,11 @@ xfs_cui_item_size(
 	*nbytes += xfs_cui_log_format_sizeof(cuip->cui_format.cui_nextents);
 }
 
+unsigned int xfs_cui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_cui_log_format_sizeof(nr));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given cui log item. We use only 1 iovec, and we point that
@@ -179,6 +184,11 @@ xfs_cud_item_size(
 	*nbytes += sizeof(struct xfs_cud_log_format);
 }
 
+unsigned int xfs_cud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_cud_log_format));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given cud log item. We use only 1 iovec, and we point that
diff --git a/fs/xfs/xfs_refcount_item.h b/fs/xfs/xfs_refcount_item.h
index bfee8f30c63c..0fc3f493342b 100644
--- a/fs/xfs/xfs_refcount_item.h
+++ b/fs/xfs/xfs_refcount_item.h
@@ -76,4 +76,7 @@ struct xfs_refcount_intent;
 void xfs_refcount_defer_add(struct xfs_trans *tp,
 		struct xfs_refcount_intent *ri);
 
+unsigned int xfs_cui_log_space(unsigned int nr);
+unsigned int xfs_cud_log_space(void);
+
 #endif	/* __XFS_REFCOUNT_ITEM_H__ */
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 89decffe76c8..c99700318ec2 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -77,6 +77,11 @@ xfs_rui_item_size(
 	*nbytes += xfs_rui_log_format_sizeof(ruip->rui_format.rui_nextents);
 }
 
+unsigned int xfs_rui_log_space(unsigned int nr)
+{
+	return xlog_item_space(1, xfs_rui_log_format_sizeof(nr));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given rui log item. We use only 1 iovec, and we point that
@@ -180,6 +185,11 @@ xfs_rud_item_size(
 	*nbytes += sizeof(struct xfs_rud_log_format);
 }
 
+unsigned int xfs_rud_log_space(void)
+{
+	return xlog_item_space(1, sizeof(struct xfs_rud_log_format));
+}
+
 /*
  * This is called to fill in the vector of log iovecs for the
  * given rud log item. We use only 1 iovec, and we point that
diff --git a/fs/xfs/xfs_rmap_item.h b/fs/xfs/xfs_rmap_item.h
index 40d331555675..3a99f0117f2d 100644
--- a/fs/xfs/xfs_rmap_item.h
+++ b/fs/xfs/xfs_rmap_item.h
@@ -75,4 +75,7 @@ struct xfs_rmap_intent;
 
 void xfs_rmap_defer_add(struct xfs_trans *tp, struct xfs_rmap_intent *ri);
 
+unsigned int xfs_rui_log_space(unsigned int nr);
+unsigned int xfs_rud_log_space(void);
+
 #endif	/* __XFS_RMAP_ITEM_H__ */
-- 
2.31.1



* [PATCH v7 03/14] xfs: add helpers to compute transaction reservation for finishing intent items
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
  2025-04-15 12:14 ` [PATCH v7 01/14] fs: add atomic write unit max opt to statx John Garry
  2025-04-15 12:14 ` [PATCH v7 02/14] xfs: add helpers to compute log item overhead John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 04/14] xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite() John Garry
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions.  These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 165 ++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_trans_resv.h |  18 ++++
 2 files changed, 152 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 13d00c7166e1..580d00ae2857 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -263,6 +263,42 @@ xfs_rtalloc_block_count(
  * register overflow from temporaries in the calculations.
  */
 
+/*
+ * Finishing a data device refcount updates (t1):
+ *    the agfs of the ags containing the blocks: nr_ops * sector size
+ *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_cui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr_ops)
+{
+	if (!xfs_has_reflink(mp))
+		return 0;
+
+	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Realtime refcount updates (t2);
+ *    the rt refcount inode
+ *    the rtrefcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_rt_cui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr_ops)
+{
+	if (!xfs_has_rtreflink(mp))
+		return 0;
+
+	return xfs_calc_inode_res(mp, 1) +
+	       xfs_calc_buf_res(xfs_rtrefcountbt_block_count(mp, nr_ops),
+				     mp->m_sb.sb_blocksize);
+}
+
 /*
  * Compute the log reservation required to handle the refcount update
  * transaction.  Refcount updates are always done via deferred log items.
@@ -280,19 +316,10 @@ xfs_calc_refcountbt_reservation(
 	struct xfs_mount	*mp,
 	unsigned int		nr_ops)
 {
-	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
-	unsigned int		t1, t2 = 0;
+	unsigned int		t1, t2;
 
-	if (!xfs_has_reflink(mp))
-		return 0;
-
-	t1 = xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops), blksz);
-
-	if (xfs_has_realtime(mp))
-		t2 = xfs_calc_inode_res(mp, 1) +
-		     xfs_calc_buf_res(xfs_rtrefcountbt_block_count(mp, nr_ops),
-				     blksz);
+	t1 = xfs_calc_finish_cui_reservation(mp, nr_ops);
+	t2 = xfs_calc_finish_rt_cui_reservation(mp, nr_ops);
 
 	return max(t1, t2);
 }
@@ -379,6 +406,96 @@ xfs_calc_write_reservation_minlogsize(
 	return xfs_calc_write_reservation(mp, true);
 }
 
+/*
+ * Finishing an EFI can free the blocks and bmap blocks (t2):
+ *    the agf for each of the ags: nr * sector size
+ *    the agfl for each of the ags: nr * sector size
+ *    the super block to reflect the freed blocks: sector size
+ *    worst case split in allocation btrees per extent assuming nr extents:
+ *		nr exts * 2 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_efi_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	return xfs_calc_buf_res((2 * nr) + 1, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_allocfree_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Or, if it's a realtime file (t3):
+ *    the agf for each of the ags: 2 * sector size
+ *    the agfl for each of the ags: 2 * sector size
+ *    the super block to reflect the freed blocks: sector size
+ *    the realtime bitmap:
+ *		2 exts * ((XFS_BMBT_MAX_EXTLEN / rtextsize) / NBBY) bytes
+ *    the realtime summary: 2 exts * 1 block
+ *    worst case split in allocation btrees per extent assuming 2 extents:
+ *		2 exts * 2 trees * (2 * max depth - 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_rt_efi_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_realtime(mp))
+		return 0;
+
+	return xfs_calc_buf_res((2 * nr) + 1, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_rtalloc_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize) +
+	       xfs_calc_buf_res(xfs_allocfree_block_count(mp, nr),
+			       mp->m_sb.sb_blocksize);
+}
+
+/*
+ * Finishing an RUI is the same as an EFI.  We can split the rmap btree twice
+ * on each end of the record, and that can cause the AGFL to be refilled or
+ * emptied out.
+ */
+inline unsigned int
+xfs_calc_finish_rui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_rmapbt(mp))
+		return 0;
+	return xfs_calc_finish_efi_reservation(mp, nr);
+}
+
+/*
+ * Finishing an RUI is the same as an EFI.  We can split the rmap btree twice
+ * on each end of the record, and that can cause the AGFL to be refilled or
+ * emptied out.
+ */
+inline unsigned int
+xfs_calc_finish_rt_rui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	if (!xfs_has_rtrmapbt(mp))
+		return 0;
+	return xfs_calc_finish_rt_efi_reservation(mp, nr);
+}
+
+/*
+ * In finishing a BUI, we can modify:
+ *    the inode being truncated: inode size
+ *    dquots
+ *    the inode's bmap btree: (max depth + 1) * block size
+ */
+inline unsigned int
+xfs_calc_finish_bui_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr)
+{
+	return xfs_calc_inode_res(mp, 1) + XFS_DQUOT_LOGRES +
+	       xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1,
+			       mp->m_sb.sb_blocksize);
+}
+
 /*
  * In truncating a file we free up to two extents at once.  We can modify (t1):
  *    the inode being truncated: inode size
@@ -411,16 +528,8 @@ xfs_calc_itruncate_reservation(
 	t1 = xfs_calc_inode_res(mp, 1) +
 	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
 
-	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 4), blksz);
-
-	if (xfs_has_realtime(mp)) {
-		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_rtalloc_block_count(mp, 2), blksz) +
-		     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 2), blksz);
-	} else {
-		t3 = 0;
-	}
+	t2 = xfs_calc_finish_efi_reservation(mp, 4);
+	t3 = xfs_calc_finish_rt_efi_reservation(mp, 2);
 
 	/*
 	 * In the early days of reflink, we included enough reservation to log
@@ -501,9 +610,7 @@ xfs_calc_rename_reservation(
 	     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
 			XFS_FSB_TO_B(mp, 1));
 
-	t2 = xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 3),
-			XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 3);
 
 	if (xfs_has_parent(mp)) {
 		unsigned int	rename_overhead, exchange_overhead;
@@ -611,9 +718,7 @@ xfs_calc_link_reservation(
 	overhead += xfs_calc_iunlink_remove_reservation(mp);
 	t1 = xfs_calc_inode_res(mp, 2) +
 	     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
-	t2 = xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 1),
-			      XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 1);
 
 	if (xfs_has_parent(mp)) {
 		t3 = resp->tr_attrsetm.tr_logres;
@@ -676,9 +781,7 @@ xfs_calc_remove_reservation(
 
 	t1 = xfs_calc_inode_res(mp, 2) +
 	     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1));
-	t2 = xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_block_count(mp, 2),
-			      XFS_FSB_TO_B(mp, 1));
+	t2 = xfs_calc_finish_efi_reservation(mp, 2);
 
 	if (xfs_has_parent(mp)) {
 		t3 = resp->tr_attrrm.tr_logres;
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 0554b9d775d2..d9d0032cbbc5 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -98,6 +98,24 @@ struct xfs_trans_resv {
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
 uint xfs_allocfree_block_count(struct xfs_mount *mp, uint num_ops);
 
+unsigned int xfs_calc_finish_bui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_efi_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_efi_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_rui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_rui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
+unsigned int xfs_calc_finish_cui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+unsigned int xfs_calc_finish_rt_cui_reservation(struct xfs_mount *mp,
+		unsigned int nr_ops);
+
 unsigned int xfs_calc_itruncate_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
-- 
2.31.1



* [PATCH v7 04/14] xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (2 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 03/14] xfs: add helpers to compute transaction reservation for finishing intent items John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 05/14] xfs: allow block allocator to take an alignment hint John Garry
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

In the future we will want to be able to check whether specifically HW
offload-based atomic writes are possible, so rename
xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite().

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  | 2 +-
 fs/xfs/xfs_inode.h | 2 +-
 fs/xfs/xfs_iops.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 84f08c976ac4..653e42ccc0c3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1488,7 +1488,7 @@ xfs_file_open(
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
-	if (xfs_inode_can_atomicwrite(XFS_I(inode)))
+	if (xfs_inode_can_hw_atomicwrite(XFS_I(inode)))
 		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 	return generic_file_open(inode, file);
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index eae0159983ca..cff643cd03fc 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -357,7 +357,7 @@ static inline bool xfs_inode_has_bigrtalloc(const struct xfs_inode *ip)
 		(ip)->i_mount->m_rtdev_targp : (ip)->i_mount->m_ddev_targp)
 
 static inline bool
-xfs_inode_can_atomicwrite(
+xfs_inode_can_hw_atomicwrite(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index f0e5d83195df..d324044a2225 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -608,7 +608,7 @@ xfs_report_atomic_write(
 {
 	unsigned int		unit_min = 0, unit_max = 0;
 
-	if (xfs_inode_can_atomicwrite(ip))
+	if (xfs_inode_can_hw_atomicwrite(ip))
 		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
 	generic_fill_statx_atomic_writes(stat, unit_min, unit_max, 0);
 }
-- 
2.31.1



* [PATCH v7 05/14] xfs: allow block allocator to take an alignment hint
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (3 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 04/14] xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 06/14] xfs: refactor xfs_reflink_end_cow_extent() John Garry
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.

This will be useful for atomic writes to ensure that we are not
allocated extents which are unsuitable for atomic writes.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 5 +++++
 fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 63255820b58a..d954f9b8071f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3312,6 +3312,11 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	/* Try to align start block to any minimum allocation alignment */
+	if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
+		args->alignment = align;
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index b4d9c6e0f3f9..d5f2729305fa 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	(1u << 10)
 
+/* Try to align allocations to the extent size hint */
+#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
+	{ XFS_BMAPI_EXTSZALIGN,	"EXTSZALIGN" }
 
 
 static inline int xfs_bmapi_aflag(int w)
-- 
2.31.1



* [PATCH v7 06/14] xfs: refactor xfs_reflink_end_cow_extent()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (4 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 05/14] xfs: allow block allocator to take an alignment hint John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 07/14] xfs: refine atomic write size check in xfs_file_write_iter() John Garry
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in the future when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 72 ++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index cc3b4df88110..bd711c5bb6bb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -786,35 +786,19 @@ xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
 	xfs_fileoff_t		end_fsb)
 {
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_irec	got, del, data;
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-	unsigned int		resblks;
 	int			nmaps;
 	bool			isrt = XFS_IS_REALTIME_INODE(ip);
 	int			error;
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-			XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	/*
-	 * Lock the inode.  We have to ijoin without automatic unlock because
-	 * the lead transaction is the refcountbt record deletion; the data
-	 * fork update follows as a deferred log item.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
@@ -823,7 +807,7 @@ xfs_reflink_end_cow_extent(
 	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
 	    got.br_startoff >= end_fsb) {
 		*offset_fsb = end_fsb;
-		goto out_cancel;
+		return 0;
 	}
 
 	/*
@@ -837,7 +821,7 @@ xfs_reflink_end_cow_extent(
 		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
 		    got.br_startoff >= end_fsb) {
 			*offset_fsb = end_fsb;
-			goto out_cancel;
+			return 0;
 		}
 	}
 	del = got;
@@ -846,14 +830,14 @@ xfs_reflink_end_cow_extent(
 	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
 			XFS_IEXT_REFLINK_END_COW_CNT);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;
 	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
 			&nmaps, 0);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* We can only remap the smaller of the two extent sizes. */
 	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -882,7 +866,7 @@ xfs_reflink_end_cow_extent(
 		error = xfs_bunmapi(NULL, ip, data.br_startoff,
 				data.br_blockcount, 0, 1, &done);
 		if (error)
-			goto out_cancel;
+			return error;
 		ASSERT(done);
 	}
 
@@ -899,17 +883,45 @@ xfs_reflink_end_cow_extent(
 	/* Remove the mapping from the CoW fork. */
 	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
 	/* Update the caller about how much progress we made. */
 	*offset_fsb = del.br_startoff + del.br_blockcount;
 	return 0;
+}
 
-out_cancel:
-	xfs_trans_cancel(tp);
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately.  Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	unsigned int		resblks;
+	int			error;
+
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+	if (error)
+		xfs_trans_cancel(tp);
+	else
+		error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-- 
2.31.1



* [PATCH v7 07/14] xfs: refine atomic write size check in xfs_file_write_iter()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (5 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 06/14] xfs: refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 08/14] xfs: add xfs_atomic_write_cow_iomap_begin() John Garry
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, partly refactor
xfs_report_atomic_write() into helpers -
xfs_get_atomic_write_{min, max}() - and use those helpers to find the
per-inode atomic write limits and check against them.

Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 +++++-------
 fs/xfs/xfs_iops.c | 36 +++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_iops.h |  3 +++
 3 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 653e42ccc0c3..1302783a7157 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1032,14 +1032,12 @@ xfs_file_write_iter(
 		return xfs_file_dax_write(iocb, from);
 
 	if (iocb->ki_flags & IOCB_ATOMIC) {
-		/*
-		 * Currently only atomic writing of a single FS block is
-		 * supported. It would be possible to atomic write smaller than
-		 * a FS block, but there is no requirement to support this.
-		 * Note that iomap also does not support this yet.
-		 */
-		if (ocount != ip->i_mount->m_sb.sb_blocksize)
+		if (ocount < xfs_get_atomic_write_min(ip))
 			return -EINVAL;
+
+		if (ocount > xfs_get_atomic_write_max(ip))
+			return -EINVAL;
+
 		ret = generic_atomic_write_valid(iocb, from);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d324044a2225..3b5aa39dbfe9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -601,16 +601,42 @@ xfs_report_dioalign(
 		stat->dio_offset_align = stat->dio_read_offset_align;
 }
 
+unsigned int
+xfs_get_atomic_write_min(
+	struct xfs_inode	*ip)
+{
+	if (!xfs_inode_can_hw_atomicwrite(ip))
+		return 0;
+
+	return ip->i_mount->m_sb.sb_blocksize;
+}
+
+unsigned int
+xfs_get_atomic_write_max(
+	struct xfs_inode	*ip)
+{
+	if (!xfs_inode_can_hw_atomicwrite(ip))
+		return 0;
+
+	return ip->i_mount->m_sb.sb_blocksize;
+}
+
+unsigned int
+xfs_get_atomic_write_max_opt(
+	struct xfs_inode	*ip)
+{
+	return 0;
+}
+
 static void
 xfs_report_atomic_write(
 	struct xfs_inode	*ip,
 	struct kstat		*stat)
 {
-	unsigned int		unit_min = 0, unit_max = 0;
-
-	if (xfs_inode_can_hw_atomicwrite(ip))
-		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
-	generic_fill_statx_atomic_writes(stat, unit_min, unit_max, 0);
+	generic_fill_statx_atomic_writes(stat,
+			xfs_get_atomic_write_min(ip),
+			xfs_get_atomic_write_max(ip),
+			xfs_get_atomic_write_max_opt(ip));
 }
 
 STATIC int
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..0896f6b8b3b8 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,8 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+unsigned int xfs_get_atomic_write_min(struct xfs_inode *ip);
+unsigned int xfs_get_atomic_write_max(struct xfs_inode *ip);
+unsigned int xfs_get_atomic_write_max_opt(struct xfs_inode *ip);
 
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1



* [PATCH v7 08/14] xfs: add xfs_atomic_write_cow_iomap_begin()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (6 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 07/14] xfs: refine atomic write size check in xfs_file_write_iter() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() John Garry
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.

Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.

The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
	- if part or full extent is found, proceed to process found extent
	- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return

The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.

Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.

Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c   | 126 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iomap.h   |   1 +
 fs/xfs/xfs_reflink.c |   2 +-
 fs/xfs/xfs_reflink.h |   2 +
 fs/xfs/xfs_trace.h   |  22 ++++++++
 5 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index cb23c8871f81..049655ebc3f7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1022,6 +1022,132 @@ const struct iomap_ops xfs_zoned_direct_write_iomap_ops = {
 };
 #endif /* CONFIG_XFS_RT */
 
+static int
+xfs_atomic_write_cow_iomap_begin(
+	struct inode		*inode,
+	loff_t			offset,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	const xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	xfs_filblks_t		count_fsb = end_fsb - offset_fsb;
+	int			nmaps = 1;
+	xfs_filblks_t		resaligned;
+	struct xfs_bmbt_irec	cmap;
+	struct xfs_iext_cursor	icur;
+	struct xfs_trans	*tp;
+	unsigned int		dblocks = 0, rblocks = 0;
+	int			error;
+	u64			seq;
+
+	ASSERT(flags & IOMAP_WRITE);
+	ASSERT(flags & IOMAP_DIRECT);
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	if (WARN_ON_ONCE(!xfs_has_reflink(mp)))
+		return -EINVAL;
+
+	/* blocks are always allocated in this path */
+	if (flags & IOMAP_NOWAIT)
+		return -EAGAIN;
+
+	trace_xfs_iomap_atomic_write_cow(ip, offset, length);
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	if (!ip->i_cowfp) {
+		ASSERT(!xfs_is_reflink_inode(ip));
+		xfs_ifork_init_cow(ip);
+	}
+
+	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &cmap))
+		cmap.br_startoff = end_fsb;
+	if (cmap.br_startoff <= offset_fsb) {
+		xfs_trim_extent(&cmap, offset_fsb, count_fsb);
+		goto found;
+	}
+
+	end_fsb = cmap.br_startoff;
+	count_fsb = end_fsb - offset_fsb;
+
+	resaligned = xfs_aligned_fsb_count(offset_fsb, count_fsb,
+			xfs_get_cowextsz_hint(ip));
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+		rblocks = resaligned;
+	} else {
+		dblocks = XFS_DIOSTRAT_SPACE_RES(mp, resaligned);
+		rblocks = 0;
+	}
+
+	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
+			rblocks, false, &tp);
+	if (error)
+		return error;
+
+	/* extent layout could have changed since the unlock, so check again */
+	if (!xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb, &icur, &cmap))
+		cmap.br_startoff = end_fsb;
+	if (cmap.br_startoff <= offset_fsb) {
+		xfs_trim_extent(&cmap, offset_fsb, count_fsb);
+		xfs_trans_cancel(tp);
+		goto found;
+	}
+
+	/*
+	 * Allocate the entire reservation as unwritten blocks.
+	 *
+	 * Use XFS_BMAPI_EXTSZALIGN to hint at aligning new extents according to
+	 * extszhint, such that there will be a greater chance that future
+	 * atomic writes to that same range will be aligned (and don't require
+	 * this COW-based method).
+	 */
+	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
+			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC |
+			XFS_BMAPI_EXTSZALIGN, 0, &cmap, &nmaps);
+	if (error) {
+		xfs_trans_cancel(tp);
+		goto out_unlock;
+	}
+
+	xfs_inode_set_cowblocks_tag(ip);
+	error = xfs_trans_commit(tp);
+	if (error)
+		goto out_unlock;
+
+found:
+	if (cmap.br_state != XFS_EXT_NORM) {
+		error = xfs_reflink_convert_cow_locked(ip, offset_fsb,
+				count_fsb);
+		if (error)
+			goto out_unlock;
+		cmap.br_state = XFS_EXT_NORM;
+	}
+
+	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
+	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
+const struct iomap_ops xfs_atomic_write_cow_iomap_ops = {
+	.iomap_begin		= xfs_atomic_write_cow_iomap_begin,
+};
+
 static int
 xfs_dax_write_iomap_end(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index d330c4a581b1..674f8ac1b9bd 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -56,5 +56,6 @@ extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
 extern const struct iomap_ops xfs_dax_write_iomap_ops;
+extern const struct iomap_ops xfs_atomic_write_cow_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index bd711c5bb6bb..f5d338916098 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -293,7 +293,7 @@ xfs_bmap_trim_cow(
 	return xfs_reflink_trim_around_shared(ip, imap, shared);
 }
 
-static int
+int
 xfs_reflink_convert_cow_locked(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		offset_fsb,
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cc4e92278279..379619f24247 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -35,6 +35,8 @@ int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 		bool convert_now);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+int xfs_reflink_convert_cow_locked(struct xfs_inode *ip,
+		xfs_fileoff_t offset_fsb, xfs_filblks_t count_fsb);
 
 extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
 		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e56ba1963160..9554578c6da4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1657,6 +1657,28 @@ DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
 DEFINE_RW_EVENT(xfs_reflink_bounce_dio_write);
 
+TRACE_EVENT(xfs_iomap_atomic_write_cow,
+	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
+	TP_ARGS(ip, offset, count),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino)
+		__field(xfs_off_t, offset)
+		__field(ssize_t, count)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->ino = ip->i_ino;
+		__entry->offset = offset;
+		__entry->count = count;
+	),
+	TP_printk("dev %d:%d ino 0x%llx pos 0x%llx bytecount 0x%zx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __entry->offset,
+		  __entry->count)
+)
+
 DECLARE_EVENT_CLASS(xfs_imap_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count,
 		 int whichfork, struct xfs_bmbt_irec *irec),
-- 
2.31.1



* [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (7 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 08/14] xfs: add xfs_atomic_write_cow_iomap_begin() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 17:34   ` Darrick J. Wong
  2025-04-15 12:14 ` [PATCH v7 10/14] xfs: commit CoW-based atomic writes atomically John Garry
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

For when large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.

Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
  sparse or mixed-mapping extents
- the write length is greater than HW offload can support

In those cases, we need to fall back to the CoW-based atomic write mode.
For this, report the special code -ENOPROTOOPT to inform the caller that
the HW offload-based method is not possible.

In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate fewer blocks than are required for the write, then again the HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.
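
To illustrate the unaligned mapping case in isolation, here is a rough
standalone userspace sketch (not kernel code; the helper name and the block
numbers are made up):

#include <stdbool.h>
#include <stdio.h>

/*
 * A mapping is naturally aligned for an atomic write if its start block is a
 * multiple of its length; for power-of-two lengths this guarantees that the
 * extent does not straddle a larger alignment boundary, which is what a
 * REQ_ATOMIC-capable device requires.
 */
static bool naturally_aligned(unsigned long long startblock,
			      unsigned long long blockcount)
{
	return (startblock % blockcount) == 0;
}

int main(void)
{
	/* hypothetical mappings: { start block, block count } */
	struct { unsigned long long start, len; } maps[] = {
		{ 1024, 16 },	/* aligned: HW offload is possible */
		{ 1030, 16 },	/* misaligned: needs the CoW fallback */
	};

	for (int i = 0; i < 2; i++)
		printf("start %llu len %llu -> %s\n", maps[i].start,
		       maps[i].len,
		       naturally_aligned(maps[i].start, maps[i].len) ?
				"REQ_ATOMIC possible" : "CoW fallback");
	return 0;
}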

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 63 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 049655ebc3f7..02bb8257ea24 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -798,6 +798,41 @@ imap_spans_range(
 	return true;
 }
 
+static bool
+xfs_bmap_hw_atomic_write_possible(
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fsize_t		len = XFS_FSB_TO_B(mp, end_fsb - offset_fsb);
+
+	/*
+	 * atomic writes are required to be naturally aligned for disk blocks,
+	 * which ensures that we adhere to block layer rules that we won't
+	 * straddle any boundary or violate the write alignment requirement.
+	 */
+	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
+		return false;
+
+	/*
+	 * Spanning multiple extents would mean that multiple BIOs would be
+	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
+	 * atomics.
+	 */
+	if (!imap_spans_range(imap, offset_fsb, end_fsb))
+		return false;
+
+	/*
+	 * The ->iomap_begin caller should ensure this, but check anyway.
+	 */
+	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
+		return false;
+
+	return true;
+}
+
 static int
 xfs_direct_write_iomap_begin(
 	struct inode		*inode,
@@ -812,9 +847,11 @@ xfs_direct_write_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	xfs_fileoff_t		orig_end_fsb = end_fsb;
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	u16			iomap_flags = 0;
+	bool			needs_alloc;
 	unsigned int		lockmode;
 	u64			seq;
 
@@ -875,13 +912,37 @@ xfs_direct_write_iomap_begin(
 				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
-		if (shared)
+		if (shared) {
+			if ((flags & IOMAP_ATOMIC) &&
+			    !xfs_bmap_hw_atomic_write_possible(ip, &cmap,
+					offset_fsb, end_fsb)) {
+				error = -ENOPROTOOPT;
+				goto out_unlock;
+			}
 			goto out_found_cow;
+		}
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if (imap_needs_alloc(inode, flags, &imap, nimaps))
+	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+	if (flags & IOMAP_ATOMIC) {
+		error = -ENOPROTOOPT;
+		/*
+		 * If we allocate less than what is required for the write
+		 * then we may end up with multiple extents, which means that
+		 * REQ_ATOMIC-based writes cannot be used, so avoid this possibility.
+		 */
+		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
+			goto out_unlock;
+
+		if (!xfs_bmap_hw_atomic_write_possible(ip, &imap, offset_fsb,
+				orig_end_fsb))
+			goto out_unlock;
+	}
+
+	if (needs_alloc)
 		goto allocate_blocks;
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 10/14] xfs: commit CoW-based atomic writes atomically
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (8 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 12:14 ` [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic() John Garry
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Note that there is a limit on the number of log intent items which can fit
into a single transaction, but this is being ignored for now since the count
of items for a typical atomic write would be much less than is typically
supported. A typical atomic write would be expected to be 64KB or less, which
means only 16 possible extent unmaps, which is quite small.
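
As a worked example of that claim (assuming a 4KB block size): a 64KB atomic
write covers 64KB / 4KB = 16 blocks, so even the worst case of one extent
unmap per block stays well below what a single transaction can accommodate.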

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c |  6 ++++
 fs/xfs/libxfs/xfs_trans_resv.h |  1 +
 fs/xfs/xfs_file.c              |  5 ++-
 fs/xfs/xfs_reflink.c           | 56 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h           |  2 ++
 5 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 580d00ae2857..797eb6a41e9b 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -1378,4 +1378,10 @@ xfs_trans_resv_calc(
 	resp->tr_itruncate.tr_logcount += logcount_adj;
 	resp->tr_write.tr_logcount += logcount_adj;
 	resp->tr_qm_dqalloc.tr_logcount += logcount_adj;
+
+	/*
+	 * Now that we've finished computing tr_itruncate, use it as the
+	 * default for atomic write ioends.
+	 */
+	resp->tr_atomic_ioend = resp->tr_itruncate;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index d9d0032cbbc5..670045d417a6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -48,6 +48,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_dqalloc;	/* allocate quota on disk */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_atomic_ioend; /* untorn write completion */
 };
 
 /* shorthand way of accessing reservation structure */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1302783a7157..ba4b02abc6e4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -576,7 +576,10 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5d338916098..218dee76768b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -984,6 +984,62 @@ xfs_reflink_end_cow(
 	return error;
 }
 
+/*
+ * Fully remap all of the file's data fork at once, which is the critical part
+ * in achieving atomic behaviour.
+ * The regular CoW end path does not use this function so as to keep the block
+ * reservation per transaction as low as possible.
+ */
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	/*
+	 * Each remapping operation could cause a btree split, so in the worst
+	 * case that's one for each block.
+	 */
+	resblks = (end_fsb - offset_fsb) *
+			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_atomic_ioend, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error) {
+		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+				end_fsb);
+	}
+	if (error) {
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+		goto out_cancel;
+	}
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
+
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
  * metadata.  The ondisk metadata does not track which inode created the
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 379619f24247..412e9b6f2082 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -45,6 +45,8 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+int xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (9 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 10/14] xfs: commit CoW-based atomic writes atomically John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-21  4:00   ` Darrick J. Wong
  2025-04-21 21:18   ` Luis Chamberlain
  2025-04-15 12:14 ` [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max() John Garry
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

The function works based on two operating modes:
- HW offload, i.e. REQ_ATOMIC-based
- CoW-based, with out-of-place writes and atomic extent remapping

The preferred method is HW offload as it will be faster. If HW offload is
not possible, then we fall back to the CoW-based method.

HW offload would not be possible when the write length exceeds the HW
offload limit, when the write spans multiple extents, when the disk blocks
are unaligned, etc.

Apart from the write exceeding the HW offload limit, other conditions for
HW offload can only be detected in the iomap handling for the write. As
such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. The special code
-ENOPROTOOPT is returned from ->iomap_begin() to indicate that HW offload is
not possible.
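
A minimal userspace sketch of that retry shape, assuming a hypothetical
pretend_issue_write() helper in place of the real iomap machinery (the kernel
implementation follows in the patch):

#include <errno.h>
#include <stdio.h>

enum write_mode { MODE_HW_ATOMIC, MODE_COW_ATOMIC };

/* Stand-in for issuing the I/O; fails when HW offload cannot be used. */
static int pretend_issue_write(enum write_mode mode, int hw_possible)
{
	if (mode == MODE_HW_ATOMIC && !hw_possible)
		return -ENOPROTOOPT;
	return 0;
}

int main(void)
{
	enum write_mode mode = MODE_HW_ATOMIC;	/* prefer HW offload */
	int hw_possible = 0;	/* e.g. the write spans two extents */
	int ret;

retry:
	ret = pretend_issue_write(mode, hw_possible);
	if (ret == -ENOPROTOOPT && mode == MODE_HW_ATOMIC) {
		/* only the HW path reports -ENOPROTOOPT; retry via CoW */
		mode = MODE_COW_ATOMIC;
		goto retry;
	}
	printf("completed via %s, ret = %d\n",
	       mode == MODE_HW_ATOMIC ? "HW offload" : "CoW fallback", ret);
	return 0;
}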

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/xfs_file.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ba4b02abc6e4..81a377f65aa3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -728,6 +728,72 @@ xfs_file_dio_write_zoned(
 	return ret;
 }
 
+/*
+ * Handle block atomic writes
+ *
+ * Two methods of atomic writes are supported:
+ * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
+ *   disk
+ * - COW-based, which uses a COW fork as a staging extent for data updates
+ *   before atomically updating extent mappings for the range being written
+ *
+ */
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	ssize_t			ret, ocount = iov_iter_count(from);
+	const struct iomap_ops	*dops;
+
+	/*
+	 * HW offload should be faster, so try that first if it is already
+	 * known that the write length is not too large.
+	 */
+	if (ocount > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
+		dops = &xfs_atomic_write_cow_iomap_ops;
+	else
+		dops = &xfs_direct_write_iomap_ops;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
+	if (ret)
+		goto out_unlock;
+
+	/* Demote similar to xfs_file_dio_write_aligned() */
+	if (iolock == XFS_IOLOCK_EXCL) {
+		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
+		iolock = XFS_IOLOCK_SHARED;
+	}
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
+			0, NULL, 0);
+
+	/*
+	 * The retry mechanism is based on the ->iomap_begin method returning
+	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
+	 * possible. The REQ_ATOMIC-based method typically not be possible if
+	 * possible. The REQ_ATOMIC-based method is typically not possible if
+	 */
+	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
+		xfs_iunlock(ip, iolock);
+		dops = &xfs_atomic_write_cow_iomap_ops;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -843,6 +909,8 @@ xfs_file_dio_write(
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
 	if (xfs_is_zoned_inode(ip))
 		return xfs_file_dio_write_zoned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from,
 			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max()
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (10 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 16:25   ` Darrick J. Wong
  2025-04-15 12:14 ` [PATCH v7 13/14] xfs: update atomic write limits John Garry
  2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
  13 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.

The limit of a CoW-based atomic write will be the limit of the number of
log items which can fit into a single transaction.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

Function xfs_atomic_write_logitems() is added to find the limit on the number
of log items which can fit in a single transaction.

Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism.  In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
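
A quick userspace sketch of the power-of-two-factor clamp described above,
using made-up AG sizes:

#include <stdio.h>
#include <strings.h>	/* ffs() */

/* Greatest power-of-two factor of nr, mirroring the helper added below. */
static unsigned int max_pow_of_two_factor(const unsigned int nr)
{
	return 1U << (ffs(nr) - 1);
}

int main(void)
{
	/* hypothetical AG sizes in fs blocks */
	unsigned int agblocks[] = { 1048576, 1441792, 244190646 };

	for (int i = 0; i < 3; i++)
		printf("agblocks %u -> largest aligned awu %u blocks\n",
		       agblocks[i], max_pow_of_two_factor(agblocks[i]));
	return 0;
}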

Signed-off-by: John Garry <john.g.garry@oracle.com>
[djwong: use a new reservation type for atomic write ioends]
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 90 ++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |  2 +
 fs/xfs/xfs_mount.c             | 80 ++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h             |  6 +++
 fs/xfs/xfs_reflink.c           | 13 +++++
 fs/xfs/xfs_reflink.h           |  2 +
 fs/xfs/xfs_trace.h             | 60 +++++++++++++++++++++++
 7 files changed, 253 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 797eb6a41e9b..f530aa5d72f5 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -22,6 +22,12 @@
 #include "xfs_rtbitmap.h"
 #include "xfs_attr_item.h"
 #include "xfs_log.h"
+#include "xfs_defer.h"
+#include "xfs_bmap_item.h"
+#include "xfs_extfree_item.h"
+#include "xfs_rmap_item.h"
+#include "xfs_refcount_item.h"
+#include "xfs_trace.h"
 
 #define _ALLOC	true
 #define _FREE	false
@@ -1385,3 +1391,87 @@ xfs_trans_resv_calc(
 	 */
 	resp->tr_atomic_ioend = resp->tr_itruncate;
 }
+
+/*
+ * Return the per-extent and fixed transaction reservation sizes needed to
+ * complete an atomic write.
+ */
+STATIC unsigned int
+xfs_calc_atomic_write_ioend_geometry(
+	struct xfs_mount	*mp,
+	unsigned int		*step_size)
+{
+	const unsigned int	efi = xfs_efi_log_space(1);
+	const unsigned int	efd = xfs_efd_log_space(1);
+	const unsigned int	rui = xfs_rui_log_space(1);
+	const unsigned int	rud = xfs_rud_log_space();
+	const unsigned int	cui = xfs_cui_log_space(1);
+	const unsigned int	cud = xfs_cud_log_space();
+	const unsigned int	bui = xfs_bui_log_space(1);
+	const unsigned int	bud = xfs_bud_log_space();
+
+	/*
+	 * Maximum overhead to complete an atomic write ioend in software:
+	 * remove data fork extent + remove cow fork extent + map extent into
+	 * data fork.
+	 *
+	 * tx0: Creates a BUI and a CUI and that's all it needs.
+	 *
+	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
+	 * enough space to relog the CUI (== CUI + CUD).
+	 *
+	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
+	 * to relog the CUI.
+	 *
+	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
+	 *
+	 * tx4: Roll again, need space for an EFD.
+	 *
+	 * If the extent referenced by the pair of BUI/CUI items is not the one
+	 * being currently processed, then we need to reserve space to relog
+	 * both items.
+	 */
+	const unsigned int	tx0 = bui + cui;
+	const unsigned int	tx1 = bud + rui + cui + cud;
+	const unsigned int	tx2 = rud + cui + cud;
+	const unsigned int	tx3 = cud + efi;
+	const unsigned int	tx4 = efd;
+	const unsigned int	relog = bui + bud + cui + cud;
+
+	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
+						 max3(tx3, tx4, relog));
+
+	/* Overhead to finish one step of each intent item type */
+	const unsigned int	f1 = xfs_calc_finish_efi_reservation(mp, 1);
+	const unsigned int	f2 = xfs_calc_finish_rui_reservation(mp, 1);
+	const unsigned int	f3 = xfs_calc_finish_cui_reservation(mp, 1);
+	const unsigned int	f4 = xfs_calc_finish_bui_reservation(mp, 1);
+
+	/* We only finish one item per transaction in a chain */
+	*step_size = max(f4, max3(f1, f2, f3));
+
+	return per_intent;
+}
+
+/*
+ * Compute the maximum size (in fsblocks) of atomic writes that we can complete
+ * given the existing log reservations.
+ */
+xfs_extlen_t
+xfs_calc_max_atomic_write_fsblocks(
+	struct xfs_mount		*mp)
+{
+	const struct xfs_trans_res	*resv = &M_RES(mp)->tr_atomic_ioend;
+	unsigned int			per_intent, step_size;
+	unsigned int			ret = 0;
+
+	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
+
+	if (resv->tr_logres >= step_size)
+		ret = (resv->tr_logres - step_size) / per_intent;
+
+	trace_xfs_calc_max_atomic_write_fsblocks(mp, per_intent, step_size,
+			resv->tr_logres, ret);
+
+	return ret;
+}
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 670045d417a6..a6d303b83688 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -121,4 +121,6 @@ unsigned int xfs_calc_itruncate_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
 
+xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
+
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 00b53f479ece..860fc3c91fd5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -666,6 +666,79 @@ xfs_agbtree_compute_maxlevels(
 	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
 }
 
+static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
+{
+	return 1 << (ffs(nr) - 1);
+}
+
+static inline void
+xfs_compute_atomic_write_unit_max(
+	struct xfs_mount	*mp)
+{
+	struct xfs_groups	*ags = &mp->m_groups[XG_TYPE_AG];
+	struct xfs_groups	*rgs = &mp->m_groups[XG_TYPE_RTG];
+
+	/* Maximum write IO size that the kernel allows. */
+	const unsigned int	max_write =
+		rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
+
+	/*
+	 * Maximum atomic write ioend that we can handle.  The atomic write
+	 * fallback requires reflink to handle an out of place write, so we
+	 * don't support atomic writes at all unless reflink is enabled.
+	 */
+	const unsigned int	max_ioend = xfs_reflink_max_atomic_cow(mp);
+
+	unsigned int		max_agsize;
+	unsigned int		max_rgsize;
+
+	/*
+	 * If the data device advertises atomic write support, limit the size
+	 * of data device atomic writes to the greatest power-of-two factor of
+	 * the AG size so that every atomic write unit aligns with the start
+	 * of every AG.  This is required so that the per-AG allocations for an
+	 * atomic write will always be aligned compatibly with the alignment
+	 * requirements of the storage.
+	 *
+	 * If the data device doesn't advertise atomic writes, then there are
+	 * no alignment restrictions and the largest out-of-place write we can
+	 * do ourselves is the number of blocks that user files can allocate
+	 * from any AG.
+	 */
+
+	if (mp->m_ddev_targp->bt_bdev_awu_min > 0)
+		max_agsize = max_pow_of_two_factor(mp->m_sb.sb_agblocks);
+	else
+		max_agsize = mp->m_ag_max_usable;
+
+	/*
+	 * Reflink on the realtime device requires rtgroups and rt reflink
+	 * requires rtgroups.
+	 *
+	 * If the realtime device advertises atomic write support, limit the
+	 * size of data device atomic writes to the greatest power-of-two
+	 * factor of the rtgroup size so that every atomic write unit aligns
+	 * with the start of every rtgroup.  This is required so that the
+	 * per-rtgroup allocations for an atomic write will always be aligned
+	 * compatibly with the alignment requirements of the storage.
+	 *
+	 * If the rt device doesn't advertise atomic writes, then there are
+	 * no alignment restrictions and the largest out-of-place write we can
+	 * do ourselves is the number of blocks that user files can allocate
+	 * from any rtgroup.
+	 */
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev_awu_min > 0)
+		max_rgsize = max_pow_of_two_factor(rgs->blocks);
+	else
+		max_rgsize = rgs->blocks;
+
+	ags->awu_max = min3(max_write, max_ioend, max_agsize);
+	rgs->awu_max = min3(max_write, max_ioend, max_rgsize);
+
+	trace_xfs_compute_atomic_write_unit_max(mp, max_write, max_ioend,
+			max_agsize, max_rgsize);
+}
+
 /* Compute maximum possible height for realtime btree types for this fs. */
 static inline void
 xfs_rtbtree_compute_maxlevels(
@@ -1082,6 +1155,13 @@ xfs_mountfs(
 		xfs_zone_gc_start(mp);
 	}
 
+	/*
+	 * Pre-calculate atomic write unit max.  This involves computations
+	 * derived from transaction reservations, so we must do this after the
+	 * log is fully initialized.
+	 */
+	xfs_compute_atomic_write_unit_max(mp);
+
 	return 0;
 
  out_agresv:
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 799b84220ebb..c0eff3adfa31 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -119,6 +119,12 @@ struct xfs_groups {
 	 * SMR hard drives.
 	 */
 	xfs_fsblock_t		start_fsb;
+
+	/*
+	 * Maximum length of an atomic write for files stored in this
+	 * collection of allocation groups, in fsblocks.
+	 */
+	xfs_extlen_t		awu_max;
 };
 
 struct xfs_freecounter {
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 218dee76768b..eff560f284ab 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1040,6 +1040,19 @@ xfs_reflink_end_atomic_cow(
 	return error;
 }
 
+/* Compute the largest atomic write that we can complete through software. */
+xfs_extlen_t
+xfs_reflink_max_atomic_cow(
+	struct xfs_mount	*mp)
+{
+	/* We cannot do any atomic writes without out of place writes. */
+	if (!xfs_has_reflink(mp))
+		return 0;
+
+	/* atomic write limits are always a power-of-2 */
+	return rounddown_pow_of_two(xfs_calc_max_atomic_write_fsblocks(mp));
+}
+
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
  * metadata.  The ondisk metadata does not track which inode created the
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 412e9b6f2082..36cda724da89 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -68,4 +68,6 @@ extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
 
 bool xfs_reflink_supports_rextsize(struct xfs_mount *mp, unsigned int rextsize);
 
+xfs_extlen_t xfs_reflink_max_atomic_cow(struct xfs_mount *mp);
+
 #endif /* __XFS_REFLINK_H */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9554578c6da4..24d73e9bbe83 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -170,6 +170,66 @@ DEFINE_ATTR_LIST_EVENT(xfs_attr_list_notfound);
 DEFINE_ATTR_LIST_EVENT(xfs_attr_leaf_list);
 DEFINE_ATTR_LIST_EVENT(xfs_attr_node_list);
 
+TRACE_EVENT(xfs_compute_atomic_write_unit_max,
+	TP_PROTO(struct xfs_mount *mp, unsigned int max_write,
+		 unsigned int max_ioend, unsigned int max_agsize,
+		 unsigned int max_rgsize),
+	TP_ARGS(mp, max_write, max_ioend, max_agsize, max_rgsize),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, max_write)
+		__field(unsigned int, max_ioend)
+		__field(unsigned int, max_agsize)
+		__field(unsigned int, max_rgsize)
+		__field(unsigned int, data_awu_max)
+		__field(unsigned int, rt_awu_max)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->max_write = max_write;
+		__entry->max_ioend = max_ioend;
+		__entry->max_agsize = max_agsize;
+		__entry->max_rgsize = max_rgsize;
+		__entry->data_awu_max = mp->m_groups[XG_TYPE_AG].awu_max;
+		__entry->rt_awu_max = mp->m_groups[XG_TYPE_RTG].awu_max;
+	),
+	TP_printk("dev %d:%d max_write %u max_ioend %u max_agsize %u max_rgsize %u data_awu_max %u rt_awu_max %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->max_write,
+		  __entry->max_ioend,
+		  __entry->max_agsize,
+		  __entry->max_rgsize,
+		  __entry->data_awu_max,
+		  __entry->rt_awu_max)
+);
+
+TRACE_EVENT(xfs_calc_max_atomic_write_fsblocks,
+	TP_PROTO(struct xfs_mount *mp, unsigned int per_intent,
+		 unsigned int step_size, unsigned int logres,
+		 unsigned int blockcount),
+	TP_ARGS(mp, per_intent, step_size, logres, blockcount),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, per_intent)
+		__field(unsigned int, step_size)
+		__field(unsigned int, logres)
+		__field(unsigned int, blockcount)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->per_intent = per_intent;
+		__entry->step_size = step_size;
+		__entry->logres = logres;
+		__entry->blockcount = blockcount;
+	),
+	TP_printk("dev %d:%d per_intent %u step_size %u logres %u blockcount %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->per_intent,
+		  __entry->step_size,
+		  __entry->logres,
+		  __entry->blockcount)
+);
+
 TRACE_EVENT(xlog_intent_recovery_failed,
 	TP_PROTO(struct xfs_mount *mp, const struct xfs_defer_op_type *ops,
 		 int error),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 13/14] xfs: update atomic write limits
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (11 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max() John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 16:26   ` Darrick J. Wong
  2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
  13 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

Update the limits returned from xfs_get_atomic_write_{min, max, max_opt}().

No reflink support always means no CoW-based atomic writes.

For updating xfs_get_atomic_write_min(), we support the blocksize only, and
that depends on HW or reflink support.

For updating xfs_get_atomic_write_max(), with no reflink we are limited to
the blocksize, but only if there is HW support. Otherwise we are limited to
the combined limit in mp->m_atomic_write_unit_max.

For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by
the bdev atomic write limit. If xfs_get_atomic_write_max() does not report
 > 1x blocksize, then just continue to report 0 as before.
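
As a purely illustrative example of the above: on a reflink filesystem with a
4KB block size, a computed awu_max of 16MB, and a bdev atomic write limit of
64KB, we would report min = 4KB, max = 16MB and max_opt = 64KB. On the same
filesystem without reflink but with bdev atomic write support, we would
report min = max = 4KB and max_opt = 0.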

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c |  2 +-
 fs/xfs/xfs_iops.c | 53 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 81a377f65aa3..d1ddbc4a98c3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1557,7 +1557,7 @@ xfs_file_open(
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
-	if (xfs_inode_can_hw_atomicwrite(XFS_I(inode)))
+	if (xfs_get_atomic_write_min(XFS_I(inode)))
 		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 	return generic_file_open(inode, file);
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3b5aa39dbfe9..183524d06bc3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -605,27 +605,68 @@ unsigned int
 xfs_get_atomic_write_min(
 	struct xfs_inode	*ip)
 {
-	if (!xfs_inode_can_hw_atomicwrite(ip))
-		return 0;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	/*
+	 * If we can complete an atomic write via atomic out of place writes,
+	 * then advertise a minimum size of one fsblock.  Without this
+	 * mechanism, we can only guarantee atomic writes up to a single LBA.
+	 *
+	 * If out of place writes are not available, we can guarantee an atomic
+	 * write of exactly one single fsblock if the bdev will make that
+	 * guarantee for us.
+	 */
+	if (xfs_inode_can_hw_atomicwrite(ip) || xfs_has_reflink(mp))
+		return mp->m_sb.sb_blocksize;
 
-	return ip->i_mount->m_sb.sb_blocksize;
+	return 0;
 }
 
 unsigned int
 xfs_get_atomic_write_max(
 	struct xfs_inode	*ip)
 {
-	if (!xfs_inode_can_hw_atomicwrite(ip))
+	struct xfs_mount	*mp = ip->i_mount;
+
+	/*
+	 * If out of place writes are not available, we can guarantee an atomic
+	 * write of exactly one single fsblock if the bdev will make that
+	 * guarantee for us.
+	 */
+	if (!xfs_has_reflink(mp)) {
+		if (xfs_inode_can_hw_atomicwrite(ip))
+			return mp->m_sb.sb_blocksize;
 		return 0;
+	}
 
-	return ip->i_mount->m_sb.sb_blocksize;
+	/*
+	 * If we can complete an atomic write via atomic out of place writes,
+	 * then advertise a maximum size of whatever we can complete through
+	 * that means.  Hardware support is reported via max_opt, not here.
+	 */
+	if (XFS_IS_REALTIME_INODE(ip))
+		return XFS_FSB_TO_B(mp, mp->m_groups[XG_TYPE_RTG].awu_max);
+	return XFS_FSB_TO_B(mp, mp->m_groups[XG_TYPE_AG].awu_max);
 }
 
 unsigned int
 xfs_get_atomic_write_max_opt(
 	struct xfs_inode	*ip)
 {
-	return 0;
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	unsigned int		awu_max = xfs_get_atomic_write_max(ip);
+
+	/* if the max is 1x block, then just keep behaviour that opt is 0 */
+	if (awu_max <= ip->i_mount->m_sb.sb_blocksize)
+		return 0;
+
+	/*
+	 * Advertise the maximum size of an atomic write that we can tell the
+	 * block device to perform for us.  In general the bdev limit will be
+	 * less than our out of place write limit, but we don't want to exceed
+	 * the awu_max.
+	 */
+	return min(awu_max, target->bt_bdev_awu_max);
 }
 
 static void
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
                   ` (12 preceding siblings ...)
  2025-04-15 12:14 ` [PATCH v7 13/14] xfs: update atomic write limits John Garry
@ 2025-04-15 12:14 ` John Garry
  2025-04-15 15:35   ` Randy Dunlap
                     ` (2 more replies)
  13 siblings, 3 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 12:14 UTC (permalink / raw)
  To: brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  When this option is set, we dynamically recompute the
tr_atomic_ioend transaction reservation based on the given size, and then
check that we don't violate any of the minimum log size
constraints.

The actual software atomic write max is still computed based off of
tr_atomic_ioend in the same way it has been for the past few commits.
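
As an example of the intended usage (hypothetical device and mount point), a
sysadmin could cap atomic writes at 64KB with:

  mount -o max_atomic_write=64k /dev/sda1 /mnt/test

The option can also be changed on remount, subject to the same minimum log
size check.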

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/admin-guide/xfs.rst |  8 +++++
 fs/xfs/libxfs/xfs_trans_resv.c    | 54 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h    |  1 +
 fs/xfs/xfs_mount.c                |  8 ++++-
 fs/xfs/xfs_mount.h                |  5 +++
 fs/xfs/xfs_super.c                | 28 +++++++++++++++-
 fs/xfs/xfs_trace.h                | 33 +++++++++++++++++++
 7 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index b67772cf36d6..715019ec4f24 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -143,6 +143,14 @@ When mounting an XFS filesystem, the following options are accepted.
 	optional, and the log section can be separate from the data
 	section or contained within it.
 
+  max_atomic_write=value
+	Set the maximum size of an atomic write.  The size may be
+	specified in bytes, in kilobytes with a "k" suffix, in megabytes
+	with a "m" suffix, or in gigabytes with a "g" suffix.
+
+	The default value is to set the maximum io completion size
+	to allow each CPU to handle one at a time.
+
   noalign
 	Data allocations will not be aligned at stripe unit
 	boundaries. This is only relevant to filesystems created
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index f530aa5d72f5..36e47ec3c3c2 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -1475,3 +1475,57 @@ xfs_calc_max_atomic_write_fsblocks(
 
 	return ret;
 }
+
+/*
+ * Compute the log reservation needed to complete an atomic write of a given
+ * number of blocks.  Worst case, each block requires separate handling.
+ * Returns true if the blockcount is supported, false otherwise.
+ */
+bool
+xfs_calc_atomic_write_reservation(
+	struct xfs_mount	*mp,
+	int			bytes)
+{
+	struct xfs_trans_res	*curr_res = &M_RES(mp)->tr_atomic_ioend;
+	unsigned int		per_intent, step_size;
+	unsigned int		logres;
+	xfs_extlen_t		blockcount = XFS_B_TO_FSBT(mp, bytes);
+	uint			old_logres =
+		M_RES(mp)->tr_atomic_ioend.tr_logres;
+	int			min_logblocks;
+
+	/*
+	 * If the caller doesn't ask for a specific atomic write size, then
+	 * we'll conservatively use tr_itruncate as the basis for computing
+	 * a reasonable maximum.
+	 */
+	if (blockcount == 0) {
+		curr_res->tr_logres = M_RES(mp)->tr_itruncate.tr_logres;
+		return true;
+	}
+
+	/* Untorn write completions require out of place write remapping */
+	if (!xfs_has_reflink(mp))
+		return false;
+
+	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
+
+	if (check_mul_overflow(blockcount, per_intent, &logres))
+		return false;
+	if (check_add_overflow(logres, step_size, &logres))
+		return false;
+
+	curr_res->tr_logres = logres;
+	min_logblocks = xfs_log_calc_minimum_size(mp);
+
+	trace_xfs_calc_max_atomic_write_reservation(mp, per_intent, step_size,
+			blockcount, min_logblocks, curr_res->tr_logres);
+
+	if (min_logblocks > mp->m_sb.sb_logblocks) {
+		/* Log too small, revert changes. */
+		curr_res->tr_logres = old_logres;
+		return false;
+	}
+
+	return true;
+}
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index a6d303b83688..af974f920556 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -122,5 +122,6 @@ unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
+bool xfs_calc_atomic_write_reservation(struct xfs_mount *mp, int bytes);
 
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 860fc3c91fd5..b8dd9e956c2a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -671,7 +671,7 @@ static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
 	return 1 << (ffs(nr) - 1);
 }
 
-static inline void
+void
 xfs_compute_atomic_write_unit_max(
 	struct xfs_mount	*mp)
 {
@@ -1160,6 +1160,12 @@ xfs_mountfs(
 	 * derived from transaction reservations, so we must do this after the
 	 * log is fully initialized.
 	 */
+	if (!xfs_calc_atomic_write_reservation(mp, mp->m_awu_max_bytes)) {
+		xfs_warn(mp, "cannot support atomic writes of %u bytes",
+				mp->m_awu_max_bytes);
+		error = -EINVAL;
+		goto out_agresv;
+	}
 	xfs_compute_atomic_write_unit_max(mp);
 
 	return 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c0eff3adfa31..a5037db4ecff 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -236,6 +236,9 @@ typedef struct xfs_mount {
 	bool			m_update_sb;	/* sb needs update in mount */
 	unsigned int		m_max_open_zones;
 
+	/* max_atomic_write mount option value */
+	unsigned int		m_awu_max_bytes;
+
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
 	 * Callers must hold m_sb_lock to access these two fields.
@@ -798,4 +801,6 @@ static inline void xfs_mod_sb_delalloc(struct xfs_mount *mp, int64_t delta)
 	percpu_counter_add(&mp->m_delalloc_blks, delta);
 }
 
+void xfs_compute_atomic_write_unit_max(struct xfs_mount *mp);
+
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b2dd0c0bf509..f7849052e5ff 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -111,7 +111,7 @@ enum {
 	Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
 	Opt_discard, Opt_nodiscard, Opt_dax, Opt_dax_enum, Opt_max_open_zones,
-	Opt_lifetime, Opt_nolifetime,
+	Opt_lifetime, Opt_nolifetime, Opt_max_atomic_write,
 };
 
 static const struct fs_parameter_spec xfs_fs_parameters[] = {
@@ -159,6 +159,7 @@ static const struct fs_parameter_spec xfs_fs_parameters[] = {
 	fsparam_u32("max_open_zones",	Opt_max_open_zones),
 	fsparam_flag("lifetime",	Opt_lifetime),
 	fsparam_flag("nolifetime",	Opt_nolifetime),
+	fsparam_string("max_atomic_write",	Opt_max_atomic_write),
 	{}
 };
 
@@ -241,6 +242,9 @@ xfs_fs_show_options(
 
 	if (mp->m_max_open_zones)
 		seq_printf(m, ",max_open_zones=%u", mp->m_max_open_zones);
+	if (mp->m_awu_max_bytes)
+		seq_printf(m, ",max_atomic_write=%uk",
+				mp->m_awu_max_bytes >> 10);
 
 	return 0;
 }
@@ -1518,6 +1522,13 @@ xfs_fs_parse_param(
 	case Opt_nolifetime:
 		parsing_mp->m_features |= XFS_FEAT_NOLIFETIME;
 		return 0;
+	case Opt_max_atomic_write:
+		if (suffix_kstrtoint(param->string, 10,
+				     &parsing_mp->m_awu_max_bytes))
+			return -EINVAL;
+		if (parsing_mp->m_awu_max_bytes < 0)
+			return -EINVAL;
+		return 0;
 	default:
 		xfs_warn(parsing_mp, "unknown mount option [%s].", param->key);
 		return -EINVAL;
@@ -2114,6 +2125,16 @@ xfs_fs_reconfigure(
 	if (error)
 		return error;
 
+	/* validate new max_atomic_write option before making other changes */
+	if (mp->m_awu_max_bytes != new_mp->m_awu_max_bytes) {
+		if (!xfs_calc_atomic_write_reservation(mp,
+					new_mp->m_awu_max_bytes)) {
+			xfs_warn(mp, "cannot support atomic writes of %u bytes",
+					new_mp->m_awu_max_bytes);
+			return -EINVAL;
+		}
+	}
+
 	/* inode32 -> inode64 */
 	if (xfs_has_small_inums(mp) && !xfs_has_small_inums(new_mp)) {
 		mp->m_features &= ~XFS_FEAT_SMALL_INUMS;
@@ -2140,6 +2161,11 @@ xfs_fs_reconfigure(
 			return error;
 	}
 
+	/* set new atomic write max here */
+	if (mp->m_awu_max_bytes != new_mp->m_awu_max_bytes) {
+		xfs_compute_atomic_write_unit_max(mp);
+		mp->m_awu_max_bytes = new_mp->m_awu_max_bytes;
+	}
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 24d73e9bbe83..d41885f1efe2 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -230,6 +230,39 @@ TRACE_EVENT(xfs_calc_max_atomic_write_fsblocks,
 		  __entry->blockcount)
 );
 
+TRACE_EVENT(xfs_calc_max_atomic_write_reservation,
+	TP_PROTO(struct xfs_mount *mp, unsigned int per_intent,
+		 unsigned int step_size, unsigned int blockcount,
+		 unsigned int min_logblocks, unsigned int logres),
+	TP_ARGS(mp, per_intent, step_size, blockcount, min_logblocks, logres),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, per_intent)
+		__field(unsigned int, step_size)
+		__field(unsigned int, blockcount)
+		__field(unsigned int, min_logblocks)
+		__field(unsigned int, cur_logblocks)
+		__field(unsigned int, logres)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->per_intent = per_intent;
+		__entry->step_size = step_size;
+		__entry->blockcount = blockcount;
+		__entry->min_logblocks = min_logblocks;
+		__entry->cur_logblocks = mp->m_sb.sb_logblocks;
+		__entry->logres = logres;
+	),
+	TP_printk("dev %d:%d per_intent %u step_size %u blockcount %u min_logblocks %u logblocks %u logres %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->per_intent,
+		  __entry->step_size,
+		  __entry->blockcount,
+		  __entry->min_logblocks,
+		  __entry->cur_logblocks,
+		  __entry->logres)
+);
+
 TRACE_EVENT(xlog_intent_recovery_failed,
 	TP_PROTO(struct xfs_mount *mp, const struct xfs_defer_op_type *ops,
 		 int error),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
@ 2025-04-15 15:35   ` Randy Dunlap
  2025-04-15 16:55   ` Darrick J. Wong
  2025-04-15 22:36   ` [PATCH v7.1 " Darrick J. Wong
  2 siblings, 0 replies; 40+ messages in thread
From: Randy Dunlap @ 2025-04-15 15:35 UTC (permalink / raw)
  To: John Garry, brauner, djwong, hch, viro, jack, cem
  Cc: linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api



On 4/15/25 5:14 AM, John Garry wrote:
> diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
> index b67772cf36d6..715019ec4f24 100644
> --- a/Documentation/admin-guide/xfs.rst
> +++ b/Documentation/admin-guide/xfs.rst
> @@ -143,6 +143,14 @@ When mounting an XFS filesystem, the following options are accepted.
>  	optional, and the log section can be separate from the data
>  	section or contained within it.
>  
> +  max_atomic_write=value
> +	Set the maximum size of an atomic write.  The size may be
> +	specified in bytes, in kilobytes with a "k" suffix, in megabytes
> +	with a "m" suffix, or in gigabytes with a "g" suffix.
> +
> +	The default value is to set the maximum io completion size

Preferably                                      I/O or IO

> +	to allow each CPU to handle one at a time.
> +

-- 
~Randy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max()
  2025-04-15 12:14 ` [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max() John Garry
@ 2025-04-15 16:25   ` Darrick J. Wong
  2025-04-15 16:35     ` John Garry
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 16:25 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 12:14:23PM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write for the data device.
> 
> The limit of a CoW-based atomic write will be the limit of the number of
> logitems which can fit into a single transaction.
> 
> In addition, the max atomic write size needs to be aligned to the agsize.
> Limit the size of atomic writes to the greatest power-of-two factor of the
> agsize so that allocations for an atomic write will always be aligned
> compatibly with the alignment requirements of the storage.
> 
> Function xfs_atomic_write_logitems() is added to find the limit the number
> of log items which can fit in a single transaction.
> 
> Amend the max atomic write computation to create a new transaction
> reservation type, and compute the maximum size of an atomic write
> completion (in fsblocks) based on this new transaction reservation.
> Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
> reasonable level of parallelism.  In the next patch, we'll add a mount
> option so that sysadmins can configure their own limits.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> [djwong: use a new reservation type for atomic write ioends]

There should be a
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
underneath this line.

> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 90 ++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h |  2 +
>  fs/xfs/xfs_mount.c             | 80 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h             |  6 +++
>  fs/xfs/xfs_reflink.c           | 13 +++++
>  fs/xfs/xfs_reflink.h           |  2 +
>  fs/xfs/xfs_trace.h             | 60 +++++++++++++++++++++++
>  7 files changed, 253 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 797eb6a41e9b..f530aa5d72f5 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -22,6 +22,12 @@
>  #include "xfs_rtbitmap.h"
>  #include "xfs_attr_item.h"
>  #include "xfs_log.h"
> +#include "xfs_defer.h"
> +#include "xfs_bmap_item.h"
> +#include "xfs_extfree_item.h"
> +#include "xfs_rmap_item.h"
> +#include "xfs_refcount_item.h"
> +#include "xfs_trace.h"
>  
>  #define _ALLOC	true
>  #define _FREE	false
> @@ -1385,3 +1391,87 @@ xfs_trans_resv_calc(
>  	 */
>  	resp->tr_atomic_ioend = resp->tr_itruncate;
>  }
> +
> +/*
> + * Return the per-extent and fixed transaction reservation sizes needed to
> + * complete an atomic write.
> + */
> +STATIC unsigned int
> +xfs_calc_atomic_write_ioend_geometry(
> +	struct xfs_mount	*mp,
> +	unsigned int		*step_size)
> +{
> +	const unsigned int	efi = xfs_efi_log_space(1);
> +	const unsigned int	efd = xfs_efd_log_space(1);
> +	const unsigned int	rui = xfs_rui_log_space(1);
> +	const unsigned int	rud = xfs_rud_log_space();
> +	const unsigned int	cui = xfs_cui_log_space(1);
> +	const unsigned int	cud = xfs_cud_log_space();
> +	const unsigned int	bui = xfs_bui_log_space(1);
> +	const unsigned int	bud = xfs_bud_log_space();
> +
> +	/*
> +	 * Maximum overhead to complete an atomic write ioend in software:
> +	 * remove data fork extent + remove cow fork extent + map extent into
> +	 * data fork.
> +	 *
> +	 * tx0: Creates a BUI and a CUI and that's all it needs.
> +	 *
> +	 * tx1: Roll to finish the BUI.  Need space for the BUD, an RUI, and
> +	 * enough space to relog the CUI (== CUI + CUD).
> +	 *
> +	 * tx2: Roll again to finish the RUI.  Need space for the RUD and space
> +	 * to relog the CUI.
> +	 *
> +	 * tx3: Roll again, need space for the CUD and possibly a new EFI.
> +	 *
> +	 * tx4: Roll again, need space for an EFD.
> +	 *
> +	 * If the extent referenced by the pair of BUI/CUI items is not the one
> +	 * being currently processed, then we need to reserve space to relog
> +	 * both items.
> +	 */
> +	const unsigned int	tx0 = bui + cui;
> +	const unsigned int	tx1 = bud + rui + cui + cud;
> +	const unsigned int	tx2 = rud + cui + cud;
> +	const unsigned int	tx3 = cud + efi;
> +	const unsigned int	tx4 = efd;
> +	const unsigned int	relog = bui + bud + cui + cud;
> +
> +	const unsigned int	per_intent = max(max3(tx0, tx1, tx2),
> +						 max3(tx3, tx4, relog));
> +
> +	/* Overhead to finish one step of each intent item type */
> +	const unsigned int	f1 = xfs_calc_finish_efi_reservation(mp, 1);
> +	const unsigned int	f2 = xfs_calc_finish_rui_reservation(mp, 1);
> +	const unsigned int	f3 = xfs_calc_finish_cui_reservation(mp, 1);
> +	const unsigned int	f4 = xfs_calc_finish_bui_reservation(mp, 1);
> +
> +	/* We only finish one item per transaction in a chain */
> +	*step_size = max(f4, max3(f1, f2, f3));
> +
> +	return per_intent;
> +}
> +
> +/*
> + * Compute the maximum size (in fsblocks) of atomic writes that we can complete
> + * given the existing log reservations.
> + */
> +xfs_extlen_t
> +xfs_calc_max_atomic_write_fsblocks(
> +	struct xfs_mount		*mp)
> +{
> +	const struct xfs_trans_res	*resv = &M_RES(mp)->tr_atomic_ioend;
> +	unsigned int			per_intent, step_size;
> +	unsigned int			ret = 0;
> +
> +	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
> +
> +	if (resv->tr_logres >= step_size)
> +		ret = (resv->tr_logres - step_size) / per_intent;
> +
> +	trace_xfs_calc_max_atomic_write_fsblocks(mp, per_intent, step_size,
> +			resv->tr_logres, ret);
> +
> +	return ret;
> +}
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 670045d417a6..a6d303b83688 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -121,4 +121,6 @@ unsigned int xfs_calc_itruncate_reservation_minlogsize(struct xfs_mount *mp);
>  unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
>  unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
>  
> +xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_TRANS_RESV_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 00b53f479ece..860fc3c91fd5 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -666,6 +666,79 @@ xfs_agbtree_compute_maxlevels(
>  	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>  }
>  
> +static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
> +{
> +	return 1 << (ffs(nr) - 1);
> +}
> +
> +static inline void
> +xfs_compute_atomic_write_unit_max(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_groups	*ags = &mp->m_groups[XG_TYPE_AG];
> +	struct xfs_groups	*rgs = &mp->m_groups[XG_TYPE_RTG];
> +
> +	/* Maximum write IO size that the kernel allows. */
> +	const unsigned int	max_write =
> +		rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
> +
> +	/*
> +	 * Maximum atomic write ioend that we can handle.  The atomic write
> +	 * fallback requires reflink to handle an out of place write, so we
> +	 * don't support atomic writes at all unless reflink is enabled.
> +	 */
> +	const unsigned int	max_ioend = xfs_reflink_max_atomic_cow(mp);
> +
> +	unsigned int		max_agsize;
> +	unsigned int		max_rgsize;
> +
> +	/*
> +	 * If the data device advertises atomic write support, limit the size
> +	 * of data device atomic writes to the greatest power-of-two factor of
> +	 * the AG size so that every atomic write unit aligns with the start
> +	 * of every AG.  This is required so that the per-AG allocations for an
> +	 * atomic write will always be aligned compatibly with the alignment
> +	 * requirements of the storage.
> +	 *
> +	 * If the data device doesn't advertise atomic writes, then there are
> +	 * no alignment restrictions and the largest out-of-place write we can
> +	 * do ourselves is the number of blocks that user files can allocate
> +	 * from any AG.
> +	 */
> +
> +	if (mp->m_ddev_targp->bt_bdev_awu_min > 0)

Just to pick nits with my own code, there doesn't need to be a blank
line between the comment and the if test.

> +		max_agsize = max_pow_of_two_factor(mp->m_sb.sb_agblocks);
> +	else
> +		max_agsize = mp->m_ag_max_usable;
> +
> +	/*
> +	 * Reflink on the realtime device requires rtgroups and rt reflink
> +	 * requires rtgroups.

And this should be shortened to "Reflink on the realtime device requires
rtgroups."

--D

> +	 *
> +	 * If the realtime device advertises atomic write support, limit the
> +	 * size of data device atomic writes to the greatest power-of-two
> +	 * factor of the rtgroup size so that every atomic write unit aligns
> +	 * with the start of every rtgroup.  This is required so that the
> +	 * per-rtgroup allocations for an atomic write will always be aligned
> +	 * compatibly with the alignment requirements of the storage.
> +	 *
> +	 * If the rt device doesn't advertise atomic writes, then there are
> +	 * no alignment restrictions and the largest out-of-place write we can
> +	 * do ourselves is the number of blocks that user files can allocate
> +	 * from any rtgroup.
> +	 */
> +	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev_awu_min > 0)
> +		max_rgsize = max_pow_of_two_factor(rgs->blocks);
> +	else
> +		max_rgsize = rgs->blocks;
> +
> +	ags->awu_max = min3(max_write, max_ioend, max_agsize);
> +	rgs->awu_max = min3(max_write, max_ioend, max_rgsize);
> +
> +	trace_xfs_compute_atomic_write_unit_max(mp, max_write, max_ioend,
> +			max_agsize, max_rgsize);
> +}
> +
>  /* Compute maximum possible height for realtime btree types for this fs. */
>  static inline void
>  xfs_rtbtree_compute_maxlevels(
> @@ -1082,6 +1155,13 @@ xfs_mountfs(
>  		xfs_zone_gc_start(mp);
>  	}
>  
> +	/*
> +	 * Pre-calculate atomic write unit max.  This involves computations
> +	 * derived from transaction reservations, so we must do this after the
> +	 * log is fully initialized.
> +	 */
> +	xfs_compute_atomic_write_unit_max(mp);
> +
>  	return 0;
>  
>   out_agresv:
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 799b84220ebb..c0eff3adfa31 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -119,6 +119,12 @@ struct xfs_groups {
>  	 * SMR hard drives.
>  	 */
>  	xfs_fsblock_t		start_fsb;
> +
> +	/*
> +	 * Maximum length of an atomic write for files stored in this
> +	 * collection of allocation groups, in fsblocks.
> +	 */
> +	xfs_extlen_t		awu_max;
>  };
>  
>  struct xfs_freecounter {
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 218dee76768b..eff560f284ab 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1040,6 +1040,19 @@ xfs_reflink_end_atomic_cow(
>  	return error;
>  }
>  
> +/* Compute the largest atomic write that we can complete through software. */
> +xfs_extlen_t
> +xfs_reflink_max_atomic_cow(
> +	struct xfs_mount	*mp)
> +{
> +	/* We cannot do any atomic writes without out of place writes. */
> +	if (!xfs_has_reflink(mp))
> +		return 0;
> +
> +	/* atomic write limits are always a power-of-2 */
> +	return rounddown_pow_of_two(xfs_calc_max_atomic_write_fsblocks(mp));
> +}
> +
>  /*
>   * Free all CoW staging blocks that are still referenced by the ondisk refcount
>   * metadata.  The ondisk metadata does not track which inode created the
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 412e9b6f2082..36cda724da89 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -68,4 +68,6 @@ extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
>  
>  bool xfs_reflink_supports_rextsize(struct xfs_mount *mp, unsigned int rextsize);
>  
> +xfs_extlen_t xfs_reflink_max_atomic_cow(struct xfs_mount *mp);
> +
>  #endif /* __XFS_REFLINK_H */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 9554578c6da4..24d73e9bbe83 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -170,6 +170,66 @@ DEFINE_ATTR_LIST_EVENT(xfs_attr_list_notfound);
>  DEFINE_ATTR_LIST_EVENT(xfs_attr_leaf_list);
>  DEFINE_ATTR_LIST_EVENT(xfs_attr_node_list);
>  
> +TRACE_EVENT(xfs_compute_atomic_write_unit_max,
> +	TP_PROTO(struct xfs_mount *mp, unsigned int max_write,
> +		 unsigned int max_ioend, unsigned int max_agsize,
> +		 unsigned int max_rgsize),
> +	TP_ARGS(mp, max_write, max_ioend, max_agsize, max_rgsize),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, max_write)
> +		__field(unsigned int, max_ioend)
> +		__field(unsigned int, max_agsize)
> +		__field(unsigned int, max_rgsize)
> +		__field(unsigned int, data_awu_max)
> +		__field(unsigned int, rt_awu_max)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->max_write = max_write;
> +		__entry->max_ioend = max_ioend;
> +		__entry->max_agsize = max_agsize;
> +		__entry->max_rgsize = max_rgsize;
> +		__entry->data_awu_max = mp->m_groups[XG_TYPE_AG].awu_max;
> +		__entry->rt_awu_max = mp->m_groups[XG_TYPE_RTG].awu_max;
> +	),
> +	TP_printk("dev %d:%d max_write %u max_ioend %u max_agsize %u max_rgsize %u data_awu_max %u rt_awu_max %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->max_write,
> +		  __entry->max_ioend,
> +		  __entry->max_agsize,
> +		  __entry->max_rgsize,
> +		  __entry->data_awu_max,
> +		  __entry->rt_awu_max)
> +);
> +
> +TRACE_EVENT(xfs_calc_max_atomic_write_fsblocks,
> +	TP_PROTO(struct xfs_mount *mp, unsigned int per_intent,
> +		 unsigned int step_size, unsigned int logres,
> +		 unsigned int blockcount),
> +	TP_ARGS(mp, per_intent, step_size, logres, blockcount),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, per_intent)
> +		__field(unsigned int, step_size)
> +		__field(unsigned int, logres)
> +		__field(unsigned int, blockcount)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->per_intent = per_intent;
> +		__entry->step_size = step_size;
> +		__entry->logres = logres;
> +		__entry->blockcount = blockcount;
> +	),
> +	TP_printk("dev %d:%d per_intent %u step_size %u logres %u blockcount %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->per_intent,
> +		  __entry->step_size,
> +		  __entry->logres,
> +		  __entry->blockcount)
> +);
> +
>  TRACE_EVENT(xlog_intent_recovery_failed,
>  	TP_PROTO(struct xfs_mount *mp, const struct xfs_defer_op_type *ops,
>  		 int error),
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 13/14] xfs: update atomic write limits
  2025-04-15 12:14 ` [PATCH v7 13/14] xfs: update atomic write limits John Garry
@ 2025-04-15 16:26   ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 16:26 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 12:14:24PM +0000, John Garry wrote:
> Update the limits returned from xfs_get_atomic_write_{min, max, max_opt}().
> 
> No reflink support always means no CoW-based atomic writes.
> 
> For updating xfs_get_atomic_write_min(), we support blocksize only and that
> depends on HW or reflink support.
> 
> For updating xfs_get_atomic_write_max(), with no reflink we are limited to
> blocksize, but only if there is HW support. Otherwise we are limited to the
> combined limit in mp->m_atomic_write_unit_max.
> 
> For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by
> the bdev atomic write limit. If xfs_get_atomic_write_max() does not report
>  > 1x blocksize, then just continue to report 0 as before.
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> [djwong: update comments in the helper functions]

Same here, there should be a
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

after this comment.

--D

> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c |  2 +-
>  fs/xfs/xfs_iops.c | 53 +++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 48 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 81a377f65aa3..d1ddbc4a98c3 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1557,7 +1557,7 @@ xfs_file_open(
>  	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
>  		return -EIO;
>  	file->f_mode |= FMODE_NOWAIT | FMODE_CAN_ODIRECT;
> -	if (xfs_inode_can_hw_atomicwrite(XFS_I(inode)))
> +	if (xfs_get_atomic_write_min(XFS_I(inode)))
>  		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
>  	return generic_file_open(inode, file);
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 3b5aa39dbfe9..183524d06bc3 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -605,27 +605,68 @@ unsigned int
>  xfs_get_atomic_write_min(
>  	struct xfs_inode	*ip)
>  {
> -	if (!xfs_inode_can_hw_atomicwrite(ip))
> -		return 0;
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	/*
> +	 * If we can complete an atomic write via atomic out of place writes,
> +	 * then advertise a minimum size of one fsblock.  Without this
> +	 * mechanism, we can only guarantee atomic writes up to a single LBA.
> +	 *
> +	 * If out of place writes are not available, we can guarantee an atomic
> +	 * write of exactly one single fsblock if the bdev will make that
> +	 * guarantee for us.
> +	 */
> +	if (xfs_inode_can_hw_atomicwrite(ip) || xfs_has_reflink(mp))
> +		return mp->m_sb.sb_blocksize;
>  
> -	return ip->i_mount->m_sb.sb_blocksize;
> +	return 0;
>  }
>  
>  unsigned int
>  xfs_get_atomic_write_max(
>  	struct xfs_inode	*ip)
>  {
> -	if (!xfs_inode_can_hw_atomicwrite(ip))
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	/*
> +	 * If out of place writes are not available, we can guarantee an atomic
> +	 * write of exactly one single fsblock if the bdev will make that
> +	 * guarantee for us.
> +	 */
> +	if (!xfs_has_reflink(mp)) {
> +		if (xfs_inode_can_hw_atomicwrite(ip))
> +			return mp->m_sb.sb_blocksize;
>  		return 0;
> +	}
>  
> -	return ip->i_mount->m_sb.sb_blocksize;
> +	/*
> +	 * If we can complete an atomic write via atomic out of place writes,
> +	 * then advertise a maximum size of whatever we can complete through
> +	 * that means.  Hardware support is reported via max_opt, not here.
> +	 */
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		return XFS_FSB_TO_B(mp, mp->m_groups[XG_TYPE_RTG].awu_max);
> +	return XFS_FSB_TO_B(mp, mp->m_groups[XG_TYPE_AG].awu_max);
>  }
>  
>  unsigned int
>  xfs_get_atomic_write_max_opt(
>  	struct xfs_inode	*ip)
>  {
> -	return 0;
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	unsigned int		awu_max = xfs_get_atomic_write_max(ip);
> +
> +	/* if the max is 1x block, then just keep behaviour that opt is 0 */
> +	if (awu_max <= ip->i_mount->m_sb.sb_blocksize)
> +		return 0;
> +
> +	/*
> +	 * Advertise the maximum size of an atomic write that we can tell the
> +	 * block device to perform for us.  In general the bdev limit will be
> +	 * less than our out of place write limit, but we don't want to exceed
> +	 * the awu_max.
> +	 */
> +	return min(awu_max, target->bt_bdev_awu_max);
>  }
>  
>  static void
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max()
  2025-04-15 16:25   ` Darrick J. Wong
@ 2025-04-15 16:35     ` John Garry
  2025-04-15 16:39       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-15 16:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On 15/04/2025 17:25, Darrick J. Wong wrote:
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> [djwong: use a new reservation type for atomic write ioends]
> There should be a
> Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
> underneath this line.

Fine, but then I think that I need to add my SoB tag again after that, 
since we have this history: I sent, you sent, I sent.

Maybe Co-developed-by would be better, but some don't like that tag...

thanks,
John

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max()
  2025-04-15 16:35     ` John Garry
@ 2025-04-15 16:39       ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 16:39 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 05:35:33PM +0100, John Garry wrote:
> On 15/04/2025 17:25, Darrick J. Wong wrote:
> > > Signed-off-by: John Garry<john.g.garry@oracle.com>
> > > [djwong: use a new reservation type for atomic write ioends]
> > There should be a
> > Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
> > underneath this line.
> 
> Fine, but then I think that I need to add my SoB tag again after that, since
> we have this history: I sent, you sent, I sent.

There's no need to duplicate trailers.  This is perfectly fine:

Signed-off-by: John Garry<john.g.garry@oracle.com>
[djwong: use a new reservation type for atomic write ioends]
Signed-off-by: "Darrick J. Wong"<djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

Remember, these tags only exist so that bughunters and lawyers can
figure out who to point their fingers at.

--D



> Maybe Co-developed-by would be better, but some don't like that tag...
> 
> thanks,
> John
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
  2025-04-15 15:35   ` Randy Dunlap
@ 2025-04-15 16:55   ` Darrick J. Wong
  2025-04-15 22:36   ` [PATCH v7.1 " Darrick J. Wong
  2 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 16:55 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 12:14:25PM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Introduce a mount option to allow sysadmins to specify the maximum size
> of an atomic write.  When this happens, we dynamically recompute the
> tr_atomic_write transaction reservation based on the given block size,
> and then check that we don't violate any of the minimum log size
> constraints.
> 
> The actual software atomic write max is still computed based off of
> tr_atomic the same way it has for the past few commits.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/admin-guide/xfs.rst |  8 +++++
>  fs/xfs/libxfs/xfs_trans_resv.c    | 54 +++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h    |  1 +
>  fs/xfs/xfs_mount.c                |  8 ++++-
>  fs/xfs/xfs_mount.h                |  5 +++
>  fs/xfs/xfs_super.c                | 28 +++++++++++++++-
>  fs/xfs/xfs_trace.h                | 33 +++++++++++++++++++
>  7 files changed, 135 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
> index b67772cf36d6..715019ec4f24 100644
> --- a/Documentation/admin-guide/xfs.rst
> +++ b/Documentation/admin-guide/xfs.rst
> @@ -143,6 +143,14 @@ When mounting an XFS filesystem, the following options are accepted.
>  	optional, and the log section can be separate from the data
>  	section or contained within it.
>  
> +  max_atomic_write=value
> +	Set the maximum size of an atomic write.  The size may be
> +	specified in bytes, in kilobytes with a "k" suffix, in megabytes
> +	with a "m" suffix, or in gigabytes with a "g" suffix.
> +
> +	The default value is to set the maximum io completion size
> +	to allow each CPU to handle one at a time.
> +
>    noalign
>  	Data allocations will not be aligned at stripe unit
>  	boundaries. This is only relevant to filesystems created
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index f530aa5d72f5..36e47ec3c3c2 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -1475,3 +1475,57 @@ xfs_calc_max_atomic_write_fsblocks(
>  
>  	return ret;
>  }
> +
> +/*
> + * Compute the log reservation needed to complete an atomic write of a given
> + * number of blocks.  Worst case, each block requires separate handling.
> + * Returns true if the blockcount is supported, false otherwise.
> + */
> +bool
> +xfs_calc_atomic_write_reservation(
> +	struct xfs_mount	*mp,
> +	int			bytes)

Hmm, the comment says this should be a block count, not a byte count.

	xfs_extlen_t		blockcount)

> +{
> +	struct xfs_trans_res	*curr_res = &M_RES(mp)->tr_atomic_ioend;
> +	unsigned int		per_intent, step_size;
> +	unsigned int		logres;
> +	xfs_extlen_t		blockcount = XFS_B_TO_FSBT(mp, bytes);
> +	uint			old_logres =
> +		M_RES(mp)->tr_atomic_ioend.tr_logres;
> +	int			min_logblocks;
> +
> +	/*
> +	 * If the caller doesn't ask for a specific atomic write size, then
> +	 * we'll conservatively use tr_itruncate as the basis for computing
> +	 * a reasonable maximum.
> +	 */
> +	if (blockcount == 0) {
> +		curr_res->tr_logres = M_RES(mp)->tr_itruncate.tr_logres;
> +		return true;
> +	}
> +
> +	/* Untorn write completions require out of place write remapping */
> +	if (!xfs_has_reflink(mp))
> +		return false;
> +
> +	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
> +
> +	if (check_mul_overflow(blockcount, per_intent, &logres))
> +		return false;
> +	if (check_add_overflow(logres, step_size, &logres))
> +		return false;
> +
> +	curr_res->tr_logres = logres;
> +	min_logblocks = xfs_log_calc_minimum_size(mp);
> +
> +	trace_xfs_calc_max_atomic_write_reservation(mp, per_intent, step_size,
> +			blockcount, min_logblocks, curr_res->tr_logres);
> +
> +	if (min_logblocks > mp->m_sb.sb_logblocks) {
> +		/* Log too small, revert changes. */
> +		curr_res->tr_logres = old_logres;
> +		return false;
> +	}
> +
> +	return true;
> +}
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index a6d303b83688..af974f920556 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -122,5 +122,6 @@ unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
>  unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
>  
>  xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
> +bool xfs_calc_atomic_write_reservation(struct xfs_mount *mp, int bytes);
>  
>  #endif	/* __XFS_TRANS_RESV_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 860fc3c91fd5..b8dd9e956c2a 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -671,7 +671,7 @@ static inline unsigned int max_pow_of_two_factor(const unsigned int nr)
>  	return 1 << (ffs(nr) - 1);
>  }
>  
> -static inline void
> +void
>  xfs_compute_atomic_write_unit_max(
>  	struct xfs_mount	*mp)
>  {
> @@ -1160,6 +1160,12 @@ xfs_mountfs(
>  	 * derived from transaction reservations, so we must do this after the
>  	 * log is fully initialized.
>  	 */
> +	if (!xfs_calc_atomic_write_reservation(mp, mp->m_awu_max_bytes)) {
> +		xfs_warn(mp, "cannot support atomic writes of %u bytes",
> +				mp->m_awu_max_bytes);
> +		error = -EINVAL;
> +		goto out_agresv;
> +	}

Hmm.  I don't think this is sufficient validation of m_awu_max_bytes.
xfs_compute_atomic_write_unit_max never allows an atomic write that is
larger than MAX_RW_COUNT or larger than the allocation group, because
it's not possible to land a single write larger than either of those
sizes.  The parsing code ignores values that aren't congruent with
an fsblock size, and suffix_kstrtoint gets confused if you feed it
values that are too large (like "2g").  I propose something like this:

/*
 * Try to set the atomic write maximum to a new value that we got from
 * userspace via mount option.
 */
int
xfs_set_max_atomic_write_opt(
	struct xfs_mount	*mp,
	unsigned long long	new_max_bytes)
{
	xfs_filblks_t		new_max_fsbs = XFS_B_TO_FSBT(mp, new_max_bytes);

	if (new_max_bytes) {
		xfs_extlen_t	max_write_fsbs =
			rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
		xfs_extlen_t	max_group_fsbs =
			max(mp->m_groups[XG_TYPE_AG].blocks,
			    mp->m_groups[XG_TYPE_RTG].blocks);

		ASSERT(max_write_fsbs <= U32_MAX);

		if (new_max_bytes % mp->m_sb.sb_blocksize > 0) {
			xfs_warn(mp,
 "max atomic write size of %llu bytes not aligned with fsblock",
					new_max_bytes);
			return -EINVAL;
		}

		if (new_max_fsbs > max_write_fsbs) {
			xfs_warn(mp,
 "max atomic write size of %lluk cannot be larger than max write size %lluk",
					new_max_bytes >> 10,
					XFS_FSB_TO_B(mp, max_write_fsbs) >> 10);
			return -EINVAL;
		}

		if (new_max_fsbs > max_group_fsbs) {
			xfs_warn(mp,
 "max atomic write size of %lluk cannot be larger than allocation group size %lluk",
					new_max_bytes >> 10,
					XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
			return -EINVAL;
		}
	}

	if (!xfs_calc_atomic_write_reservation(mp, new_max_fsbs)) {
		xfs_warn(mp,
 "cannot support completing atomic writes of %lluk",
				new_max_bytes >> 10);
		return -EINVAL;
	}

	xfs_compute_atomic_write_unit_max(mp);
	return 0;
}

and then we get some nicer error messages about what exactly failed
validation.

>  	xfs_compute_atomic_write_unit_max(mp);
>  
>  	return 0;
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c0eff3adfa31..a5037db4ecff 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -236,6 +236,9 @@ typedef struct xfs_mount {
>  	bool			m_update_sb;	/* sb needs update in mount */
>  	unsigned int		m_max_open_zones;
>  
> +	/* max_atomic_write mount option value */
> +	unsigned int		m_awu_max_bytes;
> +
>  	/*
>  	 * Bitsets of per-fs metadata that have been checked and/or are sick.
>  	 * Callers must hold m_sb_lock to access these two fields.
> @@ -798,4 +801,6 @@ static inline void xfs_mod_sb_delalloc(struct xfs_mount *mp, int64_t delta)
>  	percpu_counter_add(&mp->m_delalloc_blks, delta);
>  }
>  
> +void xfs_compute_atomic_write_unit_max(struct xfs_mount *mp);
> +
>  #endif	/* __XFS_MOUNT_H__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index b2dd0c0bf509..f7849052e5ff 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -111,7 +111,7 @@ enum {
>  	Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
>  	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
>  	Opt_discard, Opt_nodiscard, Opt_dax, Opt_dax_enum, Opt_max_open_zones,
> -	Opt_lifetime, Opt_nolifetime,
> +	Opt_lifetime, Opt_nolifetime, Opt_max_atomic_write,
>  };
>  
>  static const struct fs_parameter_spec xfs_fs_parameters[] = {
> @@ -159,6 +159,7 @@ static const struct fs_parameter_spec xfs_fs_parameters[] = {
>  	fsparam_u32("max_open_zones",	Opt_max_open_zones),
>  	fsparam_flag("lifetime",	Opt_lifetime),
>  	fsparam_flag("nolifetime",	Opt_nolifetime),
> +	fsparam_string("max_atomic_write",	Opt_max_atomic_write),
>  	{}
>  };
>  
> @@ -241,6 +242,9 @@ xfs_fs_show_options(
>  
>  	if (mp->m_max_open_zones)
>  		seq_printf(m, ",max_open_zones=%u", mp->m_max_open_zones);
> +	if (mp->m_awu_max_bytes)
> +		seq_printf(m, ",max_atomic_write=%uk",
> +				mp->m_awu_max_bytes >> 10);
>  
>  	return 0;
>  }
> @@ -1518,6 +1522,13 @@ xfs_fs_parse_param(
>  	case Opt_nolifetime:
>  		parsing_mp->m_features |= XFS_FEAT_NOLIFETIME;
>  		return 0;
> +	case Opt_max_atomic_write:
> +		if (suffix_kstrtoint(param->string, 10,
> +				     &parsing_mp->m_awu_max_bytes))

Let's replace this with a new suffix_kstrtoull helper that returns an
unsigned long long quantity that won't get confused.  This has the
unfortunate consequence that we have to burn a u64 in xfs_mount instead
of a u32.

--D

> +			return -EINVAL;
> +		if (parsing_mp->m_awu_max_bytes < 0)
> +			return -EINVAL;
> +		return 0;
>  	default:
>  		xfs_warn(parsing_mp, "unknown mount option [%s].", param->key);
>  		return -EINVAL;
> @@ -2114,6 +2125,16 @@ xfs_fs_reconfigure(
>  	if (error)
>  		return error;
>  
> +	/* validate new max_atomic_write option before making other changes */
> +	if (mp->m_awu_max_bytes != new_mp->m_awu_max_bytes) {
> +		if (!xfs_calc_atomic_write_reservation(mp,
> +					new_mp->m_awu_max_bytes)) {
> +			xfs_warn(mp, "cannot support atomic writes of %u bytes",
> +					new_mp->m_awu_max_bytes);
> +			return -EINVAL;
> +		}
> +	}
> +
>  	/* inode32 -> inode64 */
>  	if (xfs_has_small_inums(mp) && !xfs_has_small_inums(new_mp)) {
>  		mp->m_features &= ~XFS_FEAT_SMALL_INUMS;
> @@ -2140,6 +2161,11 @@ xfs_fs_reconfigure(
>  			return error;
>  	}
>  
> +	/* set new atomic write max here */
> +	if (mp->m_awu_max_bytes != new_mp->m_awu_max_bytes) {
> +		xfs_compute_atomic_write_unit_max(mp);
> +		mp->m_awu_max_bytes = new_mp->m_awu_max_bytes;
> +	}
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 24d73e9bbe83..d41885f1efe2 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -230,6 +230,39 @@ TRACE_EVENT(xfs_calc_max_atomic_write_fsblocks,
>  		  __entry->blockcount)
>  );
>  
> +TRACE_EVENT(xfs_calc_max_atomic_write_reservation,
> +	TP_PROTO(struct xfs_mount *mp, unsigned int per_intent,
> +		 unsigned int step_size, unsigned int blockcount,
> +		 unsigned int min_logblocks, unsigned int logres),
> +	TP_ARGS(mp, per_intent, step_size, blockcount, min_logblocks, logres),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, per_intent)
> +		__field(unsigned int, step_size)
> +		__field(unsigned int, blockcount)
> +		__field(unsigned int, min_logblocks)
> +		__field(unsigned int, cur_logblocks)
> +		__field(unsigned int, logres)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->per_intent = per_intent;
> +		__entry->step_size = step_size;
> +		__entry->blockcount = blockcount;
> +		__entry->min_logblocks = min_logblocks;
> +		__entry->cur_logblocks = mp->m_sb.sb_logblocks;
> +		__entry->logres = logres;
> +	),
> +	TP_printk("dev %d:%d per_intent %u step_size %u blockcount %u min_logblocks %u logblocks %u logres %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->per_intent,
> +		  __entry->step_size,
> +		  __entry->blockcount,
> +		  __entry->min_logblocks,
> +		  __entry->cur_logblocks,
> +		  __entry->logres)
> +);
> +
>  TRACE_EVENT(xlog_intent_recovery_failed,
>  	TP_PROTO(struct xfs_mount *mp, const struct xfs_defer_op_type *ops,
>  		 int error),
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  2025-04-15 12:14 ` [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() John Garry
@ 2025-04-15 17:34   ` Darrick J. Wong
  2025-04-15 17:46     ` John Garry
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 17:34 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 12:14:20PM +0000, John Garry wrote:
> For when large atomic writes (> 1x FS block) are supported, there will be
> various occasions when HW offload may not be possible.
> 
> Such instances include:
> - unaligned extent mapping wrt write length
> - extent mappings which do not cover the full write, e.g. the write spans
>   sparse or mixed-mapping extents
> - the write length is greater than HW offload can support
> 
> In those cases, we need to fall back to the CoW-based atomic write mode. For
> this, report the special code -ENOPROTOOPT to inform the caller that the HW
> offload-based method is not possible.
> 
> In addition to the occasions mentioned, if the write covers an unallocated
> range, we again judge that we need to rely on the CoW-based method when we
> would need to allocate anything more than 1x block. This is because if we
> allocate fewer blocks than are required for the write, then again the HW
> offload-based method would not be possible. So we are taking a pessimistic
> approach to writes covering unallocated space.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 63 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 049655ebc3f7..02bb8257ea24 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -798,6 +798,41 @@ imap_spans_range(
>  	return true;
>  }
>  
> +static bool
> +xfs_bmap_hw_atomic_write_possible(
> +	struct xfs_inode	*ip,
> +	struct xfs_bmbt_irec	*imap,
> +	xfs_fileoff_t		offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_fsize_t		len = XFS_FSB_TO_B(mp, end_fsb - offset_fsb);
> +
> +	/*
> +	 * atomic writes are required to be naturally aligned for disk blocks,
> +	 * which ensures that we adhere to block layer rules that we won't
> +	 * straddle any boundary or violate write alignment requirement.
> +	 */
> +	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
> +		return false;
> +
> +	/*
> +	 * Spanning multiple extents would mean that multiple BIOs would be
> +	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
> +	 * atomics.
> +	 */
> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
> +		return false;
> +
> +	/*
> +	 * The ->iomap_begin caller should ensure this, but check anyway.
> +	 */
> +	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
> +		return false;

This needs to check len against bt_bdev_awu_min so that we don't submit
too-short atomic writes to the hardware.  Let's say that the hardware
minimum is 32k and the fsblock size is 4k.  XFS can perform an out of
place write for 4k-16k writes, but right now we'll just throw invalid
commands at the bdev, and it'll return EINVAL.
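
Something like this in xfs_bmap_hw_atomic_write_possible() would cover it
(sketch only, using the fields this series already has):

	/*
	 * Too short for the hardware to do atomically; make the caller fall
	 * back to the CoW-based method instead of sending an invalid bio.
	 */
	if (len < xfs_inode_buftarg(ip)->bt_bdev_awu_min)
		return false;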

/me wonders if statx should grow an atomic_write_unit_min_opt field
too, unless everyone in block layer land is convinced that awu_min will
always match lbasize?  (I probably missed that conversation)

--D

> +
> +	return true;
> +}
> +
>  static int
>  xfs_direct_write_iomap_begin(
>  	struct inode		*inode,
> @@ -812,9 +847,11 @@ xfs_direct_write_iomap_begin(
>  	struct xfs_bmbt_irec	imap, cmap;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> +	xfs_fileoff_t		orig_end_fsb = end_fsb;
>  	int			nimaps = 1, error = 0;
>  	bool			shared = false;
>  	u16			iomap_flags = 0;
> +	bool			needs_alloc;
>  	unsigned int		lockmode;
>  	u64			seq;
>  
> @@ -875,13 +912,37 @@ xfs_direct_write_iomap_begin(
>  				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>  		if (error)
>  			goto out_unlock;
> -		if (shared)
> +		if (shared) {
> +			if ((flags & IOMAP_ATOMIC) &&
> +			    !xfs_bmap_hw_atomic_write_possible(ip, &cmap,
> +					offset_fsb, end_fsb)) {
> +				error = -ENOPROTOOPT;
> +				goto out_unlock;
> +			}
>  			goto out_found_cow;
> +		}
>  		end_fsb = imap.br_startoff + imap.br_blockcount;
>  		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>  	}
>  
> -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
> +	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> +
> +	if (flags & IOMAP_ATOMIC) {
> +		error = -ENOPROTOOPT;
> +		/*
> +		 * If we allocate less than what is required for the write
> +		 * then we may end up with multiple extents, which means that
> +		 * the REQ_ATOMIC-based method cannot be used, so avoid this possibility.
> +		 */
> +		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
> +			goto out_unlock;
> +
> +		if (!xfs_bmap_hw_atomic_write_possible(ip, &imap, offset_fsb,
> +				orig_end_fsb))
> +			goto out_unlock;
> +	}
> +
> +	if (needs_alloc)
>  		goto allocate_blocks;
>  
>  	/*
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  2025-04-15 17:34   ` Darrick J. Wong
@ 2025-04-15 17:46     ` John Garry
  0 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-15 17:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On 15/04/2025 18:34, Darrick J. Wong wrote:
>> +	/*
>> +	 * Spanning multiple extents would mean that multiple BIOs would be
>> +	 * issued, and so would lose atomicity required for REQ_ATOMIC-based
>> +	 * atomics.
>> +	 */
>> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
>> +		return false;
>> +
>> +	/*
>> +	 * The ->iomap_begin caller should ensure this, but check anyway.
>> +	 */
>> +	if (len > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
>> +		return false;
> This needs to check len against bt_bdev_awu_min so that we don't submit
> too-short atomic writes to the hardware. 

Right, let me check this.

I think that we should only support sane HW which can write 1x FS block 
or more.

> Let's say that the hardware
> minimum is 32k and the fsblock size is 4k.  XFS can perform an out of
> place write for 4k-16k writes, but right now we'll just throw invalid
> commands at the bdev, and it'll return EINVAL.
> 
> /me wonders if statx should grow a atomic_write_unit_min_opt field
> too, unless everyone in block layer land is convinced that awu_min will
> always match lbasize?  (I probably missed that conversation)

Nothing states that it should (match lbasize), but again HW which can 
only write >1 FS block is something which I don't want to support (yet).

Thanks,
John


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v7.1 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
  2025-04-15 15:35   ` Randy Dunlap
  2025-04-15 16:55   ` Darrick J. Wong
@ 2025-04-15 22:36   ` Darrick J. Wong
  2025-04-16 10:08     ` John Garry
  2 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-15 22:36 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write.  If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.

The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size).  We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.

The actual software atomic write max is still computed based off of
tr_atomic_ioend the same way it has for the past few commits.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
v7.1: make all the tweaks I already complained about here
---
 fs/xfs/libxfs/xfs_trans_resv.h    |    2 +
 fs/xfs/xfs_mount.h                |    6 ++++
 fs/xfs/xfs_trace.h                |   33 ++++++++++++++++++++
 Documentation/admin-guide/xfs.rst |   11 +++++++
 fs/xfs/libxfs/xfs_trans_resv.c    |   53 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.c                |   59 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.c                |   60 ++++++++++++++++++++++++++++++++++++-
 7 files changed, 222 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index a6d303b836883f..ea50a239c31107 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -122,5 +122,7 @@ unsigned int xfs_calc_write_reservation_minlogsize(struct xfs_mount *mp);
 unsigned int xfs_calc_qm_dqalloc_reservation_minlogsize(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_calc_max_atomic_write_fsblocks(struct xfs_mount *mp);
+bool xfs_calc_atomic_write_reservation(struct xfs_mount *mp,
+		xfs_extlen_t blockcount);
 
 #endif	/* __XFS_TRANS_RESV_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c0eff3adfa31f6..68e2acc00b5321 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -236,6 +236,9 @@ typedef struct xfs_mount {
 	bool			m_update_sb;	/* sb needs update in mount */
 	unsigned int		m_max_open_zones;
 
+	/* max_atomic_write mount option value */
+	unsigned long long	m_awu_max_bytes;
+
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
 	 * Callers must hold m_sb_lock to access these two fields.
@@ -798,4 +801,7 @@ static inline void xfs_mod_sb_delalloc(struct xfs_mount *mp, int64_t delta)
 	percpu_counter_add(&mp->m_delalloc_blks, delta);
 }
 
+int xfs_set_max_atomic_write_opt(struct xfs_mount *mp,
+		unsigned long long new_max_bytes);
+
 #endif	/* __XFS_MOUNT_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 24d73e9bbe83f4..d41885f1efe25b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -230,6 +230,39 @@ TRACE_EVENT(xfs_calc_max_atomic_write_fsblocks,
 		  __entry->blockcount)
 );
 
+TRACE_EVENT(xfs_calc_max_atomic_write_reservation,
+	TP_PROTO(struct xfs_mount *mp, unsigned int per_intent,
+		 unsigned int step_size, unsigned int blockcount,
+		 unsigned int min_logblocks, unsigned int logres),
+	TP_ARGS(mp, per_intent, step_size, blockcount, min_logblocks, logres),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, per_intent)
+		__field(unsigned int, step_size)
+		__field(unsigned int, blockcount)
+		__field(unsigned int, min_logblocks)
+		__field(unsigned int, cur_logblocks)
+		__field(unsigned int, logres)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->per_intent = per_intent;
+		__entry->step_size = step_size;
+		__entry->blockcount = blockcount;
+		__entry->min_logblocks = min_logblocks;
+		__entry->cur_logblocks = mp->m_sb.sb_logblocks;
+		__entry->logres = logres;
+	),
+	TP_printk("dev %d:%d per_intent %u step_size %u blockcount %u min_logblocks %u logblocks %u logres %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->per_intent,
+		  __entry->step_size,
+		  __entry->blockcount,
+		  __entry->min_logblocks,
+		  __entry->cur_logblocks,
+		  __entry->logres)
+);
+
 TRACE_EVENT(xlog_intent_recovery_failed,
 	TP_PROTO(struct xfs_mount *mp, const struct xfs_defer_op_type *ops,
 		 int error),
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index b67772cf36d6dc..0f6e5d4784f9c3 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -143,6 +143,17 @@ When mounting an XFS filesystem, the following options are accepted.
 	optional, and the log section can be separate from the data
 	section or contained within it.
 
+  max_atomic_write=value
+	Set the maximum size of an atomic write.  The size may be
+	specified in bytes, in kilobytes with a "k" suffix, in megabytes
+	with a "m" suffix, or in gigabytes with a "g" suffix.  The size
+	cannot be larger than the maximum write size, larger than the
+	size of any allocation group, or larger than the size of a
+	remapping operation that the log can complete atomically.
+
+	The default value is to set the maximum I/O completion size
+	to allow each CPU to handle one at a time.
+
   noalign
 	Data allocations will not be aligned at stripe unit
 	boundaries. This is only relevant to filesystems created
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index f530aa5d72f552..48e75c7ba2bb69 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -1475,3 +1475,56 @@ xfs_calc_max_atomic_write_fsblocks(
 
 	return ret;
 }
+
+/*
+ * Compute the log reservation needed to complete an atomic write of a given
+ * number of blocks.  Worst case, each block requires separate handling.
+ * Returns true if the blockcount is supported, false otherwise.
+ */
+bool
+xfs_calc_atomic_write_reservation(
+	struct xfs_mount	*mp,
+	xfs_extlen_t		blockcount)
+{
+	struct xfs_trans_res	*curr_res = &M_RES(mp)->tr_atomic_ioend;
+	unsigned int		per_intent, step_size;
+	unsigned int		logres;
+	uint			old_logres =
+		M_RES(mp)->tr_atomic_ioend.tr_logres;
+	int			min_logblocks;
+
+	/*
+	 * If the caller doesn't ask for a specific atomic write size, then
+	 * we'll conservatively use tr_itruncate as the basis for computing
+	 * a reasonable maximum.
+	 */
+	if (blockcount == 0) {
+		curr_res->tr_logres = M_RES(mp)->tr_itruncate.tr_logres;
+		return true;
+	}
+
+	/* Untorn write completions require out of place write remapping */
+	if (!xfs_has_reflink(mp))
+		return false;
+
+	per_intent = xfs_calc_atomic_write_ioend_geometry(mp, &step_size);
+
+	if (check_mul_overflow(blockcount, per_intent, &logres))
+		return false;
+	if (check_add_overflow(logres, step_size, &logres))
+		return false;
+
+	curr_res->tr_logres = logres;
+	min_logblocks = xfs_log_calc_minimum_size(mp);
+
+	trace_xfs_calc_max_atomic_write_reservation(mp, per_intent, step_size,
+			blockcount, min_logblocks, curr_res->tr_logres);
+
+	if (min_logblocks > mp->m_sb.sb_logblocks) {
+		/* Log too small, revert changes. */
+		curr_res->tr_logres = old_logres;
+		return false;
+	}
+
+	return true;
+}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index cec202cf7803d8..1eda18dfb1f667 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -737,6 +737,61 @@ xfs_compute_atomic_write_unit_max(
 			max_agsize, max_rgsize);
 }
 
+/*
+ * Try to set the atomic write maximum to a new value that we got from
+ * userspace via mount option.
+ */
+int
+xfs_set_max_atomic_write_opt(
+	struct xfs_mount	*mp,
+	unsigned long long	new_max_bytes)
+{
+	xfs_filblks_t		new_max_fsbs = XFS_B_TO_FSBT(mp, new_max_bytes);
+
+	if (new_max_bytes) {
+		xfs_extlen_t	max_write_fsbs =
+			rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
+		xfs_extlen_t	max_group_fsbs =
+			max(mp->m_groups[XG_TYPE_AG].blocks,
+			    mp->m_groups[XG_TYPE_RTG].blocks);
+
+		ASSERT(max_write_fsbs <= U32_MAX);
+
+		if (new_max_bytes % mp->m_sb.sb_blocksize > 0) {
+			xfs_warn(mp,
+ "max atomic write size of %llu bytes not aligned with fsblock",
+					new_max_bytes);
+			return -EINVAL;
+		}
+
+		if (new_max_fsbs > max_write_fsbs) {
+			xfs_warn(mp,
+ "max atomic write size of %lluk cannot be larger than max write size %lluk",
+					new_max_bytes >> 10,
+					XFS_FSB_TO_B(mp, max_write_fsbs) >> 10);
+			return -EINVAL;
+		}
+
+		if (new_max_fsbs > max_group_fsbs) {
+			xfs_warn(mp,
+ "max atomic write size of %lluk cannot be larger than allocation group size %lluk",
+					new_max_bytes >> 10,
+					XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
+			return -EINVAL;
+		}
+	}
+
+	if (!xfs_calc_atomic_write_reservation(mp, new_max_fsbs)) {
+		xfs_warn(mp,
+ "cannot support completing atomic writes of %lluk",
+				new_max_bytes >> 10);
+		return -EINVAL;
+	}
+
+	xfs_compute_atomic_write_unit_max(mp);
+	return 0;
+}
+
 /* Compute maximum possible height for realtime btree types for this fs. */
 static inline void
 xfs_rtbtree_compute_maxlevels(
@@ -1158,7 +1213,9 @@ xfs_mountfs(
 	 * derived from transaction reservations, so we must do this after the
 	 * log is fully initialized.
 	 */
-	xfs_compute_atomic_write_unit_max(mp);
+	error = xfs_set_max_atomic_write_opt(mp, mp->m_awu_max_bytes);
+	if (error)
+		goto out_agresv;
 
 	return 0;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b2dd0c0bf50979..9f422bcf651801 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -111,7 +111,7 @@ enum {
 	Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
 	Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
 	Opt_discard, Opt_nodiscard, Opt_dax, Opt_dax_enum, Opt_max_open_zones,
-	Opt_lifetime, Opt_nolifetime,
+	Opt_lifetime, Opt_nolifetime, Opt_max_atomic_write,
 };
 
 static const struct fs_parameter_spec xfs_fs_parameters[] = {
@@ -159,6 +159,7 @@ static const struct fs_parameter_spec xfs_fs_parameters[] = {
 	fsparam_u32("max_open_zones",	Opt_max_open_zones),
 	fsparam_flag("lifetime",	Opt_lifetime),
 	fsparam_flag("nolifetime",	Opt_nolifetime),
+	fsparam_string("max_atomic_write",	Opt_max_atomic_write),
 	{}
 };
 
@@ -241,6 +242,9 @@ xfs_fs_show_options(
 
 	if (mp->m_max_open_zones)
 		seq_printf(m, ",max_open_zones=%u", mp->m_max_open_zones);
+	if (mp->m_awu_max_bytes)
+		seq_printf(m, ",max_atomic_write=%lluk",
+				mp->m_awu_max_bytes >> 10);
 
 	return 0;
 }
@@ -1334,6 +1338,42 @@ suffix_kstrtoint(
 	return ret;
 }
 
+static int
+suffix_kstrtoull(
+	const char		*s,
+	unsigned int		base,
+	unsigned long long	*res)
+{
+	int			last, shift_left_factor = 0;
+	unsigned long long	_res;
+	char			*value;
+	int			ret = 0;
+
+	value = kstrdup(s, GFP_KERNEL);
+	if (!value)
+		return -ENOMEM;
+
+	last = strlen(value) - 1;
+	if (value[last] == 'K' || value[last] == 'k') {
+		shift_left_factor = 10;
+		value[last] = '\0';
+	}
+	if (value[last] == 'M' || value[last] == 'm') {
+		shift_left_factor = 20;
+		value[last] = '\0';
+	}
+	if (value[last] == 'G' || value[last] == 'g') {
+		shift_left_factor = 30;
+		value[last] = '\0';
+	}
+
+	if (kstrtoull(value, base, &_res))
+		ret = -EINVAL;
+	kfree(value);
+	*res = _res << shift_left_factor;
+	return ret;
+}
+
 static inline void
 xfs_fs_warn_deprecated(
 	struct fs_context	*fc,
@@ -1518,6 +1558,14 @@ xfs_fs_parse_param(
 	case Opt_nolifetime:
 		parsing_mp->m_features |= XFS_FEAT_NOLIFETIME;
 		return 0;
+	case Opt_max_atomic_write:
+		if (suffix_kstrtoull(param->string, 10,
+				     &parsing_mp->m_awu_max_bytes)) {
+			xfs_warn(parsing_mp,
+ "max atomic write size must be positive integer");
+			return -EINVAL;
+		}
+		return 0;
 	default:
 		xfs_warn(parsing_mp, "unknown mount option [%s].", param->key);
 		return -EINVAL;
@@ -2114,6 +2162,16 @@ xfs_fs_reconfigure(
 	if (error)
 		return error;
 
+	/* Validate new max_atomic_write option before making other changes */
+	if (mp->m_awu_max_bytes != new_mp->m_awu_max_bytes) {
+		error = xfs_set_max_atomic_write_opt(mp,
+				new_mp->m_awu_max_bytes);
+		if (error)
+			return error;
+
+		mp->m_awu_max_bytes = new_mp->m_awu_max_bytes;
+	}
+
 	/* inode32 -> inode64 */
 	if (xfs_has_small_inums(mp) && !xfs_has_small_inums(new_mp)) {
 		mp->m_features &= ~XFS_FEAT_SMALL_INUMS;

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v7.1 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-15 22:36   ` [PATCH v7.1 " Darrick J. Wong
@ 2025-04-16 10:08     ` John Garry
  2025-04-16 16:26       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-16 10:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On 15/04/2025 23:36, Darrick J. Wong wrote:

Thanks for this, but it still seems to be problematic for me.

In my test, I have agsize=22400, and when I attempt to mount with 
atomic_write_max=8M, it passes when it shouldn't. It should not because 
max_pow_of_two_factor(22400) = 128, and 8MB > 128 FSB.

How about these addition checks:

> +
> +	if (new_max_bytes) {
> +		xfs_extlen_t	max_write_fsbs =
> +			rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
> +		xfs_extlen_t	max_group_fsbs =
> +			max(mp->m_groups[XG_TYPE_AG].blocks,
> +			    mp->m_groups[XG_TYPE_RTG].blocks);
> +
> +		ASSERT(max_write_fsbs <= U32_MAX);

		if (!is_power_of_2(new_max_bytes)) {
			xfs_warn(mp,
  "max atomic write size of %llu bytes is not a power-of-2",
					new_max_bytes);
			return -EINVAL;
		}

> +
> +		if (new_max_bytes % mp->m_sb.sb_blocksize > 0) {
> +			xfs_warn(mp,
> + "max atomic write size of %llu bytes not aligned with fsblock",
> +					new_max_bytes);
> +			return -EINVAL;
> +		}
> +
> +		if (new_max_fsbs > max_write_fsbs) {
> +			xfs_warn(mp,
> + "max atomic write size of %lluk cannot be larger than max write size %lluk",
> +					new_max_bytes >> 10,
> +					XFS_FSB_TO_B(mp, max_write_fsbs) >> 10);
> +			return -EINVAL;
> +		}
> +
> +		if (new_max_fsbs > max_group_fsbs) {
> +			xfs_warn(mp,
> + "max atomic write size of %lluk cannot be larger than allocation group size %lluk",
> +					new_max_bytes >> 10,
> +					XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
> +			return -EINVAL;
> +		}
> +	}
> +

	if (new_max_fsbs > max_pow_of_two_factor(max_group_fsbs)) {
		xfs_warn(mp,
  "max atomic write size of %lluk not aligned with allocation group size 
%lluk",
				new_max_bytes >> 10,
				XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
		return -EINVAL;
	}

thanks,
John

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7.1 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  2025-04-16 10:08     ` John Garry
@ 2025-04-16 16:26       ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-16 16:26 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Wed, Apr 16, 2025 at 11:08:25AM +0100, John Garry wrote:
> On 15/04/2025 23:36, Darrick J. Wong wrote:
> 
> Thanks for this, but it still seems to be problematic for me.
> 
> In my test, I have agsize=22400, and when I attempt to mount with
> atomic_write_max=8M, it passes when it shouldn't. It should not because
> max_pow_of_two_factor(22400) = 128, and 8MB > 128 FSB.
> 
> How about these addition checks:
> 
> > +
> > +	if (new_max_bytes) {
> > +		xfs_extlen_t	max_write_fsbs =
> > +			rounddown_pow_of_two(XFS_B_TO_FSB(mp, MAX_RW_COUNT));
> > +		xfs_extlen_t	max_group_fsbs =
> > +			max(mp->m_groups[XG_TYPE_AG].blocks,
> > +			    mp->m_groups[XG_TYPE_RTG].blocks);
> > +
> > +		ASSERT(max_write_fsbs <= U32_MAX);
> 
> 		if (!is_power_of_2(new_max_bytes)) {
> 			xfs_warn(mp,
>  "max atomic write size of %llu bytes is not a power-of-2",
> 					new_max_bytes);
> 			return -EINVAL;
> 		}

Long-term I'm not convinced that we really need to have all these power
of two checks because the software fallback can remap just about
anything, but for now I see no harm in doing this because
generic_atomic_write_valid enforces that property on the IO length.

> > +
> > +		if (new_max_bytes % mp->m_sb.sb_blocksize > 0) {
> > +			xfs_warn(mp,
> > + "max atomic write size of %llu bytes not aligned with fsblock",
> > +					new_max_bytes);
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (new_max_fsbs > max_write_fsbs) {
> > +			xfs_warn(mp,
> > + "max atomic write size of %lluk cannot be larger than max write size %lluk",
> > +					new_max_bytes >> 10,
> > +					XFS_FSB_TO_B(mp, max_write_fsbs) >> 10);
> > +			return -EINVAL;
> > +		}
> > +
> > +		if (new_max_fsbs > max_group_fsbs) {
> > +			xfs_warn(mp,
> > + "max atomic write size of %lluk cannot be larger than allocation group size %lluk",
> > +					new_max_bytes >> 10,
> > +					XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> 
> 	if (new_max_fsbs > max_pow_of_two_factor(max_group_fsbs)) {
> 		xfs_warn(mp,
>  "max atomic write size of %lluk not aligned with allocation group size
> %lluk",
> 				new_max_bytes >> 10,
> 				XFS_FSB_TO_B(mp, max_group_fsbs) >> 10);
> 		return -EINVAL;

I think I'd rather clean up these bits:

	if (mp->m_ddev_targp->bt_bdev_awu_min > 0)
		max_agsize = max_pow_of_two_factor(mp->m_sb.sb_agblocks);
	else
		max_agsize = mp->m_ag_max_usable;

and

	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev_awu_min > 0)
		max_rgsize = max_pow_of_two_factor(rgs->blocks);
	else
		max_rgsize = rgs->blocks;

into a shared helper for xfs_compute_atomic_write_unit_max so that we
use the exact same logic in both places.  But I agree with the general
direction.
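
Roughly this, perhaps (sketch only; the helper name is invented here, and the
callers would pass sb_agblocks/m_ag_max_usable for the AG case and rgs->blocks
for both arguments in the rtgroup case):

static inline xfs_extlen_t
xfs_calc_group_awu_max(
	struct xfs_buftarg	*btp,
	xfs_extlen_t		aligned_blocks,
	xfs_extlen_t		usable_blocks)
{
	/*
	 * Hardware atomic writes have to line up with the start of every
	 * group, so only a power-of-two factor of the group size is usable;
	 * otherwise the whole usable block count is fine.
	 */
	if (btp && btp->bt_bdev_awu_min > 0)
		return max_pow_of_two_factor(aligned_blocks);
	return usable_blocks;
}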

--D

> 	}
> 
> thanks,
> John
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-15 12:14 ` [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic() John Garry
@ 2025-04-21  4:00   ` Darrick J. Wong
  2025-04-21  5:47     ` John Garry
  2025-04-21 21:18   ` Luis Chamberlain
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-21  4:00 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Tue, Apr 15, 2025 at 12:14:22PM +0000, John Garry wrote:
> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> 
> The function works based on two operating modes:
> - HW offload, i.e. REQ_ATOMIC-based
> - CoW-based, with out-of-place writes and atomic extent remapping
> 
> The preferred method is HW offload as it will be faster. If HW offload is
> not possible, then we fallback to the CoW-based method.
> 
> HW offload would not be possible for the write length exceeding the HW
> offload limit, the write spanning multiple extents, unaligned disk blocks,
> etc.
> 
> Apart from the write exceeding the HW offload limit, other conditions for
> HW offload can only be detected in the iomap handling for the write. As
> such, we use a fallback method to issue the write if we detect in the
> ->iomap_begin() handler that HW offload is not possible. Special code
> -ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload is
> not possible.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_file.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ba4b02abc6e4..81a377f65aa3 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -728,6 +728,72 @@ xfs_file_dio_write_zoned(
>  	return ret;
>  }
>  
> +/*
> + * Handle block atomic writes
> + *
> + * Two methods of atomic writes are supported:
> + * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
> + *   disk
> + * - COW-based, which uses a COW fork as a staging extent for data updates
> + *   before atomically updating extent mappings for the range being written
> + *
> + */
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	ssize_t			ret, ocount = iov_iter_count(from);
> +	const struct iomap_ops	*dops;
> +
> +	/*
> +	 * HW offload should be faster, so try that first if it is already
> +	 * known that the write length is not too large.
> +	 */
> +	if (ocount > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
> +		dops = &xfs_atomic_write_cow_iomap_ops;
> +	else
> +		dops = &xfs_direct_write_iomap_ops;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/* Demote similar to xfs_file_dio_write_aligned() */
> +	if (iolock == XFS_IOLOCK_EXCL) {
> +		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> +		iolock = XFS_IOLOCK_SHARED;
> +	}
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
> +			0, NULL, 0);
> +
> +	/*
> +	 * The retry mechanism is based on the ->iomap_begin method returning
> +	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
> +	 * possible. The REQ_ATOMIC-based method would typically not be possible if
> +	 * the write spans multiple extents or the disk blocks are misaligned.
> +	 */
> +	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
> +		xfs_iunlock(ip, iolock);
> +		dops = &xfs_atomic_write_cow_iomap_ops;
> +		goto retry;
> +	}
> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +
>  /*
>   * Handle block unaligned direct I/O writes
>   *
> @@ -843,6 +909,8 @@ xfs_file_dio_write(
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	if (xfs_is_zoned_inode(ip))
>  		return xfs_file_dio_write_zoned(ip, iocb, from);

What happens to an IOCB_ATOMIC write to a zoned file?  I think the
ioend for an atomic write to a zoned file involves a similar change as
an outofplace atomic write to a file (one big transaction to absorb
all the mapping changes) but I don't think the zoned code quite does
that...?

--D

> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from,
>  			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-21  4:00   ` Darrick J. Wong
@ 2025-04-21  5:47     ` John Garry
  2025-04-21 16:42       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-21  5:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On 21/04/2025 05:00, Darrick J. Wong wrote:
>> @@ -843,6 +909,8 @@ xfs_file_dio_write(
>>   		return xfs_file_dio_write_unaligned(ip, iocb, from);
>>   	if (xfs_is_zoned_inode(ip))
>>   		return xfs_file_dio_write_zoned(ip, iocb, from);
> What happens to an IOCB_ATOMIC write to a zoned file?  I think the
> ioend for an atomic write to a zoned file involves a similar change as
> an outofplace atomic write to a file (one big transaction to absorb
> all the mapping changes) but I don't think the zoned code quite does
> that...?

Correct. For now, I don't think that we should try to support zoned 
device atomic writes. However we don't have proper checks for this. How 
about adding a xfs_has_zoned() check in xfs_get_atomic_write_{min, max, 
opt}()?
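
Something like this at the top of each of those helpers (sketch only):

	if (xfs_has_zoned(ip->i_mount))
		return 0;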

Thanks,
John

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-21  5:47     ` John Garry
@ 2025-04-21 16:42       ` Darrick J. Wong
  2025-04-23  5:42         ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-21 16:42 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, hch, viro, jack, cem, linux-fsdevel, dchinner, linux-xfs,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, linux-ext4,
	linux-block, catherine.hoang, linux-api

On Mon, Apr 21, 2025 at 06:47:44AM +0100, John Garry wrote:
> On 21/04/2025 05:00, Darrick J. Wong wrote:
> > > @@ -843,6 +909,8 @@ xfs_file_dio_write(
> > >   		return xfs_file_dio_write_unaligned(ip, iocb, from);
> > >   	if (xfs_is_zoned_inode(ip))
> > >   		return xfs_file_dio_write_zoned(ip, iocb, from);
> > What happens to an IOCB_ATOMIC write to a zoned file?  I think the
> > ioend for an atomic write to a zoned file involves a similar change as
> > an outofplace atomic write to a file (one big transaction to absorb
> > all the mapping changes) but I don't think the zoned code quite does
> > that...?
> 
> Correct. For now, I don't think that we should try to support zoned device
> atomic writes. However we don't have proper checks for this. How about
> adding a xfs_has_zoned() check in xfs_get_atomic_write_{min, max, opt}()?

Well it turns out that was a stupid question -- zoned=1 can't be enabled
with reflink, which means there's no cow fallback so atomic writes just
plain don't work:

$ xfs_info /mnt
meta-data=/dev/sda               isize=512    agcount=4, agsize=32768 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=0    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=1
data     =                       bsize=4096   blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =internal               extsz=4096   blocks=5061632, rtextents=5061632
         =                       rgcount=78   rgsize=65536 extents
         =                       zoned=1      start=131072 reserved=0
$ xfs_io -c 'statx -r -m 0x10000' /mnt/a | grep atomic
stat.atomic_write_unit_min = 0
stat.atomic_write_unit_max = 0
stat.atomic_write_segments_max = 0
stat.atomic_write_unit_max_opt = 0

I /think/ all you'd have to do is create an xfs_zoned_end_atomic_io
function that does what xfs_zoned_end_io does, but with a single
tr_atomic_ioend transaction; figure out how to convey "this is an
atomic out of place write" to xfs_end_ioend so that it knows to call
xfs_zoned_end_atomic_io; and then update the xfs_get_atomic_write*
helpers.

--D

> Thanks,
> John

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-15 12:14 ` [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic() John Garry
  2025-04-21  4:00   ` Darrick J. Wong
@ 2025-04-21 21:18   ` Luis Chamberlain
  2025-04-22  6:08     ` John Garry
  1 sibling, 1 reply; 40+ messages in thread
From: Luis Chamberlain @ 2025-04-21 21:18 UTC (permalink / raw)
  To: John Garry, Chris Mason, Josef Bacik
  Cc: brauner, djwong, hch, viro, jack, cem, linux-fsdevel, dchinner,
	linux-xfs, linux-kernel, ojaswin, ritesh.list, martin.petersen,
	linux-ext4, linux-block, catherine.hoang, linux-api,
	Pankaj Raghav, Daniel Gomez

On Tue, Apr 15, 2025 at 12:14:22PM +0000, John Garry wrote:
> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> 
> The function works based on two operating modes:
> - HW offload, i.e. REQ_ATOMIC-based
> - CoW-based, with out-of-place writes and atomic extent remapping
> 
> The preferred method is HW offload as it will be faster. If HW offload is
> not possible, then we fall back to the CoW-based method.
> 
> HW offload is not possible when the write length exceeds the HW offload
> limit, the write spans multiple extents, the disk blocks are misaligned,
> etc.
> 
> Apart from the write exceeding the HW offload limit, the other conditions
> for HW offload can only be detected in the iomap handling for the write.
> As such, we use a fallback method to issue the write if we detect in the
> ->iomap_begin() handler that HW offload is not possible. The special error
> code -ENOPROTOOPT is returned from ->iomap_begin() to indicate that HW
> offload is not possible.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> ---
>  fs/xfs/xfs_file.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ba4b02abc6e4..81a377f65aa3 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -728,6 +728,72 @@ xfs_file_dio_write_zoned(
>  	return ret;
>  }
>  
> +/*
> + * Handle block atomic writes
> + *
> + * Two methods of atomic writes are supported:
> + * - REQ_ATOMIC-based, which would typically use some form of HW offload in the
> + *   disk
> + * - COW-based, which uses a COW fork as a staging extent for data updates
> + *   before atomically updating extent mappings for the range being written
> + *
> + */
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	ssize_t			ret, ocount = iov_iter_count(from);
> +	const struct iomap_ops	*dops;
> +
> +	/*
> +	 * HW offload should be faster, so try that first if it is already
> +	 * known that the write length is not too large.
> +	 */
> +	if (ocount > xfs_inode_buftarg(ip)->bt_bdev_awu_max)
> +		dops = &xfs_atomic_write_cow_iomap_ops;
> +	else
> +		dops = &xfs_direct_write_iomap_ops;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock, NULL);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/* Demote similar to xfs_file_dio_write_aligned() */
> +	if (iolock == XFS_IOLOCK_EXCL) {
> +		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> +		iolock = XFS_IOLOCK_SHARED;
> +	}
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, dops, &xfs_dio_write_ops,
> +			0, NULL, 0);
> +
> +	/*
> +	 * The retry mechanism is based on the ->iomap_begin method returning
> +	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
> +	 * possible. The REQ_ATOMIC-based method would typically not be possible if
> +	 * the write spans multiple extents or the disk blocks are misaligned.
> +	 */
> +	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {

Based on feedback from LSFMM, due to the performance variability this
can introduce, it sounded like some folks would like to opt in to not
having a software fallback and instead just get an error.

Could an option be added to not allow the software fallback?

If so, then I think the next patch would also need updating.

Or are you suggesting that without the software fallback atomic writes
greater than fs block size are not possible?

  Luis

> +		xfs_iunlock(ip, iolock);
> +		dops = &xfs_atomic_write_cow_iomap_ops;
> +		goto retry;
> +	}
> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +
>  /*
>   * Handle block unaligned direct I/O writes
>   *
> @@ -843,6 +909,8 @@ xfs_file_dio_write(
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
>  	if (xfs_is_zoned_inode(ip))
>  		return xfs_file_dio_write_zoned(ip, iocb, from);
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from,
>  			&xfs_direct_write_iomap_ops, &xfs_dio_write_ops, NULL);
>  }
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-21 21:18   ` Luis Chamberlain
@ 2025-04-22  6:08     ` John Garry
  2025-04-23  5:18       ` Luis Chamberlain
  2025-04-23  5:44       ` Christoph Hellwig
  0 siblings, 2 replies; 40+ messages in thread
From: John Garry @ 2025-04-22  6:08 UTC (permalink / raw)
  To: Luis Chamberlain, Chris Mason, Josef Bacik
  Cc: brauner, djwong, hch, viro, jack, cem, linux-fsdevel, dchinner,
	linux-xfs, linux-kernel, ojaswin, ritesh.list, martin.petersen,
	linux-ext4, linux-block, catherine.hoang, linux-api,
	Pankaj Raghav, Daniel Gomez

On 21/04/2025 22:18, Luis Chamberlain wrote:
>> /*
>> +	 * The retry mechanism is based on the ->iomap_begin method returning
>> +	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
>> +	 * possible. The REQ_ATOMIC-based method typically not be possible if
>> +	 * possible. The REQ_ATOMIC-based method would typically not be possible if
>> +	 */
>> +	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
> Based on feedback from LSFMM, due to the performance variability this
> can introduce, it sounded like some folks would like to opt in to not
> having a software fallback and instead just get an error.
 > > Could an option be added to not allow the software fallback?

I still don't see the use in this.

So consider userspace wants to write something atomically and we fail as 
a HW-based atomic write is not possible. What is userspace going to do next?

I heard something like "if HW-based atomics are not possible, then 
something has not been configured properly for the FS" - that something 
would be extent granularity and alignment, but we don't have a method to 
ensure this. That is the whole point of having a FS fallback.

> 
> If so, then I think the next patch would also need updating.
> 
> Or are you suggesting that without the software fallback atomic writes
> greater than fs block size are not possible?

Yes, as XFS has no method to guarantee extent granularity and alignment.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-22  6:08     ` John Garry
@ 2025-04-23  5:18       ` Luis Chamberlain
  2025-04-23  7:08         ` John Garry
  2025-04-23  5:44       ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Luis Chamberlain @ 2025-04-23  5:18 UTC (permalink / raw)
  To: John Garry
  Cc: Chris Mason, Josef Bacik, brauner, djwong, hch, viro, jack, cem,
	linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, Pankaj Raghav, Daniel Gomez

On Tue, Apr 22, 2025 at 07:08:32AM +0100, John Garry wrote:
> On 21/04/2025 22:18, Luis Chamberlain wrote:
> > > /*
> > > +	 * The retry mechanism is based on the ->iomap_begin method returning
> > > +	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
> > > +	 * possible. The REQ_ATOMIC-based method would typically not be possible if
> > > +	 * the write spans multiple extents or the disk blocks are misaligned.
> > > +	 */
> > > +	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
> > Based on feedback from LSFMM, due to the performance variability this
> > can introduce, it sounded like some folks would like to opt in to not
> > having a software fallback and instead just get an error.
> > > Could an option be added to not allow the software fallback?
> 
> I still don't see the use in this.

It's not about the use, it's the concern about nondeterminism in performance.

> So consider userspace wants to write something atomically and we fail as a
> HW-based atomic write is not possible.

Sounds like a terrible predicament for those that want hw atomics and
reliability from them.

> What is userspace going to do next?

It would seem that would depend on their analysis of how often the
software fallback (the software atomic-based solution) is hit and its
impact on performance.

> I heard something like "if HW-based atomics are not possible, then something
> has not been configured properly for the FS" - that something would be
> extent granularity and alignment, but we don't have a method to ensure this.
> That is the whole point of having a FS fallback.

We do with LBS. It's perfectly deterministic to be aligned with a sector
size matching the block size, even for metadata writes.

> > If so, then I think the next patch would also need updating.
> > 
> > Or are you suggesting that without the software fallback atomic writes
> > greater than fs block size are not possible?
> 
> Yes, as XFS has no method to guarantee extent granularity and alignment.

Ah, I think the documentation for this feature should make this clear;
it was not clear up to this point in the patch review.

  Luis

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-21 16:42       ` Darrick J. Wong
@ 2025-04-23  5:42         ` Christoph Hellwig
  2025-04-23  8:19           ` Christoph Hellwig
  2025-04-23 14:53           ` Darrick J. Wong
  0 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2025-04-23  5:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, brauner, hch, viro, jack, cem, linux-fsdevel,
	dchinner, linux-xfs, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, linux-ext4, linux-block, catherine.hoang,
	linux-api

On Mon, Apr 21, 2025 at 09:42:41AM -0700, Darrick J. Wong wrote:
> Well it turns out that was a stupid question -- zoned=1 can't be enabled
> with reflink, which means there's no cow fallback so atomic writes just
> plain don't work:

Exactly.  It is still on my todo list to support it, but there are a
few higher priority items on it as well, in addition to constant
interruptions for patch reviews :)

> I /think/ all you'd have to do is create an xfs_zoned_end_atomic_io
> function that does what xfs_zoned_end_io but with a single
> tr_atomic_ioend transaction; figure out how to convey "this is an
> atomic out of place write" to xfs_end_ioend so that it knows to call
> xfs_zoned_end_atomic_io; and then update the xfs_get_atomic_write*
> helpers.

Roughly, yes.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-22  6:08     ` John Garry
  2025-04-23  5:18       ` Luis Chamberlain
@ 2025-04-23  5:44       ` Christoph Hellwig
  2025-04-23  7:02         ` John Garry
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-04-23  5:44 UTC (permalink / raw)
  To: John Garry
  Cc: Luis Chamberlain, Chris Mason, Josef Bacik, brauner, djwong, hch,
	viro, jack, cem, linux-fsdevel, dchinner, linux-xfs, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, Pankaj Raghav, Daniel Gomez

On Tue, Apr 22, 2025 at 07:08:32AM +0100, John Garry wrote:
> So consider userspace wants to write something atomically and we fail as a 
> HW-based atomic write is not possible. What is userspace going to do next?

Exactly.

>
> I heard something like "if HW-based atomics are not possible, then 
> something has not been configured properly for the FS" - that something 
> would be extent granularity and alignment, but we don't have a method to 
> ensure this. That is the whole point of having a FS fallback.

We now have the opt limit, right? (I'll review the reposted series
ASAP, but for now I'll assume it)  They can just tune their applications
to it, and trigger on a trace point for the fallback to monitor it.
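
For example, something like this rough userspace sketch (assuming
headers that expose STATX_WRITE_ATOMIC, RWF_ATOMIC and the
stx_atomic_write_unit_max_opt field added by patch 01/14; error handling
trimmed, names illustrative only):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/fs.h>		/* RWF_ATOMIC */

/* Issue an atomic write only if it fits within the reported "opt" limit. */
static ssize_t atomic_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct statx stx = { };
	unsigned int max;

	if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) < 0)
		return -1;

	/* Fall back to the hard max if no separate opt limit is reported. */
	max = stx.stx_atomic_write_unit_max_opt ?
	      stx.stx_atomic_write_unit_max_opt : stx.stx_atomic_write_unit_max;
	if (!max || len > max)
		return -1;	/* caller should split or skip RWF_ATOMIC */

	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}

Per the quoted xfs_file_dio_write_atomic() logic, writes larger than the
HW limit go straight to the CoW-based path, and smaller ones may still
fall back if the extents are misaligned - which the trace point would
catch.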


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  5:44       ` Christoph Hellwig
@ 2025-04-23  7:02         ` John Garry
  0 siblings, 0 replies; 40+ messages in thread
From: John Garry @ 2025-04-23  7:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, Chris Mason, Josef Bacik, brauner, djwong, viro,
	jack, cem, linux-fsdevel, dchinner, linux-xfs, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, Pankaj Raghav, Daniel Gomez

On 23/04/2025 06:44, Christoph Hellwig wrote:
> On Tue, Apr 22, 2025 at 07:08:32AM +0100, John Garry wrote:
>> So consider userspace wants to write something atomically and we fail as a
>> HW-based atomic write is not possible. What is userspace going to do next?
> Exactly.
> 
>> I heard something like "if HW-based atomics are not possible, then
>> something has not been configured properly for the FS" - that something
>> would be extent granularity and alignment, but we don't have a method to
>> ensure this. That is the whole point of having a FS fallback.
> We now have the opt limit, right? 

right

> (I'll review the reposted series
> ASAP, 

ok, cheers

> but for now I'll assume it)  They can just tune their applications
> to it, and trigger on a trace point for the fallback to monitor it.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  5:18       ` Luis Chamberlain
@ 2025-04-23  7:08         ` John Garry
  2025-04-23  7:36           ` Luis Chamberlain
  0 siblings, 1 reply; 40+ messages in thread
From: John Garry @ 2025-04-23  7:08 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Chris Mason, Josef Bacik, brauner, djwong, hch, viro, jack, cem,
	linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, Pankaj Raghav, Daniel Gomez

On 23/04/2025 06:18, Luis Chamberlain wrote:
> On Tue, Apr 22, 2025 at 07:08:32AM +0100, John Garry wrote:
>> On 21/04/2025 22:18, Luis Chamberlain wrote:
>>>> /*
>>>> +	 * The retry mechanism is based on the ->iomap_begin method returning
>>>> +	 * -ENOPROTOOPT, which would be when the REQ_ATOMIC-based write is not
>>>> +	 * possible. The REQ_ATOMIC-based method would typically not be possible if
>>>> +	 * the write spans multiple extents or the disk blocks are misaligned.
>>>> +	 */
>>>> +	if (ret == -ENOPROTOOPT && dops == &xfs_direct_write_iomap_ops) {
>>> Based on feedback from LSFMM, due to the performance variability this
>>> can introduce, it sounded like some folks would like to opt in to not
>>> having a software fallback and instead just get an error.
>>>> Could an option be added to not allow the software fallback?
>>
>> I still don't see the use in this.
> 
> It's not about the use, it's the concern about nondeterminism in performance.

Sure, we don't offer RT performance guarantees, but what does?

> 
>> So consider userspace wants to write something atomically and we fail as a
>> HW-based atomic write is not possible.
> 
> Sounds like a terrible predicament for those that want hw atomics and
> reliability from them.

Well from our MySQL testing performance is good.

> 
>> What is userspace going to do next?
> 
> It would seem that would depend on their analysis of how often the
> software fallback (the software atomic-based solution) is hit and its
> impact on performance.

sorry, but I don't understand this

> 
>> I heard something like "if HW-based atomics are not possible, then something
>> has not been configured properly for the FS" - that something would be
>> extent granularity and alignment, but we don't have a method to ensure this.
>> That is the whole point of having a FS fallback.
> 
> We do with LBS.

Sure, but not everyone wants LBS

> It's perfectly deterministic to be aligned with a sector
> size matching the block size, even for metadata writes.
> 
>>> If so, then I think the next patch would also need updating.
>>>
>>> Or are you suggesting that without the software fallback atomic writes
>>> greater than fs block size are not possible?
>>
>> Yes, as XFS has no method to guarantee extent granularity and alignment.
> 
> Ah, I think the documentation for this feature should make this clear;
> it was not clear up to this point in the patch review.
> 

ok, that can be added


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  7:08         ` John Garry
@ 2025-04-23  7:36           ` Luis Chamberlain
  0 siblings, 0 replies; 40+ messages in thread
From: Luis Chamberlain @ 2025-04-23  7:36 UTC (permalink / raw)
  To: John Garry
  Cc: Chris Mason, Josef Bacik, brauner, djwong, hch, viro, jack, cem,
	linux-fsdevel, dchinner, linux-xfs, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, linux-ext4, linux-block,
	catherine.hoang, linux-api, Pankaj Raghav, Daniel Gomez

On Wed, Apr 23, 2025 at 08:08:40AM +0100, John Garry wrote:
> On 23/04/2025 06:18, Luis Chamberlain wrote:
> > On Tue, Apr 22, 2025 at 07:08:32AM +0100, John Garry wrote:
> > > On 21/04/2025 22:18, Luis Chamberlain wrote:
> > 
> > Sounds like a terrible predicament for those that want hw atomics and
> > reliability from them.
> 
> Well from our MySQL testing performance is good.

Good to hear, thanks!

  Luis

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  5:42         ` Christoph Hellwig
@ 2025-04-23  8:19           ` Christoph Hellwig
  2025-04-23 14:51             ` Darrick J. Wong
  2025-04-23 14:53           ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2025-04-23  8:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, brauner, hch, viro, jack, cem, linux-fsdevel,
	dchinner, linux-xfs, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, linux-ext4, linux-block, catherine.hoang,
	linux-api

On Wed, Apr 23, 2025 at 07:42:51AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 21, 2025 at 09:42:41AM -0700, Darrick J. Wong wrote:
> > Well it turns out that was a stupid question -- zoned=1 can't be enabled
> > with reflink, which means there's no cow fallback so atomic writes just
> > plain don't work:
> 
> Exactly.  It is still on my todo list to support it, but there are a
> few higher priority items on it as well, in addition to constant
> interruptions for patch reviews :)

Actually, for zoned we don't need reflink support - as we always write
out of place, only the stuffing of multiple remaps into a single transaction
is needed.  Still no need to force John to do this work, I can look into
this (probably fairly trivial) work once we have good enough test cases
in xfstests that I can trust them to verify I got things right.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  8:19           ` Christoph Hellwig
@ 2025-04-23 14:51             ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-23 14:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, brauner, viro, jack, cem, linux-fsdevel, dchinner,
	linux-xfs, linux-kernel, ojaswin, ritesh.list, martin.petersen,
	linux-ext4, linux-block, catherine.hoang, linux-api

On Wed, Apr 23, 2025 at 10:19:02AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 23, 2025 at 07:42:51AM +0200, Christoph Hellwig wrote:
> > On Mon, Apr 21, 2025 at 09:42:41AM -0700, Darrick J. Wong wrote:
> > > Well it turns out that was a stupid question -- zoned=1 can't be enabled
> > > with reflink, which means there's no cow fallback so atomic writes just
> > > plain don't work:
> > 
> > Exactly.  It is still on my todo list to support it, but there are a
> > few higher priority items on it as well, in addition to constant
> > interruptions for patch reviews :)
> 
> Actually, for zoned we don't need reflink support - as we always write
> out of place, only the stuffing of multiple remaps into a single transaction
> is needed.  Still no need to force John to do this work, I can look into
> this (probably fairly trivial) work once we have good enough test cases
> in xfstests that I can trust them to verify I got things right.

<nod> I think we'll need a new fstest to set an error trap on a step
midway through a multi-extent ioend completion to make sure that's
actually working properly.  And probably new write commands for fsx and
fsstress to exercise RWF_ATOMIC.

(Catherine: please send the accumulated atomic writes fstests)

--D

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic()
  2025-04-23  5:42         ` Christoph Hellwig
  2025-04-23  8:19           ` Christoph Hellwig
@ 2025-04-23 14:53           ` Darrick J. Wong
  1 sibling, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2025-04-23 14:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, brauner, viro, jack, cem, linux-fsdevel, dchinner,
	linux-xfs, linux-kernel, ojaswin, ritesh.list, martin.petersen,
	linux-ext4, linux-block, catherine.hoang, linux-api

On Wed, Apr 23, 2025 at 07:42:51AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 21, 2025 at 09:42:41AM -0700, Darrick J. Wong wrote:
> > Well it turns out that was a stupid question -- zoned=1 can't be enabled
> > with reflink, which means there's no cow fallback so atomic writes just
> > plain don't work:
> 
> Exactly.  It is still on my todo list to support it, but there are a
> few higher priority items on it as well, in addition to constant
> interruptions for patch reviews :)

Heheh I'll try to go dig through all the stuff you've sent to the list
yesterday, though I've got an hour of paperwork ahead of me that has to
get done asap. :/

--D

> > I /think/ all you'd have to do is create an xfs_zoned_end_atomic_io
> > function that does what xfs_zoned_end_io does, but with a single
> > tr_atomic_ioend transaction; figure out how to convey "this is an
> > atomic out of place write" to xfs_end_ioend so that it knows to call
> > xfs_zoned_end_atomic_io; and then update the xfs_get_atomic_write*
> > helpers.
> 
> Roughly, yes.
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2025-04-23 14:53 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-15 12:14 [PATCH v7 00/14] large atomic writes for xfs John Garry
2025-04-15 12:14 ` [PATCH v7 01/14] fs: add atomic write unit max opt to statx John Garry
2025-04-15 12:14 ` [PATCH v7 02/14] xfs: add helpers to compute log item overhead John Garry
2025-04-15 12:14 ` [PATCH v7 03/14] xfs: add helpers to compute transaction reservation for finishing intent items John Garry
2025-04-15 12:14 ` [PATCH v7 04/14] xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite() John Garry
2025-04-15 12:14 ` [PATCH v7 05/14] xfs: allow block allocator to take an alignment hint John Garry
2025-04-15 12:14 ` [PATCH v7 06/14] xfs: refactor xfs_reflink_end_cow_extent() John Garry
2025-04-15 12:14 ` [PATCH v7 07/14] xfs: refine atomic write size check in xfs_file_write_iter() John Garry
2025-04-15 12:14 ` [PATCH v7 08/14] xfs: add xfs_atomic_write_cow_iomap_begin() John Garry
2025-04-15 12:14 ` [PATCH v7 09/14] xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() John Garry
2025-04-15 17:34   ` Darrick J. Wong
2025-04-15 17:46     ` John Garry
2025-04-15 12:14 ` [PATCH v7 10/14] xfs: commit CoW-based atomic writes atomically John Garry
2025-04-15 12:14 ` [PATCH v7 11/14] xfs: add xfs_file_dio_write_atomic() John Garry
2025-04-21  4:00   ` Darrick J. Wong
2025-04-21  5:47     ` John Garry
2025-04-21 16:42       ` Darrick J. Wong
2025-04-23  5:42         ` Christoph Hellwig
2025-04-23  8:19           ` Christoph Hellwig
2025-04-23 14:51             ` Darrick J. Wong
2025-04-23 14:53           ` Darrick J. Wong
2025-04-21 21:18   ` Luis Chamberlain
2025-04-22  6:08     ` John Garry
2025-04-23  5:18       ` Luis Chamberlain
2025-04-23  7:08         ` John Garry
2025-04-23  7:36           ` Luis Chamberlain
2025-04-23  5:44       ` Christoph Hellwig
2025-04-23  7:02         ` John Garry
2025-04-15 12:14 ` [PATCH v7 12/14] xfs: add xfs_compute_atomic_write_unit_max() John Garry
2025-04-15 16:25   ` Darrick J. Wong
2025-04-15 16:35     ` John Garry
2025-04-15 16:39       ` Darrick J. Wong
2025-04-15 12:14 ` [PATCH v7 13/14] xfs: update atomic write limits John Garry
2025-04-15 16:26   ` Darrick J. Wong
2025-04-15 12:14 ` [PATCH v7 14/14] xfs: allow sysadmins to specify a maximum atomic write limit at mount time John Garry
2025-04-15 15:35   ` Randy Dunlap
2025-04-15 16:55   ` Darrick J. Wong
2025-04-15 22:36   ` [PATCH v7.1 " Darrick J. Wong
2025-04-16 10:08     ` John Garry
2025-04-16 16:26       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).