[PATCH v2 00/11] large atomic writes for xfs with CoW

public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 00/11] large atomic writes for xfs with CoW
@ 2025-02-13 13:56 John Garry
  2025-02-13 13:56 ` [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.

This series introduces a method to issue atomic writes via a software
emulated method.

The software emulated method is used as a fallback for when attempting to
issue an atomic write over misaligned or multiple extents.

The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.

Initial mysql performance testing has shown this method to perform ok.
However, there we are only using 16K atomic writes (and 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.

For other FSes which want large atomics writes and don't support CoW, I
think that they can follow the example in [0].

Based on 35010cc72acc (xfs/next-rc) xfs: flush inodegc before swapon

[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/

Differences to RFC:
- Rework CoW alloc method
- Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
- Rework transaction commit func args
- Chaneg resblks size for transaction commit
- Rename BMAPI extszhint align flag

John Garry (10):
  iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  xfs: Switch atomic write size check in xfs_file_write_iter()
  xfs: Refactor xfs_reflink_end_cow_extent()
  iomap: Support CoW-based atomic writes
  xfs: Reflink CoW-based atomic write support
  xfs: iomap CoW-based atomic write support
  xfs: Add xfs_file_dio_write_atomic()
  xfs: Commit CoW-based atomic writes atomically
  xfs: Update atomic write max size
  xfs: Allow block allocator to take an alignment hint

Ritesh Harjani (IBM) (1):
  iomap: Lift blocksize restriction on atomic writes

 .../filesystems/iomap/operations.rst          |  19 ++-
 fs/ext4/inode.c                               |   2 +-
 fs/iomap/direct-io.c                          |  20 +--
 fs/iomap/trace.h                              |   2 +-
 fs/xfs/libxfs/xfs_bmap.c                      |   7 +-
 fs/xfs/libxfs/xfs_bmap.h                      |   6 +-
 fs/xfs/xfs_file.c                             |  59 +++++++-
 fs/xfs/xfs_iomap.c                            |  74 +++++++++-
 fs/xfs/xfs_iops.c                             |  31 +++-
 fs/xfs/xfs_iops.h                             |   2 +
 fs/xfs/xfs_mount.c                            |  28 ++++
 fs/xfs/xfs_mount.h                            |   1 +
 fs/xfs/xfs_reflink.c                          | 138 +++++++++++++-----
 fs/xfs/xfs_reflink.h                          |   5 +-
 include/linux/iomap.h                         |   8 +-
 15 files changed, 331 insertions(+), 71 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:23   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

In future xfs will support a CoW-based atomic write, so rename
IOMAP_ATOMIC -> IOMAP_ATOMIC_HW to be clear which mode is being used.

Also relocate setting of IOMAP_ATOMIC_HW to the write path in
__iomap_dio_rw(), to be clear that this flag is only relevant to writes

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/filesystems/iomap/operations.rst |  4 ++--
 fs/ext4/inode.c                                |  2 +-
 fs/iomap/direct-io.c                           | 18 +++++++++---------
 fs/iomap/trace.h                               |  2 +-
 include/linux/iomap.h                          |  2 +-
 5 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 2c7f5df9d8b0..82bfe0e8c08e 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -513,8 +513,8 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    if the mapping is unwritten and the filesystem cannot handle zeroing
    the unaligned regions without exposing stale contents.
 
- * ``IOMAP_ATOMIC``: This write is being issued with torn-write
-   protection.
+ * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
+   protection based on HW-offload support.
    Only a single bio can be created for the write, and the write must
    not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
    set.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7c54ae5fcbd4..ba2f1e3db7c7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
 		return false;
 
 	/* atomic writes are all-or-nothing */
-	if (flags & IOMAP_ATOMIC)
+	if (flags & IOMAP_ATOMIC_HW)
 		return false;
 
 	/* can only try again if we wrote nothing */
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..f87c4277e738 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
  * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
-		const struct iomap *iomap, bool use_fua, bool atomic)
+		const struct iomap *iomap, bool use_fua, bool atomic_hw)
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
@@ -283,7 +283,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		opflags |= REQ_FUA;
 	else
 		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
-	if (atomic)
+	if (atomic_hw)
 		opflags |= REQ_ATOMIC;
 
 	return opflags;
@@ -295,8 +295,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
+	bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
 	const loff_t length = iomap_length(iter);
-	bool atomic = iter->flags & IOMAP_ATOMIC;
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
@@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic_hw && length != fs_block_size)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
@@ -383,7 +383,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 			goto out;
 	}
 
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
 
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
@@ -416,7 +416,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
-		if (WARN_ON_ONCE(atomic && n != length)) {
+		if (WARN_ON_ONCE(atomic_hw && n != length)) {
 			/*
 			 * This bio should have covered the complete length,
 			 * which it doesn't, so error. We may need to zero out
@@ -610,9 +610,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
-	if (iocb->ki_flags & IOCB_ATOMIC)
-		iomi.flags |= IOMAP_ATOMIC;
-
 	if (iov_iter_rw(iter) == READ) {
 		/* reads can always complete inline */
 		dio->flags |= IOMAP_DIO_INLINE_COMP;
@@ -647,6 +644,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			iomi.flags |= IOMAP_ATOMIC_HW;
+
 		/* for data sync or sync, we need sync completion processing */
 		if (iocb_is_dsync(iocb)) {
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 4118a42cdab0..0c73d91c0485 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -99,7 +99,7 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
 	{ IOMAP_NOWAIT,		"NOWAIT" }, \
-	{ IOMAP_ATOMIC,		"ATOMIC" }
+	{ IOMAP_ATOMIC_HW,	"ATOMIC_HW" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 75bf54e76f3b..e7aa05503763 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -182,7 +182,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC		(1 << 9)
+#define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
 
 struct iomap_ops {
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter()
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
  2025-02-13 13:56 ` [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:24   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, refactor xfs_get_atomic_write_attr()
to into a helper - xfs_report_atomic_write() - and use that helper to
find the per-inode atomic write limits and check according to that.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 +++++-------
 fs/xfs/xfs_iops.c | 20 +++++++++++++++++---
 fs/xfs/xfs_iops.h |  3 +++
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f7a7d89c345e..258c82cbce12 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -853,14 +853,12 @@ xfs_file_write_iter(
 		return xfs_file_dax_write(iocb, from);
 
 	if (iocb->ki_flags & IOCB_ATOMIC) {
-		/*
-		 * Currently only atomic writing of a single FS block is
-		 * supported. It would be possible to atomic write smaller than
-		 * a FS block, but there is no requirement to support this.
-		 * Note that iomap also does not support this yet.
-		 */
-		if (ocount != ip->i_mount->m_sb.sb_blocksize)
+		unsigned int	unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		if (ocount < unit_min || ocount > unit_max)
 			return -EINVAL;
+
 		ret = generic_atomic_write_valid(iocb, from);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 40289fe6f5b2..ea79fb246e33 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -600,15 +600,29 @@ xfs_report_dioalign(
 		stat->dio_offset_align = stat->dio_read_offset_align;
 }
 
+void
+xfs_get_atomic_write_attr(
+	struct xfs_inode	*ip,
+	unsigned int		*unit_min,
+	unsigned int		*unit_max)
+{
+	if (!xfs_inode_can_atomicwrite(ip)) {
+		*unit_min = *unit_max = 0;
+		return;
+	}
+
+	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+}
+
 static void
 xfs_report_atomic_write(
 	struct xfs_inode	*ip,
 	struct kstat		*stat)
 {
-	unsigned int		unit_min = 0, unit_max = 0;
+	unsigned int		unit_min, unit_max;
+
+	xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
 
-	if (xfs_inode_can_atomicwrite(ip))
-		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
 	generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
 }
 
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..ce7bdeb9a79c 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,8 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+void xfs_get_atomic_write_attr(struct xfs_inode *ip,
+		unsigned int *unit_min, unsigned int *unit_max);
+
 
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
  2025-02-13 13:56 ` [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
  2025-02-13 13:56 ` [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:26   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 04/11] iomap: Support CoW-based atomic writes John Garry
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 73 ++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 59f7fc16eb80..8428f7b26ee6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -786,35 +786,19 @@ xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
 	xfs_fileoff_t		end_fsb)
 {
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_irec	got, del, data;
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-	unsigned int		resblks;
 	int			nmaps;
 	bool			isrt = XFS_IS_REALTIME_INODE(ip);
 	int			error;
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-			XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	/*
-	 * Lock the inode.  We have to ijoin without automatic unlock because
-	 * the lead transaction is the refcountbt record deletion; the data
-	 * fork update follows as a deferred log item.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
@@ -823,7 +807,7 @@ xfs_reflink_end_cow_extent(
 	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
 	    got.br_startoff >= end_fsb) {
 		*offset_fsb = end_fsb;
-		goto out_cancel;
+		return 0;
 	}
 
 	/*
@@ -837,7 +821,7 @@ xfs_reflink_end_cow_extent(
 		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
 		    got.br_startoff >= end_fsb) {
 			*offset_fsb = end_fsb;
-			goto out_cancel;
+			return 0;
 		}
 	}
 	del = got;
@@ -846,14 +830,14 @@ xfs_reflink_end_cow_extent(
 	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
 			XFS_IEXT_REFLINK_END_COW_CNT);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;
 	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
 			&nmaps, 0);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* We can only remap the smaller of the two extent sizes. */
 	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -882,7 +866,7 @@ xfs_reflink_end_cow_extent(
 		error = xfs_bunmapi(NULL, ip, data.br_startoff,
 				data.br_blockcount, 0, 1, &done);
 		if (error)
-			goto out_cancel;
+			return error;
 		ASSERT(done);
 	}
 
@@ -899,17 +883,46 @@ xfs_reflink_end_cow_extent(
 	/* Remove the mapping from the CoW fork. */
 	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
 	/* Update the caller about how much progress we made. */
 	*offset_fsb = del.br_startoff + del.br_blockcount;
 	return 0;
+}
 
-out_cancel:
-	xfs_trans_cancel(tp);
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately.  Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	unsigned int		resblks;
+	int			error;
+
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+	if (error)
+		xfs_trans_cancel(tp);
+	else
+		error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 04/11] iomap: Support CoW-based atomic writes
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (2 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 19:59   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 05/11] iomap: Lift blocksize restriction on " John Garry
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently atomic write support requires dedicated HW support. This imposes
a restriction on the filesystem that disk blocks need to be aligned and
contiguously mapped to FS blocks to issue atomic writes.

XFS has no method to guarantee FS block alignment for regular non-RT files.
As such, atomic writes are currently limited to 1x FS block there.

To allow deal with the scenario that we are issuing an atomic write over
misaligned or discontiguous data blocks larger atomic writes - and raise
the atomic write limit - support a CoW-based software emulated atomic
write mode.

For this special mode, the FS will reserve blocks for that data to be
written and then atomically map that data in once the data has been
committed to disk.

It is the responsibility of the FS to detect discontiguous atomic writes
and switch to IOMAP_DIO_ATOMIC_COW mode and retry the write.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/filesystems/iomap/operations.rst | 15 +++++++++++++--
 fs/iomap/direct-io.c                           |  4 +++-
 include/linux/iomap.h                          |  6 ++++++
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 82bfe0e8c08e..d30dddc94ef7 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -525,8 +525,19 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    conversion or copy on write), all updates for the entire file range
    must be committed atomically as well.
    Only one space mapping is allowed per untorn write.
-   Untorn writes must be aligned to, and must not be longer than, a
-   single file block.
+   Untorn writes may be longer than a single file block. In all cases,
+   the mapping start disk block must have at least the same alignment as
+   the write offset.
+
+ * ``IOMAP_ATOMIC_COW``: This write is being issued with torn-write
+   protection based on CoW support.
+   All the length, alignment, and single bio restrictions which apply
+   to IOMAP_ATOMIC_HW do not apply here.
+   CoW-based atomic writes are intended as a fallback for when
+   HW-based atomic writes may not be issued, e.g. the range covered in
+   the atomic write covers multiple extents.
+   All filesystem metadata updates for the entire file range must be
+   committed atomically as well.
 
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f87c4277e738..076338397daa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -644,7 +644,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
-		if (iocb->ki_flags & IOCB_ATOMIC)
+		if (dio_flags & IOMAP_DIO_ATOMIC_COW)
+			iomi.flags |= IOMAP_ATOMIC_COW;
+		else if (iocb->ki_flags & IOCB_ATOMIC)
 			iomi.flags |= IOMAP_ATOMIC_HW;
 
 		/* for data sync or sync, we need sync completion processing */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e7aa05503763..1b961895678a 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -183,6 +183,7 @@ struct iomap_folio_ops {
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
 #define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
+#define IOMAP_ATOMIC_COW	(1 << 10)/* CoW-based torn-write protection */
 
 struct iomap_ops {
 	/*
@@ -434,6 +435,11 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * Use CoW-based software emulated torn-write protection.
+ */
+#define IOMAP_DIO_ATOMIC_COW		(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 05/11] iomap: Lift blocksize restriction on atomic writes
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (3 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 04/11] iomap: Support CoW-based atomic writes John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-13 13:56 ` [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support John Garry
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

Filesystems like ext4 can submit writes in multiples of blocksizes.
But we still can't allow the writes to be split. Hence let's check if
the iomap_length() is same as iter->len or not.

It is the role of the FS to ensure that a single mapping may be created
for an atomic write. The FS will also continue to check size and alignment
legality.

Signed-off-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
jpg: Tweak commit message
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 076338397daa..aeee84eccecf 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic_hw && length != fs_block_size)
+	if (atomic_hw && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (4 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 05/11] iomap: Lift blocksize restriction on " John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:32   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 07/11] xfs: iomap " John Garry
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

For CoW-based atomic write support, always allocate a cow hole in
xfs_reflink_allocate_cow() to write the new data.

The semantics is that if @atomic is set, we will be passed a CoW fork
extent mapping for no error returned.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c   |  2 +-
 fs/xfs/xfs_reflink.c | 12 +++++++-----
 fs/xfs/xfs_reflink.h |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index d61460309a78..ab79f0080288 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
 				&lockmode,
-				(flags & IOMAP_DIRECT) || IS_DAX(inode));
+				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
 		if (error)
 			goto out_unlock;
 		if (shared)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 8428f7b26ee6..3dab3ba900a3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	bool			convert_now,
+	bool			atomic)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
 	*lockmode = XFS_ILOCK_EXCL;
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !atomic))
 		goto out_trans_cancel;
 
 	if (found) {
@@ -566,7 +567,8 @@ xfs_reflink_allocate_cow(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	bool			convert_now,
+	bool 			atomic)
 {
 	int			error;
 	bool			found;
@@ -578,7 +580,7 @@ xfs_reflink_allocate_cow(
 	}
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !atomic))
 		return error;
 
 	/* CoW fork has a real extent */
@@ -592,7 +594,7 @@ xfs_reflink_allocate_cow(
 	 */
 	if (cmap->br_startoff > imap->br_startoff)
 		return xfs_reflink_fill_cow_hole(ip, imap, cmap, shared,
-				lockmode, convert_now);
+				lockmode, convert_now, atomic);
 
 	/*
 	 * CoW fork has a delalloc reservation. Replace it with a real extent.
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cc4e92278279..754d2bb692d3 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,7 +32,7 @@ int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 
 int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 		struct xfs_bmbt_irec *cmap, bool *shared, uint *lockmode,
-		bool convert_now);
+		bool convert_now, bool atomic);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 07/11] xfs: iomap CoW-based atomic write support
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (5 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:13   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic() John Garry
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

In cases of an atomic write occurs for misaligned or discontiguous disk
blocks, we will use a CoW-based method to issue the atomic write.

So, for that case, return -EAGAIN to request that the write be issued in
CoW atomic write mode. The dio write path should detect this, similar to
how misaligned regalar DIO writes are handled.

For normal HW-based mode, when the range which we are atomic writing to
covers a shared data extent, try to allocate a new CoW fork. However, if
we find that what we allocated does not meet atomic write requirements
in terms of length and alignment, then fallback on the CoW-based mode
for the atomic write.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ab79f0080288..c5ecfafbba60 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -795,6 +795,23 @@ imap_spans_range(
 	return true;
 }
 
+static bool
+imap_range_valid_for_atomic_write(
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	/* Misaligned start block wrt size */
+	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
+		return false;
+
+	/* Discontiguous or mixed extents */
+	if (!imap_spans_range(imap, offset_fsb, end_fsb))
+		return false;
+
+	return true;
+}
+
 static int
 xfs_direct_write_iomap_begin(
 	struct inode		*inode,
@@ -809,12 +826,20 @@ xfs_direct_write_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	bool			atomic_cow = flags & IOMAP_ATOMIC_COW;
+	bool			atomic_hw = flags & IOMAP_ATOMIC_HW;
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
 	u16			iomap_flags = 0;
+	xfs_fileoff_t		orig_offset_fsb;
+	xfs_fileoff_t		orig_end_fsb;
+	bool			needs_alloc;
 	unsigned int		lockmode;
 	u64			seq;
 
+	orig_offset_fsb = offset_fsb;
+	orig_end_fsb = end_fsb;
+
 	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
 
 	if (xfs_is_shutdown(mp))
@@ -832,7 +857,7 @@ xfs_direct_write_iomap_begin(
 	 * COW writes may allocate delalloc space or convert unwritten COW
 	 * extents, so we need to make sure to take the lock exclusively here.
 	 */
-	if (xfs_is_cow_inode(ip))
+	if (xfs_is_cow_inode(ip) || atomic_cow)
 		lockmode = XFS_ILOCK_EXCL;
 	else
 		lockmode = XFS_ILOCK_SHARED;
@@ -857,6 +882,22 @@ xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC_COW) {
+		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
+		/*
+		 * Don't check @shared. For atomic writes, we should error when
+		 * we don't get a CoW fork.
+		 */
+		if (error)
+			goto out_unlock;
+
+		end_fsb = imap.br_startoff + imap.br_blockcount;
+		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
+		goto out_found_cow;
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
@@ -868,13 +909,38 @@ xfs_direct_write_iomap_begin(
 				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
 		if (error)
 			goto out_unlock;
-		if (shared)
+		if (shared) {
+			if (atomic_hw &&
+			    !imap_range_valid_for_atomic_write(&cmap,
+					orig_offset_fsb, orig_end_fsb)) {
+				error = -EAGAIN;
+				goto out_unlock;
+			}
 			goto out_found_cow;
+		}
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if (imap_needs_alloc(inode, flags, &imap, nimaps))
+	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+	if (atomic_hw) {
+		error = -EAGAIN;
+		/*
+		 * Use CoW method for when we need to alloc > 1 block,
+		 * otherwise we might allocate less than what we need here and
+		 * have multiple mappings.
+		*/
+		if (needs_alloc && orig_end_fsb - orig_offset_fsb > 1)
+			goto out_unlock;
+
+		if (!imap_range_valid_for_atomic_write(&imap, orig_offset_fsb,
+						orig_end_fsb)) {
+			goto out_unlock;
+		}
+	}
+
+	if (needs_alloc)
 		goto allocate_blocks;
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic()
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (6 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 07/11] xfs: iomap " John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:32   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically John Garry
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
in CoW-based atomic write mode.

For CoW-based mode, ensure that we have no outstanding IOs which we
may trample on.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 258c82cbce12..9762fa503a41 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
 	return ret;
 }
 
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	unsigned int		dio_flags = 0;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock);
+	if (ret)
+		goto out_unlock;
+
+	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
+		inode_dio_wait(VFS_I(ip));
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
+			&xfs_dio_write_ops, dio_flags, NULL, 0);
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
+	    !(dio_flags & IOMAP_DIO_ATOMIC_COW)) {
+		xfs_iunlock(ip, iolock);
+		dio_flags = IOMAP_DIO_ATOMIC_COW | IOMAP_DIO_FORCE_WAIT;
+		iolock = XFS_IOLOCK_EXCL;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -723,6 +763,8 @@ xfs_file_dio_write(
 		return -EINVAL;
 	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (7 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:20   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 10/11] xfs: Update atomic write max size John Garry
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c    |  5 ++++-
 fs/xfs/xfs_reflink.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |  3 +++
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9762fa503a41..243640fe4874 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -527,7 +527,10 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 3dab3ba900a3..d097d33dc000 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -986,6 +986,51 @@ xfs_reflink_end_cow(
 		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+
+	resblks = (end_fsb - offset_fsb) *
+			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error)
+		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+							end_fsb);
+
+	if (error) {
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+		goto out_cancel;
+	}
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return 0;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
 
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 754d2bb692d3..ca945d98e823 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -43,6 +43,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+		int
+xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 10/11] xfs: Update atomic write max size
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (8 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:34   ` Darrick J. Wong
  2025-02-13 13:56 ` [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint John Garry
  2025-02-20  7:48 ` [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Now that CoW-based atomic writes are supported, update the max size of an
atomic write.

For simplicity, limit at the max of what the mounted bdev can support in
terms of atomic write limits. Maybe in future we will have a better way
to advertise this optimised limit.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

For RT inode, just limit to 1x block, even though larger can be supported
in future.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c  | 13 ++++++++++++-
 fs/xfs/xfs_iops.h  |  1 -
 fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h |  1 +
 4 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ea79fb246e33..d0a537696514 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -606,12 +606,23 @@ xfs_get_atomic_write_attr(
 	unsigned int		*unit_min,
 	unsigned int		*unit_max)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+
 	if (!xfs_inode_can_atomicwrite(ip)) {
 		*unit_min = *unit_max = 0;
 		return;
 	}
 
-	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+	*unit_min = ip->i_mount->m_sb.sb_blocksize;
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		/* For now, set limit at 1x block */
+		*unit_max = ip->i_mount->m_sb.sb_blocksize;
+	} else {
+		*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
+					target->bt_bdev_awu_max);
+	}
 }
 
 static void
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index ce7bdeb9a79c..d95a543f3ab0 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -22,5 +22,4 @@ extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
 void xfs_get_atomic_write_attr(struct xfs_inode *ip,
 		unsigned int *unit_min, unsigned int *unit_max);
 
-
 #endif /* __XFS_IOPS_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 477c5262cf91..af3ed135be4d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
 	levels = max(levels, mp->m_rmap_maxlevels);
 	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
 }
+static inline void
+xfs_compute_awu_max(
+	struct xfs_mount	*mp)
+{
+	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
+	xfs_agblock_t		awu_max;
+
+	if (!xfs_has_reflink(mp)) {
+		mp->awu_max = 1;
+		return;
+	}
+
+	/*
+	 * Find highest power-of-2 evenly divisible into agsize and which
+	 * also fits into an unsigned int field.
+	 */
+	awu_max = 1;
+	while (1) {
+		if (agsize % (awu_max * 2))
+			break;
+		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
+			break;
+		awu_max *= 2;
+	}
+	mp->awu_max = awu_max;
+}
 
 /* Compute maximum possible height for realtime btree types for this fs. */
 static inline void
@@ -736,6 +762,8 @@ xfs_mountfs(
 	xfs_agbtree_compute_maxlevels(mp);
 	xfs_rtbtree_compute_maxlevels(mp);
 
+	xfs_compute_awu_max(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fbed172d6770..34286c87ac4a 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -198,6 +198,7 @@ typedef struct xfs_mount {
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	bool			m_update_sb;	/* sb needs update in mount */
+	xfs_extlen_t		awu_max;	/* max atomic write */
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (9 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 10/11] xfs: Update atomic write max size John Garry
@ 2025-02-13 13:56 ` John Garry
  2025-02-24 20:37   ` Darrick J. Wong
  2025-02-20  7:48 ` [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
  11 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-13 13:56 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

When issuing an atomic write by the CoW method, give the block allocator a
hint to naturally align the data blocks.

This means that we have a better chance to issuing the atomic write via
HW offload next time.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
 fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
 fs/xfs/xfs_reflink.c     | 8 ++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0ef19f1469ec..9bfdfb7cdcae 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)
+		args->alignment = align;
+	else
+		args->alignment = 1;
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
@@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
 		.wasdel		= ap->wasdel,
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
-		.alignment	= 1,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 4b721d935994..7a31697331dc 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	(1u << 10)
 
+/* Try to naturally align allocations to extsz hint */
+#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
+	{ XFS_BMAPI_EXTSZALIGN,	"EXTSZALIGN" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d097d33dc000..033dff86ecf0 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
 	int			nimaps;
 	int			error;
 	bool			found;
+	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
+					      XFS_BMAPI_PREALLOC;
+
+	if (atomic)
+		bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
 
 	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
 		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
 	/* Allocate the entire reservation as unwritten blocks. */
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
-			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
-			&nimaps);
+			bmapi_flags, 0, cmap, &nimaps);
 	if (error)
 		goto out_trans_cancel;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 00/11] large atomic writes for xfs with CoW
  2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
                   ` (10 preceding siblings ...)
  2025-02-13 13:56 ` [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-02-20  7:48 ` John Garry
  11 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-20  7:48 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4

On 13/02/2025 13:56, John Garry wrote:
> Currently atomic write support for xfs is limited to writing a single
> block as we have no way to guarantee alignment and that the write covers
> a single extent.
> 
> This series introduces a method to issue atomic writes via a software
> emulated method.

A gentle ping on this one... thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 04/11] iomap: Support CoW-based atomic writes
  2025-02-13 13:56 ` [PATCH v2 04/11] iomap: Support CoW-based atomic writes John Garry
@ 2025-02-24 19:59   ` Darrick J. Wong
  2025-02-25 10:19     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 19:59 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:12PM +0000, John Garry wrote:
> Currently atomic write support requires dedicated HW support. This imposes
> a restriction on the filesystem that disk blocks need to be aligned and
> contiguously mapped to FS blocks to issue atomic writes.
> 
> XFS has no method to guarantee FS block alignment for regular non-RT files.
> As such, atomic writes are currently limited to 1x FS block there.
> 
> To allow deal with the scenario that we are issuing an atomic write over
> misaligned or discontiguous data blocks larger atomic writes - and raise
> the atomic write limit - support a CoW-based software emulated atomic
> write mode.
> 
> For this special mode, the FS will reserve blocks for that data to be
> written and then atomically map that data in once the data has been
> committed to disk.
> 
> It is the responsibility of the FS to detect discontiguous atomic writes
> and switch to IOMAP_DIO_ATOMIC_COW mode and retry the write.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/filesystems/iomap/operations.rst | 15 +++++++++++++--
>  fs/iomap/direct-io.c                           |  4 +++-
>  include/linux/iomap.h                          |  6 ++++++
>  3 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 82bfe0e8c08e..d30dddc94ef7 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -525,8 +525,19 @@ IOMAP_WRITE`` with any combination of the following enhancements:
>     conversion or copy on write), all updates for the entire file range
>     must be committed atomically as well.
>     Only one space mapping is allowed per untorn write.
> -   Untorn writes must be aligned to, and must not be longer than, a
> -   single file block.
> +   Untorn writes may be longer than a single file block. In all cases,
> +   the mapping start disk block must have at least the same alignment as
> +   the write offset.
> +
> + * ``IOMAP_ATOMIC_COW``: This write is being issued with torn-write
> +   protection based on CoW support.

I think using "COW" here results in a misnamed flag.  Consider:

"IOMAP_ATOMIC_SW: This write is being issued with torn-write protection
via a software fallback provided by the filesystem."

iomap itself doesn't care *how* the filesystem guarantees that the
direct write isn't torn, right?  The fs' io completion handler has to
ensure that the mapping update(s) are either applied fully or discarded
fully.

In theory if you had a bunch of physical space mapped to the same
file but with different unwritten states, you could gang together all
the unwritten extent conversions in a single transaction, which would
provide the necessary tearing prevention without the out of place write.
Nobody does that right now, but I think that's the only option for ext4.

--D

> +   All the length, alignment, and single bio restrictions which apply
> +   to IOMAP_ATOMIC_HW do not apply here.
> +   CoW-based atomic writes are intended as a fallback for when
> +   HW-based atomic writes may not be issued, e.g. the range covered in
> +   the atomic write covers multiple extents.
> +   All filesystem metadata updates for the entire file range must be
> +   committed atomically as well.
>  
>  Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
>  calling this function.
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f87c4277e738..076338397daa 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -644,7 +644,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iomi.flags |= IOMAP_OVERWRITE_ONLY;
>  		}
>  
> -		if (iocb->ki_flags & IOCB_ATOMIC)
> +		if (dio_flags & IOMAP_DIO_ATOMIC_COW)
> +			iomi.flags |= IOMAP_ATOMIC_COW;
> +		else if (iocb->ki_flags & IOCB_ATOMIC)
>  			iomi.flags |= IOMAP_ATOMIC_HW;
>  
>  		/* for data sync or sync, we need sync completion processing */
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e7aa05503763..1b961895678a 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -183,6 +183,7 @@ struct iomap_folio_ops {
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
>  #define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
> +#define IOMAP_ATOMIC_COW	(1 << 10)/* CoW-based torn-write protection */
>  
>  struct iomap_ops {
>  	/*
> @@ -434,6 +435,11 @@ struct iomap_dio_ops {
>   */
>  #define IOMAP_DIO_PARTIAL		(1 << 2)
>  
> +/*
> + * Use CoW-based software emulated torn-write protection.
> + */
> +#define IOMAP_DIO_ATOMIC_COW		(1 << 3)
> +
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
>  		unsigned int dio_flags, void *private, size_t done_before);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/11] xfs: iomap CoW-based atomic write support
  2025-02-13 13:56 ` [PATCH v2 07/11] xfs: iomap " John Garry
@ 2025-02-24 20:13   ` Darrick J. Wong
  2025-02-25 11:06     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:13 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:15PM +0000, John Garry wrote:
> In cases of an atomic write occurs for misaligned or discontiguous disk
> blocks, we will use a CoW-based method to issue the atomic write.
> 
> So, for that case, return -EAGAIN to request that the write be issued in
> CoW atomic write mode. The dio write path should detect this, similar to
> how misaligned regalar DIO writes are handled.
> 
> For normal HW-based mode, when the range which we are atomic writing to
> covers a shared data extent, try to allocate a new CoW fork. However, if
> we find that what we allocated does not meet atomic write requirements
> in terms of length and alignment, then fallback on the CoW-based mode
> for the atomic write.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iomap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ab79f0080288..c5ecfafbba60 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -795,6 +795,23 @@ imap_spans_range(
>  	return true;
>  }
>  
> +static bool
> +imap_range_valid_for_atomic_write(

xfs_bmap_valid_for_atomic_write()

> +	struct xfs_bmbt_irec	*imap,
> +	xfs_fileoff_t		offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	/* Misaligned start block wrt size */
> +	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
> +		return false;
> +
> +	/* Discontiguous or mixed extents */
> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
> +		return false;
> +
> +	return true;
> +}
> +
>  static int
>  xfs_direct_write_iomap_begin(
>  	struct inode		*inode,
> @@ -809,12 +826,20 @@ xfs_direct_write_iomap_begin(
>  	struct xfs_bmbt_irec	imap, cmap;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> +	bool			atomic_cow = flags & IOMAP_ATOMIC_COW;
> +	bool			atomic_hw = flags & IOMAP_ATOMIC_HW;
>  	int			nimaps = 1, error = 0;
>  	bool			shared = false;
>  	u16			iomap_flags = 0;
> +	xfs_fileoff_t		orig_offset_fsb;
> +	xfs_fileoff_t		orig_end_fsb;
> +	bool			needs_alloc;
>  	unsigned int		lockmode;
>  	u64			seq;
>  
> +	orig_offset_fsb = offset_fsb;

When does offset_fsb change?

> +	orig_end_fsb = end_fsb;

Set this in the variable declaration?

> +
>  	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
>  
>  	if (xfs_is_shutdown(mp))
> @@ -832,7 +857,7 @@ xfs_direct_write_iomap_begin(
>  	 * COW writes may allocate delalloc space or convert unwritten COW
>  	 * extents, so we need to make sure to take the lock exclusively here.
>  	 */
> -	if (xfs_is_cow_inode(ip))
> +	if (xfs_is_cow_inode(ip) || atomic_cow)
>  		lockmode = XFS_ILOCK_EXCL;
>  	else
>  		lockmode = XFS_ILOCK_SHARED;
> @@ -857,6 +882,22 @@ xfs_direct_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	if (flags & IOMAP_ATOMIC_COW) {

	if (atomic_cow) ?

Or really, atomic_sw?

> +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> +				&lockmode,
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
> +		/*
> +		 * Don't check @shared. For atomic writes, we should error when
> +		 * we don't get a CoW fork.

"Get a CoW fork"?  What does that mean?  The cow fork should be
automatically allocated when needed, right?  Or should this really read
"...when we don't get a COW mapping"?

> +		 */
> +		if (error)
> +			goto out_unlock;
> +
> +		end_fsb = imap.br_startoff + imap.br_blockcount;
> +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
> +		goto out_found_cow;
> +	}

Can this be a separate ->iomap_begin (and hence iomap ops)?  I am trying
to avoid the incohesion (still) plaguing most of the other iomap users.

> +
>  	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>  		error = -EAGAIN;
>  		if (flags & IOMAP_NOWAIT)
> @@ -868,13 +909,38 @@ xfs_direct_write_iomap_begin(
>  				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
>  		if (error)
>  			goto out_unlock;
> -		if (shared)
> +		if (shared) {
> +			if (atomic_hw &&
> +			    !imap_range_valid_for_atomic_write(&cmap,
> +					orig_offset_fsb, orig_end_fsb)) {
> +				error = -EAGAIN;
> +				goto out_unlock;
> +			}
>  			goto out_found_cow;
> +		}
>  		end_fsb = imap.br_startoff + imap.br_blockcount;
>  		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>  	}
>  
> -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
> +	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> +
> +	if (atomic_hw) {
> +		error = -EAGAIN;
> +		/*
> +		 * Use CoW method for when we need to alloc > 1 block,
> +		 * otherwise we might allocate less than what we need here and
> +		 * have multiple mappings.
> +		*/
> +		if (needs_alloc && orig_end_fsb - orig_offset_fsb > 1)
> +			goto out_unlock;
> +
> +		if (!imap_range_valid_for_atomic_write(&imap, orig_offset_fsb,
> +						orig_end_fsb)) {

You only need to indent by two more tabs for a continuation line.

--D

> +			goto out_unlock;
> +		}
> +	}
> +
> +	if (needs_alloc)
>  		goto allocate_blocks;
>  
>  	/*
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically
  2025-02-13 13:56 ` [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-02-24 20:20   ` Darrick J. Wong
  2025-02-25 11:11     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:20 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:17PM +0000, John Garry wrote:
> When completing a CoW-based write, each extent range mapping update is
> covered by a separate transaction.
> 
> For a CoW-based atomic write, all mappings must be changed at once, so
> change to use a single transaction.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c    |  5 ++++-
>  fs/xfs/xfs_reflink.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |  3 +++
>  3 files changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 9762fa503a41..243640fe4874 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
>  	nofs_flag = memalloc_nofs_save();
>  
>  	if (flags & IOMAP_DIO_COW) {
> -		error = xfs_reflink_end_cow(ip, offset, size);
> +		if (iocb->ki_flags & IOCB_ATOMIC)
> +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
> +		else
> +			error = xfs_reflink_end_cow(ip, offset, size);
>  		if (error)
>  			goto out;
>  	}
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 3dab3ba900a3..d097d33dc000 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -986,6 +986,51 @@ xfs_reflink_end_cow(
>  		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>  	return error;
>  }
> +int
> +xfs_reflink_end_atomic_cow(
> +	struct xfs_inode		*ip,
> +	xfs_off_t			offset,
> +	xfs_off_t			count)
> +{
> +	xfs_fileoff_t			offset_fsb;
> +	xfs_fileoff_t			end_fsb;
> +	int				error = 0;
> +	struct xfs_mount		*mp = ip->i_mount;
> +	struct xfs_trans		*tp;
> +	unsigned int			resblks;
> +
> +	trace_xfs_reflink_end_cow(ip, offset, count);
> +
> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);

Use @mp here instead of walking the pointer.

> +
> +	resblks = (end_fsb - offset_fsb) *
> +			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);

How did you arrive at this computation?  The "b" parameter to
XFS_NEXTENTADD_SPACE_RES is usually the worst case number of mappings
that you're going to change on this file.  I think that quantity is
(end_fsb - offset_fsb)?

> +
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> +			XFS_TRANS_RESERVE, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	while (end_fsb > offset_fsb && !error)
> +		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
> +							end_fsb);

Overly long line, and the continuation line only needs to be indented
two more tabs.

> +
> +	if (error) {
> +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> +		goto out_cancel;
> +	}
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return 0;

Why is it ok to drop @error here?  Shouldn't a transaction commit error
should be reported to the writer thread?

--D

> +out_cancel:
> +	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +}
>  
>  /*
>   * Free all CoW staging blocks that are still referenced by the ondisk refcount
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index 754d2bb692d3..ca945d98e823 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -43,6 +43,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count, bool cancel_real);
>  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
> +		int
> +xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t count);
>  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
>  extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
>  		struct file *file_out, loff_t pos_out, loff_t len,
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  2025-02-13 13:56 ` [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
@ 2025-02-24 20:23   ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:23 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:09PM +0000, John Garry wrote:
> In future xfs will support a CoW-based atomic write, so rename
> IOMAP_ATOMIC -> IOMAP_ATOMIC_HW to be clear which mode is being used.
> 
> Also relocate setting of IOMAP_ATOMIC_HW to the write path in
> __iomap_dio_rw(), to be clear that this flag is only relevant to writes
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks fine,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  Documentation/filesystems/iomap/operations.rst |  4 ++--
>  fs/ext4/inode.c                                |  2 +-
>  fs/iomap/direct-io.c                           | 18 +++++++++---------
>  fs/iomap/trace.h                               |  2 +-
>  include/linux/iomap.h                          |  2 +-
>  5 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index 2c7f5df9d8b0..82bfe0e8c08e 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -513,8 +513,8 @@ IOMAP_WRITE`` with any combination of the following enhancements:
>     if the mapping is unwritten and the filesystem cannot handle zeroing
>     the unaligned regions without exposing stale contents.
>  
> - * ``IOMAP_ATOMIC``: This write is being issued with torn-write
> -   protection.
> + * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
> +   protection based on HW-offload support.
>     Only a single bio can be created for the write, and the write must
>     not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
>     set.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7c54ae5fcbd4..ba2f1e3db7c7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>  		return false;
>  
>  	/* atomic writes are all-or-nothing */
> -	if (flags & IOMAP_ATOMIC)
> +	if (flags & IOMAP_ATOMIC_HW)
>  		return false;
>  
>  	/* can only try again if we wrote nothing */
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..f87c4277e738 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   * clearing the WRITE_THROUGH flag in the dio request.
>   */
>  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> -		const struct iomap *iomap, bool use_fua, bool atomic)
> +		const struct iomap *iomap, bool use_fua, bool atomic_hw)
>  {
>  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>  
> @@ -283,7 +283,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  		opflags |= REQ_FUA;
>  	else
>  		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> -	if (atomic)
> +	if (atomic_hw)
>  		opflags |= REQ_ATOMIC;
>  
>  	return opflags;
> @@ -295,8 +295,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
>  	const loff_t length = iomap_length(iter);
> -	bool atomic = iter->flags & IOMAP_ATOMIC;
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> -	if (atomic && length != fs_block_size)
> +	if (atomic_hw && length != fs_block_size)
>  		return -EINVAL;
>  
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> @@ -383,7 +383,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  			goto out;
>  	}
>  
> -	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
>  
>  	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>  	do {
> @@ -416,7 +416,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> -		if (WARN_ON_ONCE(atomic && n != length)) {
> +		if (WARN_ON_ONCE(atomic_hw && n != length)) {
>  			/*
>  			 * This bio should have covered the complete length,
>  			 * which it doesn't, so error. We may need to zero out
> @@ -610,9 +610,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
>  
> -	if (iocb->ki_flags & IOCB_ATOMIC)
> -		iomi.flags |= IOMAP_ATOMIC;
> -
>  	if (iov_iter_rw(iter) == READ) {
>  		/* reads can always complete inline */
>  		dio->flags |= IOMAP_DIO_INLINE_COMP;
> @@ -647,6 +644,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iomi.flags |= IOMAP_OVERWRITE_ONLY;
>  		}
>  
> +		if (iocb->ki_flags & IOCB_ATOMIC)
> +			iomi.flags |= IOMAP_ATOMIC_HW;
> +
>  		/* for data sync or sync, we need sync completion processing */
>  		if (iocb_is_dsync(iocb)) {
>  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 4118a42cdab0..0c73d91c0485 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -99,7 +99,7 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
>  	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> -	{ IOMAP_ATOMIC,		"ATOMIC" }
> +	{ IOMAP_ATOMIC_HW,	"ATOMIC_HW" }
>  
>  #define IOMAP_F_FLAGS_STRINGS \
>  	{ IOMAP_F_NEW,		"NEW" }, \
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 75bf54e76f3b..e7aa05503763 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -182,7 +182,7 @@ struct iomap_folio_ops {
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> -#define IOMAP_ATOMIC		(1 << 9)
> +#define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
>  
>  struct iomap_ops {
>  	/*
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter()
  2025-02-13 13:56 ` [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-02-24 20:24   ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:24 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:10PM +0000, John Garry wrote:
> Currently the size of atomic write allowed is fixed at the blocksize.
> 
> To start to lift this restriction, refactor xfs_get_atomic_write_attr()
> to into a helper - xfs_report_atomic_write() - and use that helper to
> find the per-inode atomic write limits and check according to that.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks fine,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_file.c | 12 +++++-------
>  fs/xfs/xfs_iops.c | 20 +++++++++++++++++---
>  fs/xfs/xfs_iops.h |  3 +++
>  3 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index f7a7d89c345e..258c82cbce12 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -853,14 +853,12 @@ xfs_file_write_iter(
>  		return xfs_file_dax_write(iocb, from);
>  
>  	if (iocb->ki_flags & IOCB_ATOMIC) {
> -		/*
> -		 * Currently only atomic writing of a single FS block is
> -		 * supported. It would be possible to atomic write smaller than
> -		 * a FS block, but there is no requirement to support this.
> -		 * Note that iomap also does not support this yet.
> -		 */
> -		if (ocount != ip->i_mount->m_sb.sb_blocksize)
> +		unsigned int	unit_min, unit_max;
> +
> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> +		if (ocount < unit_min || ocount > unit_max)
>  			return -EINVAL;
> +
>  		ret = generic_atomic_write_valid(iocb, from);
>  		if (ret)
>  			return ret;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 40289fe6f5b2..ea79fb246e33 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -600,15 +600,29 @@ xfs_report_dioalign(
>  		stat->dio_offset_align = stat->dio_read_offset_align;
>  }
>  
> +void
> +xfs_get_atomic_write_attr(
> +	struct xfs_inode	*ip,
> +	unsigned int		*unit_min,
> +	unsigned int		*unit_max)
> +{
> +	if (!xfs_inode_can_atomicwrite(ip)) {
> +		*unit_min = *unit_max = 0;
> +		return;
> +	}
> +
> +	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
> +}
> +
>  static void
>  xfs_report_atomic_write(
>  	struct xfs_inode	*ip,
>  	struct kstat		*stat)
>  {
> -	unsigned int		unit_min = 0, unit_max = 0;
> +	unsigned int		unit_min, unit_max;
> +
> +	xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>  
> -	if (xfs_inode_can_atomicwrite(ip))
> -		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
>  	generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
>  }
>  
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index 3c1a2605ffd2..ce7bdeb9a79c 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -19,5 +19,8 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
>  extern void xfs_setup_inode(struct xfs_inode *ip);
>  extern void xfs_setup_iops(struct xfs_inode *ip);
>  extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
> +void xfs_get_atomic_write_attr(struct xfs_inode *ip,
> +		unsigned int *unit_min, unsigned int *unit_max);
> +
>  
>  #endif /* __XFS_IOPS_H__ */
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-13 13:56 ` [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-02-24 20:26   ` Darrick J. Wong
  2025-02-25 10:01     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:26 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:11PM +0000, John Garry wrote:
> Refactor xfs_reflink_end_cow_extent() into separate parts which process
> the CoW range and commit the transaction.
> 
> This refactoring will be used in future for when it is required to commit
> a range of extents as a single transaction, similar to how it was done
> pre-commit d6f215f359637.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c | 73 ++++++++++++++++++++++++++------------------
>  1 file changed, 43 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 59f7fc16eb80..8428f7b26ee6 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -786,35 +786,19 @@ xfs_reflink_update_quota(
>   * requirements as low as possible.
>   */
>  STATIC int
> -xfs_reflink_end_cow_extent(
> +xfs_reflink_end_cow_extent_locked(
> +	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	xfs_fileoff_t		*offset_fsb,
>  	xfs_fileoff_t		end_fsb)
>  {
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_irec	got, del, data;
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_trans	*tp;
>  	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
> -	unsigned int		resblks;
>  	int			nmaps;
>  	bool			isrt = XFS_IS_REALTIME_INODE(ip);
>  	int			error;
>  
> -	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> -			XFS_TRANS_RESERVE, &tp);
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * Lock the inode.  We have to ijoin without automatic unlock because
> -	 * the lead transaction is the refcountbt record deletion; the data
> -	 * fork update follows as a deferred log item.
> -	 */
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> -
>  	/*
>  	 * In case of racing, overlapping AIO writes no COW extents might be
>  	 * left by the time I/O completes for the loser of the race.  In that
> @@ -823,7 +807,7 @@ xfs_reflink_end_cow_extent(
>  	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
>  	    got.br_startoff >= end_fsb) {
>  		*offset_fsb = end_fsb;
> -		goto out_cancel;
> +		return 0;
>  	}
>  
>  	/*
> @@ -837,7 +821,7 @@ xfs_reflink_end_cow_extent(
>  		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
>  		    got.br_startoff >= end_fsb) {
>  			*offset_fsb = end_fsb;
> -			goto out_cancel;
> +			return 0;
>  		}
>  	}
>  	del = got;
> @@ -846,14 +830,14 @@ xfs_reflink_end_cow_extent(
>  	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
>  			XFS_IEXT_REFLINK_END_COW_CNT);
>  	if (error)
> -		goto out_cancel;
> +		return error;
>  
>  	/* Grab the corresponding mapping in the data fork. */
>  	nmaps = 1;
>  	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
>  			&nmaps, 0);
>  	if (error)
> -		goto out_cancel;
> +		return error;
>  
>  	/* We can only remap the smaller of the two extent sizes. */
>  	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
> @@ -882,7 +866,7 @@ xfs_reflink_end_cow_extent(
>  		error = xfs_bunmapi(NULL, ip, data.br_startoff,
>  				data.br_blockcount, 0, 1, &done);
>  		if (error)
> -			goto out_cancel;
> +			return error;
>  		ASSERT(done);
>  	}
>  
> @@ -899,17 +883,46 @@ xfs_reflink_end_cow_extent(
>  	/* Remove the mapping from the CoW fork. */
>  	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
>  
> -	error = xfs_trans_commit(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	if (error)
> -		return error;
> -
>  	/* Update the caller about how much progress we made. */
>  	*offset_fsb = del.br_startoff + del.br_blockcount;
>  	return 0;
> +}
>  
> -out_cancel:
> -	xfs_trans_cancel(tp);
> +
> +/*
> + * Remap part of the CoW fork into the data fork.
> + *
> + * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
> + * into the data fork; this function will remap what it can (at the end of the
> + * range) and update @end_fsb appropriately.  Each remap gets its own
> + * transaction because we can end up merging and splitting bmbt blocks for
> + * every remap operation and we'd like to keep the block reservation
> + * requirements as low as possible.
> + */
> +STATIC int
> +xfs_reflink_end_cow_extent(
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		*offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	unsigned int		resblks;
> +	int			error;
> +
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> +			XFS_TRANS_RESERVE, &tp);
> +	if (error)
> +		return error;
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);

Overly long line, but otherwise looks fine.  With that fixed,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> +	if (error)
> +		xfs_trans_cancel(tp);
> +	else
> +		error = xfs_trans_commit(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support
  2025-02-13 13:56 ` [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-02-24 20:32   ` Darrick J. Wong
  2025-02-25 10:58     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:32 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:14PM +0000, John Garry wrote:
> For CoW-based atomic write support, always allocate a cow hole in
> xfs_reflink_allocate_cow() to write the new data.
> 
> The semantics is that if @atomic is set, we will be passed a CoW fork
> extent mapping for no error returned.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iomap.c   |  2 +-
>  fs/xfs/xfs_reflink.c | 12 +++++++-----
>  fs/xfs/xfs_reflink.h |  2 +-
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index d61460309a78..ab79f0080288 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
>  		/* may drop and re-acquire the ilock */
>  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>  				&lockmode,
> -				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);

Now I'm /really/ think it's time for some reflink allocation flags,
because the function signature now involves two booleans...

>  		if (error)
>  			goto out_unlock;
>  		if (shared)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 8428f7b26ee6..3dab3ba900a3 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
>  	struct xfs_bmbt_irec	*cmap,
>  	bool			*shared,
>  	uint			*lockmode,
> -	bool			convert_now)
> +	bool			convert_now,
> +	bool			atomic)

...but this can come later.  Also, is atomic==true only for the
ATOMIC_SW operation?  I think so, but that's the unfortunate thing about
booleans.

>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	struct xfs_trans	*tp;
> @@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
>  	*lockmode = XFS_ILOCK_EXCL;
>  
>  	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
> -	if (error || !*shared)
> +	if (error || (!*shared && !atomic))
>  		goto out_trans_cancel;
>  
>  	if (found) {
> @@ -566,7 +567,8 @@ xfs_reflink_allocate_cow(
>  	struct xfs_bmbt_irec	*cmap,
>  	bool			*shared,
>  	uint			*lockmode,
> -	bool			convert_now)
> +	bool			convert_now,
> +	bool 			atomic)

Nit:        ^ space before tab.

If the answer to the question above is 'yes' then with that nit fixed,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

>  {
>  	int			error;
>  	bool			found;
> @@ -578,7 +580,7 @@ xfs_reflink_allocate_cow(
>  	}
>  
>  	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
> -	if (error || !*shared)
> +	if (error || (!*shared && !atomic))
>  		return error;
>  
>  	/* CoW fork has a real extent */
> @@ -592,7 +594,7 @@ xfs_reflink_allocate_cow(
>  	 */
>  	if (cmap->br_startoff > imap->br_startoff)
>  		return xfs_reflink_fill_cow_hole(ip, imap, cmap, shared,
> -				lockmode, convert_now);
> +				lockmode, convert_now, atomic);
>  
>  	/*
>  	 * CoW fork has a delalloc reservation. Replace it with a real extent.
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index cc4e92278279..754d2bb692d3 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -32,7 +32,7 @@ int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
>  
>  int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
>  		struct xfs_bmbt_irec *cmap, bool *shared, uint *lockmode,
> -		bool convert_now);
> +		bool convert_now, bool atomic);
>  extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic()
  2025-02-13 13:56 ` [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-02-24 20:32   ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:32 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:16PM +0000, John Garry wrote:
> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> 
> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
> in CoW-based atomic write mode.
> 
> For CoW-based mode, ensure that we have no outstanding IOs which we
> may trample on.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Looks fine,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 258c82cbce12..9762fa503a41 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
>  	return ret;
>  }
>  
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	unsigned int		dio_flags = 0;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
> +		inode_dio_wait(VFS_I(ip));
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
> +
> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
> +	    !(dio_flags & IOMAP_DIO_ATOMIC_COW)) {
> +		xfs_iunlock(ip, iolock);
> +		dio_flags = IOMAP_DIO_ATOMIC_COW | IOMAP_DIO_FORCE_WAIT;
> +		iolock = XFS_IOLOCK_EXCL;
> +		goto retry;
> +	}
> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +
>  /*
>   * Handle block unaligned direct I/O writes
>   *
> @@ -723,6 +763,8 @@ xfs_file_dio_write(
>  		return -EINVAL;
>  	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 10/11] xfs: Update atomic write max size
  2025-02-13 13:56 ` [PATCH v2 10/11] xfs: Update atomic write max size John Garry
@ 2025-02-24 20:34   ` Darrick J. Wong
  2025-02-25 11:13     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:34 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:18PM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write.
> 
> For simplicity, limit at the max of what the mounted bdev can support in
> terms of atomic write limits. Maybe in future we will have a better way
> to advertise this optimised limit.
> 
> In addition, the max atomic write size needs to be aligned to the agsize.
> Limit the size of atomic writes to the greatest power-of-two factor of the
> agsize so that allocations for an atomic write will always be aligned
> compatibly with the alignment requirements of the storage.
> 
> For RT inode, just limit to 1x block, even though larger can be supported
> in future.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c  | 13 ++++++++++++-
>  fs/xfs/xfs_iops.h  |  1 -
>  fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h |  1 +
>  4 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ea79fb246e33..d0a537696514 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -606,12 +606,23 @@ xfs_get_atomic_write_attr(
>  	unsigned int		*unit_min,
>  	unsigned int		*unit_max)
>  {
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct xfs_mount	*mp = ip->i_mount;
> +
>  	if (!xfs_inode_can_atomicwrite(ip)) {
>  		*unit_min = *unit_max = 0;
>  		return;
>  	}
>  
> -	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
> +	*unit_min = ip->i_mount->m_sb.sb_blocksize;
> +
> +	if (XFS_IS_REALTIME_INODE(ip)) {
> +		/* For now, set limit at 1x block */
> +		*unit_max = ip->i_mount->m_sb.sb_blocksize;
> +	} else {
> +		*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
> +					target->bt_bdev_awu_max);
> +	}
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index ce7bdeb9a79c..d95a543f3ab0 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -22,5 +22,4 @@ extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
>  void xfs_get_atomic_write_attr(struct xfs_inode *ip,
>  		unsigned int *unit_min, unsigned int *unit_max);
>  
> -

No need to remove a blank line.

>  #endif /* __XFS_IOPS_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 477c5262cf91..af3ed135be4d 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
>  	levels = max(levels, mp->m_rmap_maxlevels);
>  	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>  }
> +static inline void
> +xfs_compute_awu_max(
> +	struct xfs_mount	*mp)
> +{
> +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
> +	xfs_agblock_t		awu_max;
> +
> +	if (!xfs_has_reflink(mp)) {
> +		mp->awu_max = 1;
> +		return;
> +	}
> +
> +	/*
> +	 * Find highest power-of-2 evenly divisible into agsize and which
> +	 * also fits into an unsigned int field.
> +	 */
> +	awu_max = 1;
> +	while (1) {
> +		if (agsize % (awu_max * 2))
> +			break;
> +		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
> +			break;
> +		awu_max *= 2;
> +	}
> +	mp->awu_max = awu_max;
> +}
>  
>  /* Compute maximum possible height for realtime btree types for this fs. */
>  static inline void
> @@ -736,6 +762,8 @@ xfs_mountfs(
>  	xfs_agbtree_compute_maxlevels(mp);
>  	xfs_rtbtree_compute_maxlevels(mp);
>  
> +	xfs_compute_awu_max(mp);
> +
>  	/*
>  	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
>  	 * is NOT aligned turn off m_dalign since allocator alignment is within
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index fbed172d6770..34286c87ac4a 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>  	bool			m_fail_unmount;
>  	bool			m_finobt_nores; /* no per-AG finobt resv. */
>  	bool			m_update_sb;	/* sb needs update in mount */
> +	xfs_extlen_t		awu_max;	/* max atomic write */

Might want to clarify that this is for the *data* device.

	/* max atomic write to datadev */

With those two things fixed,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

>  
>  	/*
>  	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint
  2025-02-13 13:56 ` [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-02-24 20:37   ` Darrick J. Wong
  2025-02-25 11:17     ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-24 20:37 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Thu, Feb 13, 2025 at 01:56:19PM +0000, John Garry wrote:
> When issuing an atomic write by the CoW method, give the block allocator a
> hint to naturally align the data blocks.
> 
> This means that we have a better chance to issuing the atomic write via
> HW offload next time.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>  fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>  fs/xfs/xfs_reflink.c     | 8 ++++++--
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 0ef19f1469ec..9bfdfb7cdcae 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)
> +		args->alignment = align;
> +	else
> +		args->alignment = 1;
> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
>  		.wasdel		= ap->wasdel,
>  		.resv		= XFS_AG_RESV_NONE,
>  		.datatype	= ap->datatype,
> -		.alignment	= 1,
>  		.minalignslop	= 0,
>  	};
>  	xfs_fileoff_t		orig_offset;
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 4b721d935994..7a31697331dc 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>  /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>  #define XFS_BMAPI_NORMAP	(1u << 10)
>  
> +/* Try to naturally align allocations to extsz hint */
> +#define XFS_BMAPI_EXTSZALIGN	(1u << 11)

IMO "naturally" makes things confusing here -- are we aligning to the
extent size hint, or are we aligning to the length requested?  Or
whatever it is that "naturally" means.

(FWIW you and I have bumped over this repeatedly, so maybe this is
simple one of those cognitive friction things where block storage always
deals with powers of two and "naturally" means a lot, vs. filesystems
where we don't usually enforce alignment anywhere.)

I suggest "Try to align allocations to the extent size hint" for the
comment, and with that:
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -98,7 +101,8 @@ struct xfs_bmalloca {
>  	{ XFS_BMAPI_REMAP,	"REMAP" }, \
>  	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
>  	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
> -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
> +	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
> +	{ XFS_BMAPI_EXTSZALIGN,	"EXTSZALIGN" }
>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d097d33dc000..033dff86ecf0 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
>  	int			nimaps;
>  	int			error;
>  	bool			found;
> +	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
> +					      XFS_BMAPI_PREALLOC;
> +
> +	if (atomic)
> +		bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
>  
>  	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
>  		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
> @@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
>  	/* Allocate the entire reservation as unwritten blocks. */
>  	nimaps = 1;
>  	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
> -			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
> -			&nimaps);
> +			bmapi_flags, 0, cmap, &nimaps);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-24 20:26   ` Darrick J. Wong
@ 2025-02-25 10:01     ` John Garry
  2025-02-25 17:29       ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-25 10:01 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 20:26, Darrick J. Wong wrote:
>> +	if (error)
>> +		return error;
>> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +	xfs_trans_ijoin(tp, ip, 0);
>> +
>> +	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
> Overly long line, but otherwise looks fine. 

The limit is 80, right? That line fills out to 80.

> With that fixed,
> Reviewed-by: "Darrick J. Wong"<djwong@kernel.org>
> 

cheers


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 04/11] iomap: Support CoW-based atomic writes
  2025-02-24 19:59   ` Darrick J. Wong
@ 2025-02-25 10:19     ` John Garry
  2025-02-25 17:33       ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-25 10:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 19:59, Darrick J. Wong wrote:
>> + * ``IOMAP_ATOMIC_COW``: This write is being issued with torn-write
>> +   protection based on CoW support.
> I think using "COW" here results in a misnamed flag.  Consider:
> 
> "IOMAP_ATOMIC_SW:

ok, fine

> This write is being issued with torn-write protection
> via a software fallback provided by the filesystem."

I'm not sure that we really need to even mention software fallback. 
Indeed, xfs could just use IOMAP_ATOMIC_SW always when the bdev does not 
support HW offload. Maybe I can mention that typically it can be used as 
a software fallback when HW offload is not possible.

> 
> iomap itself doesn't care*how* the filesystem guarantees that the
> direct write isn't torn, right? 

Correct. iomap just ensures that for IOMAP_ATOMIC_HW we produce a single 
bio - that's the only check really.

> The fs' io completion handler has to
> ensure that the mapping update(s) are either applied fully or discarded
> fully.

right

> 
> In theory if you had a bunch of physical space mapped to the same
> file but with different unwritten states, you could gang together all
> the unwritten extent conversions in a single transaction, which would
> provide the necessary tearing prevention without the out of place write.
> Nobody does that right now, but I think that's the only option for ext4.

ok, maybe. But ext4 still does have bigalloc or opportunity to support 
forcealign (to always use IOMAP_ATOMIC_HW for large untorn writes).

Thanks,
John




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support
  2025-02-24 20:32   ` Darrick J. Wong
@ 2025-02-25 10:58     ` John Garry
  2025-02-25 17:37       ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-25 10:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 20:32, Darrick J. Wong wrote:
> On Thu, Feb 13, 2025 at 01:56:14PM +0000, John Garry wrote:
>> For CoW-based atomic write support, always allocate a cow hole in
>> xfs_reflink_allocate_cow() to write the new data.
>>
>> The semantics is that if @atomic is set, we will be passed a CoW fork
>> extent mapping for no error returned.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iomap.c   |  2 +-
>>   fs/xfs/xfs_reflink.c | 12 +++++++-----
>>   fs/xfs/xfs_reflink.h |  2 +-
>>   3 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index d61460309a78..ab79f0080288 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
>>   		/* may drop and re-acquire the ilock */
>>   		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>>   				&lockmode,
>> -				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
> 
> Now I'm /really/ think it's time for some reflink allocation flags,
> because the function signature now involves two booleans...

ok, but the @convert_now arg is passed to other functions from 
xfs_reflink_allocate_cow() - so would you prefer to create a bool 
@convert_now inside xfs_reflink_allocate_cow() and pass that bool as 
before? Or pass the flags all the way down to end users of @convert_now?

> 
>>   		if (error)
>>   			goto out_unlock;
>>   		if (shared)
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 8428f7b26ee6..3dab3ba900a3 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
>>   	struct xfs_bmbt_irec	*cmap,
>>   	bool			*shared,
>>   	uint			*lockmode,
>> -	bool			convert_now)
>> +	bool			convert_now,
>> +	bool			atomic)
> 
> ...but this can come later. 

Do you mean that this would just be a new flag to set?

 > Also, is atomic==true only for the> ATOMIC_SW operation?

Right, so I think that the variable (or new flag) can be renamed for that.

> I think so, but that's the unfortunate thing about
> booleans.
> 
>>   {
>>   	struct xfs_mount	*mp = ip->i_mount;
>>   	struct xfs_trans	*tp;
>> @@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
>>   	*lockmode = XFS_ILOCK_EXCL;
>>   
>>   	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
>> -	if (error || !*shared)
>> +	if (error || (!*shared && !atomic))
>>   		goto out_trans_cancel;
>>   
>>   	if (found) {
>> @@ -566,7 +567,8 @@ xfs_reflink_allocate_cow(
>>   	struct xfs_bmbt_irec	*cmap,
>>   	bool			*shared,
>>   	uint			*lockmode,
>> -	bool			convert_now)
>> +	bool			convert_now,
>> +	bool 			atomic)
> 
> Nit:        ^ space before tab.

ok

> 
> If the answer to the question above is 'yes' then with that nit fixed,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

Thanks, but I will wait for your feedback to decide whether to pick that up.

John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/11] xfs: iomap CoW-based atomic write support
  2025-02-24 20:13   ` Darrick J. Wong
@ 2025-02-25 11:06     ` John Garry
  2025-02-25 17:47       ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-25 11:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 20:13, Darrick J. Wong wrote:
> On Thu, Feb 13, 2025 at 01:56:15PM +0000, John Garry wrote:
>> In cases of an atomic write occurs for misaligned or discontiguous disk
>> blocks, we will use a CoW-based method to issue the atomic write.
>>
>> So, for that case, return -EAGAIN to request that the write be issued in
>> CoW atomic write mode. The dio write path should detect this, similar to
>> how misaligned regalar DIO writes are handled.
>>
>> For normal HW-based mode, when the range which we are atomic writing to
>> covers a shared data extent, try to allocate a new CoW fork. However, if
>> we find that what we allocated does not meet atomic write requirements
>> in terms of length and alignment, then fallback on the CoW-based mode
>> for the atomic write.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iomap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 69 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ab79f0080288..c5ecfafbba60 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -795,6 +795,23 @@ imap_spans_range(
>>   	return true;
>>   }
>>   
>> +static bool
>> +imap_range_valid_for_atomic_write(
> 
> xfs_bmap_valid_for_atomic_write()

I'm ok with this.

But we do have other private functions without "xfs" prefix - like 
imap_needs_cow(), so a bit inconsistent to begin with.

> 
>> +	struct xfs_bmbt_irec	*imap,
>> +	xfs_fileoff_t		offset_fsb,
>> +	xfs_fileoff_t		end_fsb)
>> +{
>> +	/* Misaligned start block wrt size */
>> +	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
>> +		return false;
>> +
>> +	/* Discontiguous or mixed extents */
>> +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>>   static int
>>   xfs_direct_write_iomap_begin(
>>   	struct inode		*inode,
>> @@ -809,12 +826,20 @@ xfs_direct_write_iomap_begin(
>>   	struct xfs_bmbt_irec	imap, cmap;
>>   	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>>   	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
>> +	bool			atomic_cow = flags & IOMAP_ATOMIC_COW;
>> +	bool			atomic_hw = flags & IOMAP_ATOMIC_HW;
>>   	int			nimaps = 1, error = 0;
>>   	bool			shared = false;
>>   	u16			iomap_flags = 0;
>> +	xfs_fileoff_t		orig_offset_fsb;
>> +	xfs_fileoff_t		orig_end_fsb;
>> +	bool			needs_alloc;
>>   	unsigned int		lockmode;
>>   	u64			seq;
>>   
>> +	orig_offset_fsb = offset_fsb;
> 
> When does offset_fsb change?

It doesn't, so this is not really required.

> 
>> +	orig_end_fsb = end_fsb;
> 
> Set this in the variable declaration?

ok

> 
>> +
>>   	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
>>   
>>   	if (xfs_is_shutdown(mp))
>> @@ -832,7 +857,7 @@ xfs_direct_write_iomap_begin(
>>   	 * COW writes may allocate delalloc space or convert unwritten COW
>>   	 * extents, so we need to make sure to take the lock exclusively here.
>>   	 */
>> -	if (xfs_is_cow_inode(ip))
>> +	if (xfs_is_cow_inode(ip) || atomic_cow)
>>   		lockmode = XFS_ILOCK_EXCL;
>>   	else
>>   		lockmode = XFS_ILOCK_SHARED;
>> @@ -857,6 +882,22 @@ xfs_direct_write_iomap_begin(
>>   	if (error)
>>   		goto out_unlock;
>>   
>> +	if (flags & IOMAP_ATOMIC_COW) {
> 
> 	if (atomic_cow) ?
> 
> Or really, atomic_sw?

Yes, will change.

> 
>> +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>> +				&lockmode,
>> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
>> +		/*
>> +		 * Don't check @shared. For atomic writes, we should error when
>> +		 * we don't get a CoW fork.
> 
> "Get a CoW fork"?  What does that mean?  The cow fork should be
> automatically allocated when needed, right?  Or should this really read
> "...when we don't get a COW mapping"?

ok, I can change as you suggest

> 
>> +		 */
>> +		if (error)
>> +			goto out_unlock;
>> +
>> +		end_fsb = imap.br_startoff + imap.br_blockcount;
>> +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>> +		goto out_found_cow;
>> +	}
> 
> Can this be a separate ->iomap_begin (and hence iomap ops)?  I am trying
> to avoid the incohesion (still) plaguing most of the other iomap users.

I can try, and would then need to try to factor out what would be much 
duplicated code.

> 
>> +
>>   	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>>   		error = -EAGAIN;
>>   		if (flags & IOMAP_NOWAIT)
>> @@ -868,13 +909,38 @@ xfs_direct_write_iomap_begin(
>>   				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
>>   		if (error)
>>   			goto out_unlock;
>> -		if (shared)
>> +		if (shared) {
>> +			if (atomic_hw &&
>> +			    !imap_range_valid_for_atomic_write(&cmap,
>> +					orig_offset_fsb, orig_end_fsb)) {
>> +				error = -EAGAIN;
>> +				goto out_unlock;
>> +			}
>>   			goto out_found_cow;
>> +		}
>>   		end_fsb = imap.br_startoff + imap.br_blockcount;
>>   		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>>   	}
>>   
>> -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
>> +	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
>> +
>> +	if (atomic_hw) {
>> +		error = -EAGAIN;
>> +		/*
>> +		 * Use CoW method for when we need to alloc > 1 block,
>> +		 * otherwise we might allocate less than what we need here and
>> +		 * have multiple mappings.
>> +		*/
>> +		if (needs_alloc && orig_end_fsb - orig_offset_fsb > 1)
>> +			goto out_unlock;
>> +
>> +		if (!imap_range_valid_for_atomic_write(&imap, orig_offset_fsb,
>> +						orig_end_fsb)) {
> 
> You only need to indent by two more tabs for a continuation line.

ok

Thanks,
John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically
  2025-02-24 20:20   ` Darrick J. Wong
@ 2025-02-25 11:11     ` John Garry
  2025-02-25 17:50       ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: John Garry @ 2025-02-25 11:11 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 20:20, Darrick J. Wong wrote:
> On Thu, Feb 13, 2025 at 01:56:17PM +0000, John Garry wrote:
>> When completing a CoW-based write, each extent range mapping update is
>> covered by a separate transaction.
>>
>> For a CoW-based atomic write, all mappings must be changed at once, so
>> change to use a single transaction.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c    |  5 ++++-
>>   fs/xfs/xfs_reflink.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_reflink.h |  3 +++
>>   3 files changed, 52 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 9762fa503a41..243640fe4874 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
>>   	nofs_flag = memalloc_nofs_save();
>>   
>>   	if (flags & IOMAP_DIO_COW) {
>> -		error = xfs_reflink_end_cow(ip, offset, size);
>> +		if (iocb->ki_flags & IOCB_ATOMIC)
>> +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
>> +		else
>> +			error = xfs_reflink_end_cow(ip, offset, size);
>>   		if (error)
>>   			goto out;
>>   	}
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 3dab3ba900a3..d097d33dc000 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -986,6 +986,51 @@ xfs_reflink_end_cow(
>>   		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>>   	return error;
>>   }
>> +int
>> +xfs_reflink_end_atomic_cow(
>> +	struct xfs_inode		*ip,
>> +	xfs_off_t			offset,
>> +	xfs_off_t			count)
>> +{
>> +	xfs_fileoff_t			offset_fsb;
>> +	xfs_fileoff_t			end_fsb;
>> +	int				error = 0;
>> +	struct xfs_mount		*mp = ip->i_mount;
>> +	struct xfs_trans		*tp;
>> +	unsigned int			resblks;
>> +
>> +	trace_xfs_reflink_end_cow(ip, offset, count);
>> +
>> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
>> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> 
> Use @mp here instead of walking the pointer.

Yes

> 
>> +
>> +	resblks = (end_fsb - offset_fsb) *
>> +			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
> 
> How did you arrive at this computation? 

hmmm... you suggested this, but maybe I picked it up incorrectly :)

> The "b" parameter to
> XFS_NEXTENTADD_SPACE_RES is usually the worst case number of mappings
> that you're going to change on this file.  I think that quantity is
> (end_fsb - offset_fsb)?

Can you please check this versus what you suggested in 
https://lore.kernel.org/linux-xfs/20250206215014.GX21808@frogsfrogsfrogs/#t

> 
>> +
>> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
>> +			XFS_TRANS_RESERVE, &tp);
>> +	if (error)
>> +		return error;
>> +
>> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +	xfs_trans_ijoin(tp, ip, 0);
>> +
>> +	while (end_fsb > offset_fsb && !error)
>> +		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
>> +							end_fsb);
> 
> Overly long line, and the continuation line only needs to be indented
> two more tabs.

ok

> 
>> +
>> +	if (error) {
>> +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>> +		goto out_cancel;
>> +	}
>> +	error = xfs_trans_commit(tp);
>> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	return 0;
> 
> Why is it ok to drop @error here?  Shouldn't a transaction commit error
> should be reported to the writer thread?
>

I can fix that, as I should not ignore errors from xfs_trans_commit()

Thanks,
John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 10/11] xfs: Update atomic write max size
  2025-02-24 20:34   ` Darrick J. Wong
@ 2025-02-25 11:13     ` John Garry
  0 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-25 11:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4


>> @@ -22,5 +22,4 @@ extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
>>   void xfs_get_atomic_write_attr(struct xfs_inode *ip,
>>   		unsigned int *unit_min, unsigned int *unit_max);
>>   
>> -
> 
> No need to remove a blank line.


ok
> 
>>   #endif /* __XFS_IOPS_H__ */
>> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
>> index 477c5262cf91..af3ed135be4d 100644
>> --- a/fs/xfs/xfs_mount.c
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
>>   	levels = max(levels, mp->m_rmap_maxlevels);
>>   	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>>   }
>> +static inline void
>> +xfs_compute_awu_max(
>> +	struct xfs_mount	*mp)
>> +{
>> +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
>> +	xfs_agblock_t		awu_max;
>> +
>> +	if (!xfs_has_reflink(mp)) {
>> +		mp->awu_max = 1;
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Find highest power-of-2 evenly divisible into agsize and which
>> +	 * also fits into an unsigned int field.
>> +	 */
>> +	awu_max = 1;
>> +	while (1) {
>> +		if (agsize % (awu_max * 2))
>> +			break;
>> +		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
>> +			break;
>> +		awu_max *= 2;
>> +	}
>> +	mp->awu_max = awu_max;
>> +}
>>   
>>   /* Compute maximum possible height for realtime btree types for this fs. */
>>   static inline void
>> @@ -736,6 +762,8 @@ xfs_mountfs(
>>   	xfs_agbtree_compute_maxlevels(mp);
>>   	xfs_rtbtree_compute_maxlevels(mp);
>>   
>> +	xfs_compute_awu_max(mp);
>> +
>>   	/*
>>   	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
>>   	 * is NOT aligned turn off m_dalign since allocator alignment is within
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index fbed172d6770..34286c87ac4a 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>>   	bool			m_fail_unmount;
>>   	bool			m_finobt_nores; /* no per-AG finobt resv. */
>>   	bool			m_update_sb;	/* sb needs update in mount */
>> +	xfs_extlen_t		awu_max;	/* max atomic write */
> 
> Might want to clarify that this is for the *data* device.

sure

> 
> 	/* max atomic write to datadev */
> 
> With those two things fixed,
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

cheers

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint
  2025-02-24 20:37   ` Darrick J. Wong
@ 2025-02-25 11:17     ` John Garry
  0 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-25 11:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 24/02/2025 20:37, Darrick J. Wong wrote:
> On Thu, Feb 13, 2025 at 01:56:19PM +0000, John Garry wrote:
>> When issuing an atomic write by the CoW method, give the block allocator a
>> hint to naturally align the data blocks.
>>
>> This means that we have a better chance to issuing the atomic write via
>> HW offload next time.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>>   fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>>   fs/xfs/xfs_reflink.c     | 8 ++++++--
>>   3 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 0ef19f1469ec..9bfdfb7cdcae 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)
>> +		args->alignment = align;
>> +	else
>> +		args->alignment = 1;
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
>>   		.wasdel		= ap->wasdel,
>>   		.resv		= XFS_AG_RESV_NONE,
>>   		.datatype	= ap->datatype,
>> -		.alignment	= 1,
>>   		.minalignslop	= 0,
>>   	};
>>   	xfs_fileoff_t		orig_offset;
>> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
>> index 4b721d935994..7a31697331dc 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.h
>> +++ b/fs/xfs/libxfs/xfs_bmap.h
>> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>>   /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>>   #define XFS_BMAPI_NORMAP	(1u << 10)
>>   
>> +/* Try to naturally align allocations to extsz hint */
>> +#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
> 
> IMO "naturally" makes things confusing here -- are we aligning to the
> extent size hint, or are we aligning to the length requested?  Or
> whatever it is that "naturally" means.

We align to extsz hint, not length.

As for use of word "naturally", I'll try to avoid using that word.

> 
> (FWIW you and I have bumped over this repeatedly, so maybe this is
> simple one of those cognitive friction things where block storage always
> deals with powers of two and "naturally" means a lot, vs. filesystems
> where we don't usually enforce alignment anywhere.)
> 
> I suggest "Try to align allocations to the extent size hint" for the
> comment, and with that:

that's fine

> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 

cheers,
John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-25 10:01     ` John Garry
@ 2025-02-25 17:29       ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-25 17:29 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Tue, Feb 25, 2025 at 10:01:32AM +0000, John Garry wrote:
> On 24/02/2025 20:26, Darrick J. Wong wrote:
> > > +	if (error)
> > > +		return error;
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +
> > > +	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
> > Overly long line, but otherwise looks fine.
> 
> The limit is 80, right? That line fills out to 80.

Ah, right, forgot that I have vim set up to display the right margin at
72 for emails.

--D

> > With that fixed,
> > Reviewed-by: "Darrick J. Wong"<djwong@kernel.org>
> > 
> 
> cheers
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 04/11] iomap: Support CoW-based atomic writes
  2025-02-25 10:19     ` John Garry
@ 2025-02-25 17:33       ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-25 17:33 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Tue, Feb 25, 2025 at 10:19:49AM +0000, John Garry wrote:
> On 24/02/2025 19:59, Darrick J. Wong wrote:
> > > + * ``IOMAP_ATOMIC_COW``: This write is being issued with torn-write
> > > +   protection based on CoW support.
> > I think using "COW" here results in a misnamed flag.  Consider:
> > 
> > "IOMAP_ATOMIC_SW:
> 
> ok, fine
> 
> > This write is being issued with torn-write protection
> > via a software fallback provided by the filesystem."
> 
> I'm not sure that we really need to even mention software fallback. Indeed,
> xfs could just use IOMAP_ATOMIC_SW always when the bdev does not support HW
> offload. Maybe I can mention that typically it can be used as a software
> fallback when HW offload is not possible.

Ok, a software mechanism then.

> > 
> > iomap itself doesn't care*how* the filesystem guarantees that the
> > direct write isn't torn, right?
> 
> Correct. iomap just ensures that for IOMAP_ATOMIC_HW we produce a single bio
> - that's the only check really.
> 
> > The fs' io completion handler has to
> > ensure that the mapping update(s) are either applied fully or discarded
> > fully.
> 
> right
> 
> > 
> > In theory if you had a bunch of physical space mapped to the same
> > file but with different unwritten states, you could gang together all
> > the unwritten extent conversions in a single transaction, which would
> > provide the necessary tearing prevention without the out of place write.
> > Nobody does that right now, but I think that's the only option for ext4.
> 
> ok, maybe. But ext4 still does have bigalloc or opportunity to support
> forcealign (to always use IOMAP_ATOMIC_HW for large untorn writes).

<nod>

--D

> Thanks,
> John
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support
  2025-02-25 10:58     ` John Garry
@ 2025-02-25 17:37       ` Darrick J. Wong
  2025-02-25 18:02         ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-25 17:37 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Tue, Feb 25, 2025 at 10:58:56AM +0000, John Garry wrote:
> On 24/02/2025 20:32, Darrick J. Wong wrote:
> > On Thu, Feb 13, 2025 at 01:56:14PM +0000, John Garry wrote:
> > > For CoW-based atomic write support, always allocate a cow hole in
> > > xfs_reflink_allocate_cow() to write the new data.
> > > 
> > > The semantics is that if @atomic is set, we will be passed a CoW fork
> > > extent mapping for no error returned.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iomap.c   |  2 +-
> > >   fs/xfs/xfs_reflink.c | 12 +++++++-----
> > >   fs/xfs/xfs_reflink.h |  2 +-
> > >   3 files changed, 9 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index d61460309a78..ab79f0080288 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
> > >   		/* may drop and re-acquire the ilock */
> > >   		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > >   				&lockmode,
> > > -				(flags & IOMAP_DIRECT) || IS_DAX(inode));
> > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
> > 
> > Now I'm /really/ think it's time for some reflink allocation flags,
> > because the function signature now involves two booleans...
> 
> ok, but the @convert_now arg is passed to other functions from
> xfs_reflink_allocate_cow() - so would you prefer to create a bool
> @convert_now inside xfs_reflink_allocate_cow() and pass that bool as before?
> Or pass the flags all the way down to end users of @convert_now?
> 
> > 
> > >   		if (error)
> > >   			goto out_unlock;
> > >   		if (shared)
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 8428f7b26ee6..3dab3ba900a3 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
> > >   	struct xfs_bmbt_irec	*cmap,
> > >   	bool			*shared,
> > >   	uint			*lockmode,
> > > -	bool			convert_now)
> > > +	bool			convert_now,
> > > +	bool			atomic)
> > 
> > ...but this can come later.
> 
> Do you mean that this would just be a new flag to set?

Sorry, I meant that the double booleans -> flags conversion could be a
cleanup patch at the end of the series.  But first we'd have to figure
out where we want the flags boundaries to be -- do we just pass the
IOMAP_{DIRECT,DAX,ATOMIC_*} flags directly to the reflink code and let
it figure out what to do?  Or do we make the xfs_iomap.c code translate
that into XFS_REFLINK_ALLOC_* flags?

Either way, that is not something that needs to be done in this patch.

> > Also, is atomic==true only for the> ATOMIC_SW operation?
> 
> Right, so I think that the variable (or new flag) can be renamed for that.
> 
> > I think so, but that's the unfortunate thing about
> > booleans.
> > 
> > >   {
> > >   	struct xfs_mount	*mp = ip->i_mount;
> > >   	struct xfs_trans	*tp;
> > > @@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
> > >   	*lockmode = XFS_ILOCK_EXCL;
> > >   	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
> > > -	if (error || !*shared)
> > > +	if (error || (!*shared && !atomic))
> > >   		goto out_trans_cancel;
> > >   	if (found) {
> > > @@ -566,7 +567,8 @@ xfs_reflink_allocate_cow(
> > >   	struct xfs_bmbt_irec	*cmap,
> > >   	bool			*shared,
> > >   	uint			*lockmode,
> > > -	bool			convert_now)
> > > +	bool			convert_now,
> > > +	bool 			atomic)
> > 
> > Nit:        ^ space before tab.
> 
> ok
> 
> > 
> > If the answer to the question above is 'yes' then with that nit fixed,
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> 
> Thanks, but I will wait for your feedback to decide whether to pick that up.
> 
> John
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/11] xfs: iomap CoW-based atomic write support
  2025-02-25 11:06     ` John Garry
@ 2025-02-25 17:47       ` Darrick J. Wong
  2025-02-25 18:07         ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-25 17:47 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Tue, Feb 25, 2025 at 11:06:50AM +0000, John Garry wrote:
> On 24/02/2025 20:13, Darrick J. Wong wrote:
> > On Thu, Feb 13, 2025 at 01:56:15PM +0000, John Garry wrote:
> > > In cases of an atomic write occurs for misaligned or discontiguous disk
> > > blocks, we will use a CoW-based method to issue the atomic write.
> > > 
> > > So, for that case, return -EAGAIN to request that the write be issued in
> > > CoW atomic write mode. The dio write path should detect this, similar to
> > > how misaligned regalar DIO writes are handled.
> > > 
> > > For normal HW-based mode, when the range which we are atomic writing to
> > > covers a shared data extent, try to allocate a new CoW fork. However, if
> > > we find that what we allocated does not meet atomic write requirements
> > > in terms of length and alignment, then fallback on the CoW-based mode
> > > for the atomic write.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iomap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++--
> > >   1 file changed, 69 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index ab79f0080288..c5ecfafbba60 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -795,6 +795,23 @@ imap_spans_range(
> > >   	return true;
> > >   }
> > > +static bool
> > > +imap_range_valid_for_atomic_write(
> > 
> > xfs_bmap_valid_for_atomic_write()
> 
> I'm ok with this.
> 
> But we do have other private functions without "xfs" prefix - like
> imap_needs_cow(), so a bit inconsistent to begin with.

Yeah, others prefer shorter names but I try at least to maintain
consistent prefixes for namespacing.

> > 
> > > +	struct xfs_bmbt_irec	*imap,
> > > +	xfs_fileoff_t		offset_fsb,
> > > +	xfs_fileoff_t		end_fsb)
> > > +{
> > > +	/* Misaligned start block wrt size */
> > > +	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
> > > +		return false;
> > > +
> > > +	/* Discontiguous or mixed extents */
> > > +	if (!imap_spans_range(imap, offset_fsb, end_fsb))
> > > +		return false;
> > > +
> > > +	return true;
> > > +}
> > > +
> > >   static int
> > >   xfs_direct_write_iomap_begin(
> > >   	struct inode		*inode,
> > > @@ -809,12 +826,20 @@ xfs_direct_write_iomap_begin(
> > >   	struct xfs_bmbt_irec	imap, cmap;
> > >   	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > >   	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> > > +	bool			atomic_cow = flags & IOMAP_ATOMIC_COW;
> > > +	bool			atomic_hw = flags & IOMAP_ATOMIC_HW;
> > >   	int			nimaps = 1, error = 0;
> > >   	bool			shared = false;
> > >   	u16			iomap_flags = 0;
> > > +	xfs_fileoff_t		orig_offset_fsb;
> > > +	xfs_fileoff_t		orig_end_fsb;
> > > +	bool			needs_alloc;
> > >   	unsigned int		lockmode;
> > >   	u64			seq;
> > > +	orig_offset_fsb = offset_fsb;
> > 
> > When does offset_fsb change?
> 
> It doesn't, so this is not really required.
> 
> > 
> > > +	orig_end_fsb = end_fsb;
> > 
> > Set this in the variable declaration?
> 
> ok
> 
> > 
> > > +
> > >   	ASSERT(flags & (IOMAP_WRITE | IOMAP_ZERO));
> > >   	if (xfs_is_shutdown(mp))
> > > @@ -832,7 +857,7 @@ xfs_direct_write_iomap_begin(
> > >   	 * COW writes may allocate delalloc space or convert unwritten COW
> > >   	 * extents, so we need to make sure to take the lock exclusively here.
> > >   	 */
> > > -	if (xfs_is_cow_inode(ip))
> > > +	if (xfs_is_cow_inode(ip) || atomic_cow)
> > >   		lockmode = XFS_ILOCK_EXCL;
> > >   	else
> > >   		lockmode = XFS_ILOCK_SHARED;
> > > @@ -857,6 +882,22 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC_COW) {
> > 
> > 	if (atomic_cow) ?
> > 
> > Or really, atomic_sw?
> 
> Yes, will change.
> 
> > 
> > > +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > +				&lockmode,
> > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
> > > +		/*
> > > +		 * Don't check @shared. For atomic writes, we should error when
> > > +		 * we don't get a CoW fork.
> > 
> > "Get a CoW fork"?  What does that mean?  The cow fork should be
> > automatically allocated when needed, right?  Or should this really read
> > "...when we don't get a COW mapping"?
> 
> ok, I can change as you suggest
> 
> > 
> > > +		 */
> > > +		if (error)
> > > +			goto out_unlock;
> > > +
> > > +		end_fsb = imap.br_startoff + imap.br_blockcount;
> > > +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
> > > +		goto out_found_cow;
> > > +	}
> > 
> > Can this be a separate ->iomap_begin (and hence iomap ops)?  I am trying
> > to avoid the incohesion (still) plaguing most of the other iomap users.
> 
> I can try, and would then need to try to factor out what would be much
> duplicated code.

<nod> I think it's pretty straightforward:

xfs_direct_cow_write_iomap_begin()
{
	ASSERT(flags & IOMAP_WRITE);
	ASSERT(flags & IOMAP_DIRECT);
	ASSERT(flags & IOMAP_ATOMIC_SW);

	if (xfs_is_shutdown(mp))
		return -EIO;

	/*
	 * Writes that span EOF might trigger an IO size update on
	 * completion, so consider them to be dirty for the purposes of
	 * O_DSYNC even if there is no other metadata changes pending or
	 * have been made here.
	 */
	if (offset + length > i_size_read(inode))
		iomap_flags |= IOMAP_F_DIRTY;

	lockmode = XFS_ILOCK_EXCL;
	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
	if (error)
		return error;

	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
			&imap, &nimaps, 0);
	if (error)
		goto out_unlock;

	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
			&lockmode, true, true);
	if (error)
		goto out_unlock;

	endoff = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
	trace_xfs_iomap_found(ip, offset, endoff - offset, XFS_COW_FORK,
			&cmap);
	if (imap.br_startblock != HOLESTARTBLOCK) {
		seq = xfs_iomap_inode_sequence(ip, 0);
		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
				seq);
		if (error)
			goto out_unlock;
	}
	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
	xfs_iunlock(ip, lockmode);
	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
			IOMAP_F_SHARED, seq);

out_unlock:
	if (lockmode)
		xfs_iunlock(ip, lockmode);
	return error;
}

--D

> > > +
> > >   	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
> > >   		error = -EAGAIN;
> > >   		if (flags & IOMAP_NOWAIT)
> > > @@ -868,13 +909,38 @@ xfs_direct_write_iomap_begin(
> > >   				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
> > >   		if (error)
> > >   			goto out_unlock;
> > > -		if (shared)
> > > +		if (shared) {
> > > +			if (atomic_hw &&
> > > +			    !imap_range_valid_for_atomic_write(&cmap,
> > > +					orig_offset_fsb, orig_end_fsb)) {
> > > +				error = -EAGAIN;
> > > +				goto out_unlock;
> > > +			}
> > >   			goto out_found_cow;
> > > +		}
> > >   		end_fsb = imap.br_startoff + imap.br_blockcount;
> > >   		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
> > >   	}
> > > -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
> > > +	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> > > +
> > > +	if (atomic_hw) {
> > > +		error = -EAGAIN;
> > > +		/*
> > > +		 * Use CoW method for when we need to alloc > 1 block,
> > > +		 * otherwise we might allocate less than what we need here and
> > > +		 * have multiple mappings.
> > > +		*/
> > > +		if (needs_alloc && orig_end_fsb - orig_offset_fsb > 1)
> > > +			goto out_unlock;
> > > +
> > > +		if (!imap_range_valid_for_atomic_write(&imap, orig_offset_fsb,
> > > +						orig_end_fsb)) {
> > 
> > You only need to indent by two more tabs for a continuation line.
> 
> ok
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically
  2025-02-25 11:11     ` John Garry
@ 2025-02-25 17:50       ` Darrick J. Wong
  2025-02-25 18:07         ` John Garry
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-02-25 17:50 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Tue, Feb 25, 2025 at 11:11:45AM +0000, John Garry wrote:
> On 24/02/2025 20:20, Darrick J. Wong wrote:
> > On Thu, Feb 13, 2025 at 01:56:17PM +0000, John Garry wrote:
> > > When completing a CoW-based write, each extent range mapping update is
> > > covered by a separate transaction.
> > > 
> > > For a CoW-based atomic write, all mappings must be changed at once, so
> > > change to use a single transaction.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_file.c    |  5 ++++-
> > >   fs/xfs/xfs_reflink.c | 45 ++++++++++++++++++++++++++++++++++++++++++++
> > >   fs/xfs/xfs_reflink.h |  3 +++
> > >   3 files changed, 52 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 9762fa503a41..243640fe4874 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
> > >   	nofs_flag = memalloc_nofs_save();
> > >   	if (flags & IOMAP_DIO_COW) {
> > > -		error = xfs_reflink_end_cow(ip, offset, size);
> > > +		if (iocb->ki_flags & IOCB_ATOMIC)
> > > +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
> > > +		else
> > > +			error = xfs_reflink_end_cow(ip, offset, size);
> > >   		if (error)
> > >   			goto out;
> > >   	}
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 3dab3ba900a3..d097d33dc000 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -986,6 +986,51 @@ xfs_reflink_end_cow(
> > >   		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > >   	return error;
> > >   }
> > > +int
> > > +xfs_reflink_end_atomic_cow(
> > > +	struct xfs_inode		*ip,
> > > +	xfs_off_t			offset,
> > > +	xfs_off_t			count)
> > > +{
> > > +	xfs_fileoff_t			offset_fsb;
> > > +	xfs_fileoff_t			end_fsb;
> > > +	int				error = 0;
> > > +	struct xfs_mount		*mp = ip->i_mount;
> > > +	struct xfs_trans		*tp;
> > > +	unsigned int			resblks;
> > > +
> > > +	trace_xfs_reflink_end_cow(ip, offset, count);
> > > +
> > > +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > 
> > Use @mp here instead of walking the pointer.
> 
> Yes
> 
> > 
> > > +
> > > +	resblks = (end_fsb - offset_fsb) *
> > > +			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
> > 
> > How did you arrive at this computation?
> 
> hmmm... you suggested this, but maybe I picked it up incorrectly :)
> 
> > The "b" parameter to
> > XFS_NEXTENTADD_SPACE_RES is usually the worst case number of mappings
> > that you're going to change on this file.  I think that quantity is
> > (end_fsb - offset_fsb)?
> 
> Can you please check this versus what you suggested in
> https://lore.kernel.org/linux-xfs/20250206215014.GX21808@frogsfrogsfrogs/#t

Ah, yeah, that ^^ is correct.  This needs a better comment then:

	/*
	 * Each remapping operation could cause a btree split, so in
	 * the worst case that's one for each block.
	 */
	resblks = (end_fsb - offset_fsb) *
			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);

--D

> > 
> > > +
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> > > +			XFS_TRANS_RESERVE, &tp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +
> > > +	while (end_fsb > offset_fsb && !error)
> > > +		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
> > > +							end_fsb);
> > 
> > Overly long line, and the continuation line only needs to be indented
> > two more tabs.
> 
> ok
> 
> > 
> > > +
> > > +	if (error) {
> > > +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > > +		goto out_cancel;
> > > +	}
> > > +	error = xfs_trans_commit(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return 0;
> > 
> > Why is it ok to drop @error here?  Shouldn't a transaction commit error
> > should be reported to the writer thread?
> > 
> 
> I can fix that, as I should not ignore errors from xfs_trans_commit()
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support
  2025-02-25 17:37       ` Darrick J. Wong
@ 2025-02-25 18:02         ` John Garry
  0 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-25 18:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 25/02/2025 17:37, Darrick J. Wong wrote:
> On Tue, Feb 25, 2025 at 10:58:56AM +0000, John Garry wrote:
>> On 24/02/2025 20:32, Darrick J. Wong wrote:
>>> On Thu, Feb 13, 2025 at 01:56:14PM +0000, John Garry wrote:
>>>> For CoW-based atomic write support, always allocate a cow hole in
>>>> xfs_reflink_allocate_cow() to write the new data.
>>>>
>>>> The semantics is that if @atomic is set, we will be passed a CoW fork
>>>> extent mapping for no error returned.
>>>>
>>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>>> ---
>>>>    fs/xfs/xfs_iomap.c   |  2 +-
>>>>    fs/xfs/xfs_reflink.c | 12 +++++++-----
>>>>    fs/xfs/xfs_reflink.h |  2 +-
>>>>    3 files changed, 9 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>> index d61460309a78..ab79f0080288 100644
>>>> --- a/fs/xfs/xfs_iomap.c
>>>> +++ b/fs/xfs/xfs_iomap.c
>>>> @@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
>>>>    		/* may drop and re-acquire the ilock */
>>>>    		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>>>>    				&lockmode,
>>>> -				(flags & IOMAP_DIRECT) || IS_DAX(inode));
>>>> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
>>>
>>> Now I'm /really/ think it's time for some reflink allocation flags,
>>> because the function signature now involves two booleans...
>>
>> ok, but the @convert_now arg is passed to other functions from
>> xfs_reflink_allocate_cow() - so would you prefer to create a bool
>> @convert_now inside xfs_reflink_allocate_cow() and pass that bool as before?
>> Or pass the flags all the way down to end users of @convert_now?
>>
>>>
>>>>    		if (error)
>>>>    			goto out_unlock;
>>>>    		if (shared)
>>>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>>>> index 8428f7b26ee6..3dab3ba900a3 100644
>>>> --- a/fs/xfs/xfs_reflink.c
>>>> +++ b/fs/xfs/xfs_reflink.c
>>>> @@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
>>>>    	struct xfs_bmbt_irec	*cmap,
>>>>    	bool			*shared,
>>>>    	uint			*lockmode,
>>>> -	bool			convert_now)
>>>> +	bool			convert_now,
>>>> +	bool			atomic)
>>>
>>> ...but this can come later.
>>
>> Do you mean that this would just be a new flag to set?
> 
> Sorry, I meant that the double booleans -> flags conversion could be a
> cleanup patch at the end of the series.  But first we'd have to figure
> out where we want the flags boundaries to be -- do we just pass the
> IOMAP_{DIRECT,DAX,ATOMIC_*} flags directly to the reflink code and let
> it figure out what to do? 

We have the odd case of @convert_now being set from IS_DAX(inode) in 
xfs_direct_write_iomap_begin() -> xfs_reflink_allocate_cow(), so that 
thwarts the idea of passing the IOMAP flags directly. BTW, it may be 
possible to clear up that IS_DAX() usage - I'm not sure, so I'll check 
again.

> Or do we make the xfs_iomap.c code translate
> that into XFS_REFLINK_ALLOC_* flags?

That is what I was thinking of doing. But, as mentioned, it needs to be 
decided if we pass XFS_REFLINK_ALLOC_* to callees of 
xfs_reflink_allocate_cow(). I'm thinking 'no', as it will only create churn.

> 
> Either way, that is not something that needs to be done in this patch.

Sure


Thanks,
John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 07/11] xfs: iomap CoW-based atomic write support
  2025-02-25 17:47       ` Darrick J. Wong
@ 2025-02-25 18:07         ` John Garry
  0 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-25 18:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 25/02/2025 17:47, Darrick J. Wong wrote:
>> I can try, and would then need to try to factor out what would be much
>> duplicated code.
> <nod> I think it's pretty straightforward:

Yeah, I already had done sometime like this since.

> 
> xfs_direct_cow_write_iomap_begin()
> {
> 	ASSERT(flags & IOMAP_WRITE);
> 	ASSERT(flags & IOMAP_DIRECT);
> 	ASSERT(flags & IOMAP_ATOMIC_SW);
> 
> 	if (xfs_is_shutdown(mp))
> 		return -EIO;
> 
> 	/*
> 	 * Writes that span EOF might trigger an IO size update on
> 	 * completion, so consider them to be dirty for the purposes of
> 	 * O_DSYNC even if there is no other metadata changes pending or
> 	 * have been made here.
> 	 */
> 	if (offset + length > i_size_read(inode))
> 		iomap_flags |= IOMAP_F_DIRTY;
> 
> 	lockmode = XFS_ILOCK_EXCL;
> 	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
> 	if (error)
> 		return error;
> 
> 	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb,
> 			&imap, &nimaps, 0);
> 	if (error)
> 		goto out_unlock;
> 
> 	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> 			&lockmode, true, true);
> 	if (error)
> 		goto out_unlock;
> 
> 	endoff = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
> 	trace_xfs_iomap_found(ip, offset, endoff - offset, XFS_COW_FORK,
> 			&cmap);
> 	if (imap.br_startblock != HOLESTARTBLOCK) {

note: As you know, all this is common to xfs_direct_write_iomap_begin(), 
but unfortunately can't neatly be factored out due to the xfs_iunlock() 
calls.

> 		seq = xfs_iomap_inode_sequence(ip, 0);
> 		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0,
> 				seq);
> 		if (error)
> 			goto out_unlock;
> 	}
> 	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
> 	xfs_iunlock(ip, lockmode);
> 	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> 			IOMAP_F_SHARED, seq);
> 
> out_unlock:
> 	if (lockmode)
> 		xfs_iunlock(ip, lockmode);
> 	return error;
> }


Cheers,
John

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically
  2025-02-25 17:50       ` Darrick J. Wong
@ 2025-02-25 18:07         ` John Garry
  0 siblings, 0 replies; 38+ messages in thread
From: John Garry @ 2025-02-25 18:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On 25/02/2025 17:50, Darrick J. Wong wrote:
>> Can you please check this versus what you suggested in
>> https://lore.kernel.org/linux- 
>> xfs/20250206215014.GX21808@frogsfrogsfrogs/#t
> Ah, yeah, that ^^ is correct.  This needs a better comment then:
> 
> 	/*
> 	 * Each remapping operation could cause a btree split, so in
> 	 * the worst case that's one for each block.
> 	 */
> 	resblks = (end_fsb - offset_fsb) *
> 			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);

ok, fine.

Cheers

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-02-25 18:07 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-13 13:56 [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry
2025-02-13 13:56 ` [PATCH v2 01/11] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
2025-02-24 20:23   ` Darrick J. Wong
2025-02-13 13:56 ` [PATCH v2 02/11] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
2025-02-24 20:24   ` Darrick J. Wong
2025-02-13 13:56 ` [PATCH v2 03/11] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
2025-02-24 20:26   ` Darrick J. Wong
2025-02-25 10:01     ` John Garry
2025-02-25 17:29       ` Darrick J. Wong
2025-02-13 13:56 ` [PATCH v2 04/11] iomap: Support CoW-based atomic writes John Garry
2025-02-24 19:59   ` Darrick J. Wong
2025-02-25 10:19     ` John Garry
2025-02-25 17:33       ` Darrick J. Wong
2025-02-13 13:56 ` [PATCH v2 05/11] iomap: Lift blocksize restriction on " John Garry
2025-02-13 13:56 ` [PATCH v2 06/11] xfs: Reflink CoW-based atomic write support John Garry
2025-02-24 20:32   ` Darrick J. Wong
2025-02-25 10:58     ` John Garry
2025-02-25 17:37       ` Darrick J. Wong
2025-02-25 18:02         ` John Garry
2025-02-13 13:56 ` [PATCH v2 07/11] xfs: iomap " John Garry
2025-02-24 20:13   ` Darrick J. Wong
2025-02-25 11:06     ` John Garry
2025-02-25 17:47       ` Darrick J. Wong
2025-02-25 18:07         ` John Garry
2025-02-13 13:56 ` [PATCH v2 08/11] xfs: Add xfs_file_dio_write_atomic() John Garry
2025-02-24 20:32   ` Darrick J. Wong
2025-02-13 13:56 ` [PATCH v2 09/11] xfs: Commit CoW-based atomic writes atomically John Garry
2025-02-24 20:20   ` Darrick J. Wong
2025-02-25 11:11     ` John Garry
2025-02-25 17:50       ` Darrick J. Wong
2025-02-25 18:07         ` John Garry
2025-02-13 13:56 ` [PATCH v2 10/11] xfs: Update atomic write max size John Garry
2025-02-24 20:34   ` Darrick J. Wong
2025-02-25 11:13     ` John Garry
2025-02-13 13:56 ` [PATCH v2 11/11] xfs: Allow block allocator to take an alignment hint John Garry
2025-02-24 20:37   ` Darrick J. Wong
2025-02-25 11:17     ` John Garry
2025-02-20  7:48 ` [PATCH v2 00/11] large atomic writes for xfs with CoW John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox