[PATCH RFC 00/10] large atomic writes for xfs with CoW

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 00/10] large atomic writes for xfs with CoW
@ 2025-02-04 12:01 John Garry
  2025-02-04 12:01 ` [PATCH RFC 01/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.

This series introduces a method to issue atomic writes via a software
emulated method.

The software emulated method is used as a fallback for when attempting to
issue an atomic write over misaligned or multiple extents.

The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.

This is an RFC as I am not confident on the reflink changes at all. In
addition, I guess that the last change to provide a hint to the allocator
will not be liked.

Initial mysql performance testing has shown this method to perform ok.
However, there we are only using 16K atomic writes (and 4K block size),
so typically - and thankfully - this software fallback method won't be
used often.

For other FSes which want large atomics writes and don't support CoW, I
think that they can follow the example in [0].

Based on v6.14-rc1.

[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/

John Garry (10):
  xfs: Switch atomic write size check in xfs_file_write_iter()
  xfs: Refactor xfs_reflink_end_cow_extent()
  iomap: Support CoW-based atomic writes
  xfs: Make xfs_find_trim_cow_extent() public
  xfs: Reflink CoW-based atomic write support
  xfs: iomap CoW-based atomic write support
  xfs: Add xfs_file_dio_write_atomic()
  xfs: Commit CoW-based atomic writes atomically
  xfs: Update atomic write max size
  xfs: Allow block allocator to take an alignment hint

 fs/iomap/direct-io.c     |  23 ++++--
 fs/xfs/libxfs/xfs_bmap.c |   7 +-
 fs/xfs/libxfs/xfs_bmap.h |   6 +-
 fs/xfs/xfs_file.c        |  68 +++++++++++++++---
 fs/xfs/xfs_iomap.c       |  70 +++++++++++++++++-
 fs/xfs/xfs_iops.c        |  25 ++++++-
 fs/xfs/xfs_iops.h        |   3 +
 fs/xfs/xfs_mount.c       |  28 ++++++++
 fs/xfs/xfs_mount.h       |   1 +
 fs/xfs/xfs_reflink.c     | 149 +++++++++++++++++++++++++++++----------
 fs/xfs/xfs_reflink.h     |   7 +-
 include/linux/iomap.h    |   9 +++
 12 files changed, 333 insertions(+), 63 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH RFC 01/10] xfs: Switch atomic write size check in xfs_file_write_iter()
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-04 12:01 ` [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, refactor xfs_get_atomic_write_attr()
to into a helper - xfs_report_atomic_write() - and use that helper to
find the per-inode atomic write limits and check according to that.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 +++++-------
 fs/xfs/xfs_iops.c | 20 +++++++++++++++++---
 fs/xfs/xfs_iops.h |  3 +++
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f7a7d89c345e..fd05b66aea3f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -853,14 +853,12 @@ xfs_file_write_iter(
 		return xfs_file_dax_write(iocb, from);
 
 	if (iocb->ki_flags & IOCB_ATOMIC) {
-		/*
-		 * Currently only atomic writing of a single FS block is
-		 * supported. It would be possible to atomic write smaller than
-		 * a FS block, but there is no requirement to support this.
-		 * Note that iomap also does not support this yet.
-		 */
-		if (ocount != ip->i_mount->m_sb.sb_blocksize)
+		unsigned int unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		if (ocount < unit_min || ocount > unit_max)
 			return -EINVAL;
+
 		ret = generic_atomic_write_valid(iocb, from);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 40289fe6f5b2..ea79fb246e33 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -600,15 +600,29 @@ xfs_report_dioalign(
 		stat->dio_offset_align = stat->dio_read_offset_align;
 }
 
+void
+xfs_get_atomic_write_attr(
+	struct xfs_inode	*ip,
+	unsigned int		*unit_min,
+	unsigned int		*unit_max)
+{
+	if (!xfs_inode_can_atomicwrite(ip)) {
+		*unit_min = *unit_max = 0;
+		return;
+	}
+
+	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+}
+
 static void
 xfs_report_atomic_write(
 	struct xfs_inode	*ip,
 	struct kstat		*stat)
 {
-	unsigned int		unit_min = 0, unit_max = 0;
+	unsigned int		unit_min, unit_max;
+
+	xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
 
-	if (xfs_inode_can_atomicwrite(ip))
-		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
 	generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
 }
 
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..ce7bdeb9a79c 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,8 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+void xfs_get_atomic_write_attr(struct xfs_inode *ip,
+		unsigned int *unit_min, unsigned int *unit_max);
+
 
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
  2025-02-04 12:01 ` [PATCH RFC 01/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 19:50   ` Darrick J. Wong
  2025-02-04 12:01 ` [PATCH RFC 03/10] iomap: Support CoW-based atomic writes John Garry
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 79 +++++++++++++++++++++++++++-----------------
 1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 59f7fc16eb80..580469668334 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -786,35 +786,20 @@ xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
-	xfs_fileoff_t		end_fsb)
+	xfs_fileoff_t		end_fsb,
+	struct xfs_trans	*tp,
+	bool			*commit)
 {
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_irec	got, del, data;
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-	unsigned int		resblks;
 	int			nmaps;
 	bool			isrt = XFS_IS_REALTIME_INODE(ip);
 	int			error;
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-			XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	/*
-	 * Lock the inode.  We have to ijoin without automatic unlock because
-	 * the lead transaction is the refcountbt record deletion; the data
-	 * fork update follows as a deferred log item.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
@@ -823,7 +808,7 @@ xfs_reflink_end_cow_extent(
 	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
 	    got.br_startoff >= end_fsb) {
 		*offset_fsb = end_fsb;
-		goto out_cancel;
+		return 0;
 	}
 
 	/*
@@ -837,7 +822,7 @@ xfs_reflink_end_cow_extent(
 		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
 		    got.br_startoff >= end_fsb) {
 			*offset_fsb = end_fsb;
-			goto out_cancel;
+			return 0;
 		}
 	}
 	del = got;
@@ -846,14 +831,14 @@ xfs_reflink_end_cow_extent(
 	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
 			XFS_IEXT_REFLINK_END_COW_CNT);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;
 	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
 			&nmaps, 0);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* We can only remap the smaller of the two extent sizes. */
 	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -882,7 +867,7 @@ xfs_reflink_end_cow_extent(
 		error = xfs_bunmapi(NULL, ip, data.br_startoff,
 				data.br_blockcount, 0, 1, &done);
 		if (error)
-			goto out_cancel;
+			return error;
 		ASSERT(done);
 	}
 
@@ -899,17 +884,49 @@ xfs_reflink_end_cow_extent(
 	/* Remove the mapping from the CoW fork. */
 	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
 	/* Update the caller about how much progress we made. */
 	*offset_fsb = del.br_startoff + del.br_blockcount;
+	*commit = true;
 	return 0;
+}
 
-out_cancel:
-	xfs_trans_cancel(tp);
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately.  Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	unsigned int		resblks;
+	int			error;
+	bool			commit = false;
+
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_reflink_end_cow_extent_locked(ip, offset_fsb,
+					end_fsb, tp, &commit);
+	if (commit)
+		error = xfs_trans_commit(tp);
+	else
+		xfs_trans_cancel(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 03/10] iomap: Support CoW-based atomic writes
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
  2025-02-04 12:01 ` [PATCH RFC 01/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
  2025-02-04 12:01 ` [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 20:11   ` Darrick J. Wong
  2025-02-04 12:01 ` [PATCH RFC 04/10] xfs: Make xfs_find_trim_cow_extent() public John Garry
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Currently atomic write support requires dedicated HW support. This imposes
a restriction on the filesystem that disk blocks need to be aligned and
contiguously mapped to FS blocks to issue atomic writes.

XFS has no method to guarantee FS block alignment. As such, atomic writes
are currently limited to 1x FS block there.

To allow deal with the scenario that we are issuing an atomic write over
misaligned or discontiguous data blocks larger atomic writes - and raise
the atomic write limit - support a CoW-based software emulated atomic
write mode.

For this special mode, the FS will reserve blocks for that data to be
written and then atomically map that data in once the data has been
commited to disk.

It is the responsibility of the FS to detect discontiguous atomic writes
and switch to IOMAP_DIO_ATOMIC_COW mode and retry the write.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c  | 23 ++++++++++++++++-------
 include/linux/iomap.h |  9 +++++++++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 3dd883dd77d2..e63b5096bcd8 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
  * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
-		const struct iomap *iomap, bool use_fua, bool atomic)
+		const struct iomap *iomap, bool use_fua, bool atomic_bio)
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
@@ -283,7 +283,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		opflags |= REQ_FUA;
 	else
 		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
-	if (atomic)
+	if (atomic_bio)
 		opflags |= REQ_ATOMIC;
 
 	return opflags;
@@ -301,13 +301,19 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	blk_opf_t bio_opf;
 	struct bio *bio;
 	bool need_zeroout = false;
+	bool atomic_bio = false;
 	bool use_fua = false;
 	int nr_pages, ret = 0;
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != iter->len)
-		return -EINVAL;
+	if (atomic) {
+		if (!(iomap->flags & IOMAP_F_ATOMIC_COW)) {
+			if (length != iter->len)
+				return -EINVAL;
+			atomic_bio = true;
+		}
+	}
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
@@ -318,7 +324,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		need_zeroout = true;
 	}
 
-	if (iomap->flags & IOMAP_F_SHARED)
+	if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_ATOMIC_COW))
 		dio->flags |= IOMAP_DIO_COW;
 
 	if (iomap->flags & IOMAP_F_NEW) {
@@ -383,7 +389,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 			goto out;
 	}
 
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_bio);
 
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
@@ -416,7 +422,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
-		if (WARN_ON_ONCE(atomic && n != length)) {
+		if (WARN_ON_ONCE(atomic_bio && n != length)) {
 			/*
 			 * This bio should have covered the complete length,
 			 * which it doesn't, so error. We may need to zero out
@@ -639,6 +645,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
 			dio->flags |= IOMAP_DIO_CALLER_COMP;
 
+		if (dio_flags & IOMAP_DIO_ATOMIC_COW)
+			iomi.flags |= IOMAP_ATOMIC_COW;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 75bf54e76f3b..0a0b6798f517 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -56,6 +56,8 @@ struct vm_fault;
  *
  * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
  * never be merged with the mapping before it.
+ *
+ * IOMAP_F_ATOMIC_COW indicates that we require atomic CoW end IO handling.
  */
 #define IOMAP_F_NEW		(1U << 0)
 #define IOMAP_F_DIRTY		(1U << 1)
@@ -68,6 +70,7 @@ struct vm_fault;
 #endif /* CONFIG_BUFFER_HEAD */
 #define IOMAP_F_XATTR		(1U << 5)
 #define IOMAP_F_BOUNDARY	(1U << 6)
+#define IOMAP_F_ATOMIC_COW	(1U << 7)
 
 /*
  * Flags set by the core iomap code during operations:
@@ -183,6 +186,7 @@ struct iomap_folio_ops {
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
 #define IOMAP_ATOMIC		(1 << 9)
+#define IOMAP_ATOMIC_COW	(1 << 10)
 
 struct iomap_ops {
 	/*
@@ -434,6 +438,11 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * Use CoW-based software emulated atomic write.
+ */
+#define IOMAP_DIO_ATOMIC_COW		(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 04/10] xfs: Make xfs_find_trim_cow_extent() public
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (2 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 03/10] iomap: Support CoW-based atomic writes John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-04 12:01 ` [PATCH RFC 05/10] xfs: Reflink CoW-based atomic write support John Garry
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

This will be required for the iomap code to detect whether a range is
covered by a CoW extent.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 2 +-
 fs/xfs/xfs_reflink.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 580469668334..b28fb632b9e6 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -357,7 +357,7 @@ xfs_reflink_convert_cow(
  * is not shared we might have a preallocation for it in the COW fork. If so we
  * use it that rather than trigger a new allocation.
  */
-static int
+int
 xfs_find_trim_cow_extent(
 	struct xfs_inode	*ip,
 	struct xfs_bmbt_irec	*imap,
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cc4e92278279..a328b25e68da 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -35,6 +35,8 @@ int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 		bool convert_now);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+int xfs_find_trim_cow_extent(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
+	struct xfs_bmbt_irec *cmap, bool *shared, bool *found);
 
 extern int xfs_reflink_cancel_cow_blocks(struct xfs_inode *ip,
 		struct xfs_trans **tpp, xfs_fileoff_t offset_fsb,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 05/10] xfs: Reflink CoW-based atomic write support
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (3 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 04/10] xfs: Make xfs_find_trim_cow_extent() public John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-04 12:01 ` [PATCH RFC 06/10] xfs: iomap " John Garry
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

For CoW-based atomic write support, always allocate a cow hole in
xfs_reflink_allocate_cow() to write the new data.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c   |  2 +-
 fs/xfs/xfs_reflink.c | 12 +++++++-----
 fs/xfs/xfs_reflink.h |  2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 50fa3ef89f6c..ae3755ed00e6 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -865,7 +865,7 @@ xfs_direct_write_iomap_begin(
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
 				&lockmode,
-				(flags & IOMAP_DIRECT) || IS_DAX(inode));
+				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
 		if (error)
 			goto out_unlock;
 		if (shared)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index b28fb632b9e6..dbce333b60eb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -435,7 +435,8 @@ xfs_reflink_fill_cow_hole(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	bool			convert_now,
+	bool			atomic)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
 	*lockmode = XFS_ILOCK_EXCL;
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !atomic))
 		goto out_trans_cancel;
 
 	if (found) {
@@ -566,7 +567,8 @@ xfs_reflink_allocate_cow(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	bool			convert_now,
+	bool 			atomic)
 {
 	int			error;
 	bool			found;
@@ -578,7 +580,7 @@ xfs_reflink_allocate_cow(
 	}
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !atomic))
 		return error;
 
 	/* CoW fork has a real extent */
@@ -592,7 +594,7 @@ xfs_reflink_allocate_cow(
 	 */
 	if (cmap->br_startoff > imap->br_startoff)
 		return xfs_reflink_fill_cow_hole(ip, imap, cmap, shared,
-				lockmode, convert_now);
+				lockmode, convert_now, atomic);
 
 	/*
 	 * CoW fork has a delalloc reservation. Replace it with a real extent.
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index a328b25e68da..ef5c8b2398d8 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -32,7 +32,7 @@ int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 
 int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 		struct xfs_bmbt_irec *cmap, bool *shared, uint *lockmode,
-		bool convert_now);
+		bool convert_now, bool atomic);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 int xfs_find_trim_cow_extent(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 06/10] xfs: iomap CoW-based atomic write support
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (4 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 05/10] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 20:05   ` Darrick J. Wong
  2025-02-04 12:01 ` [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic() John Garry
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

In cases of an atomic write occurs for misaligned or discontiguous disk
blocks, we will use a CoW-based method to issue the atomic write.

So, for that case, return -EAGAIN to request that the write be issued in
CoW atomic write mode. The dio write path should detect this, similar to
how misaligned regalar DIO writes are handled.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ae3755ed00e6..2c2867d728e4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -809,9 +809,12 @@ xfs_direct_write_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	bool			atomic = flags & IOMAP_ATOMIC;
 	int			nimaps = 1, error = 0;
 	bool			shared = false;
+	bool			found = false;
 	u16			iomap_flags = 0;
+	bool			need_alloc;
 	unsigned int		lockmode;
 	u64			seq;
 
@@ -832,7 +835,7 @@ xfs_direct_write_iomap_begin(
 	 * COW writes may allocate delalloc space or convert unwritten COW
 	 * extents, so we need to make sure to take the lock exclusively here.
 	 */
-	if (xfs_is_cow_inode(ip))
+	if (xfs_is_cow_inode(ip) || atomic)
 		lockmode = XFS_ILOCK_EXCL;
 	else
 		lockmode = XFS_ILOCK_SHARED;
@@ -857,12 +860,73 @@ xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+
+	if (flags & IOMAP_ATOMIC_COW) {
+		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
+		if (error)
+			goto out_unlock;
+
+		end_fsb = imap.br_startoff + imap.br_blockcount;
+		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
+
+		if (imap.br_startblock != HOLESTARTBLOCK) {
+			seq = xfs_iomap_inode_sequence(ip, 0);
+
+			error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags,
+				iomap_flags | IOMAP_F_ATOMIC_COW, seq);
+			if (error)
+				goto out_unlock;
+		}
+		seq = xfs_iomap_inode_sequence(ip, 0);
+		xfs_iunlock(ip, lockmode);
+		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
+					iomap_flags | IOMAP_F_ATOMIC_COW, seq);
+	}
+
+	need_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+	if (atomic) {
+		/* Use CoW-based method if any of the following fail */
+		error = -EAGAIN;
+
+		/*
+		 * Lazily use CoW-based method for initial alloc of data.
+		 * Check br_blockcount for FSes which do not support atomic
+		 * writes > 1x block.
+		 */
+		if (need_alloc && imap.br_blockcount > 1)
+			goto out_unlock;
+
+		/* Misaligned start block wrt size */
+		if (!IS_ALIGNED(imap.br_startblock, imap.br_blockcount))
+			goto out_unlock;
+
+		/* Discontiguous or mixed extents */
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
+			goto out_unlock;
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
 			goto out_unlock;
 
+		if (atomic) {
+			/* Detect whether we're already covered in a cow fork */
+			error  = xfs_find_trim_cow_extent(ip, &imap, &cmap, &shared, &found);
+			if (error)
+				goto out_unlock;
+
+			if (shared) {
+				error = -EAGAIN;
+				goto out_unlock;
+			}
+		}
+
 		/* may drop and re-acquire the ilock */
+		shared = false;
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
 				&lockmode,
 				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
@@ -874,7 +938,7 @@ xfs_direct_write_iomap_begin(
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if (imap_needs_alloc(inode, flags, &imap, nimaps))
+	if (need_alloc)
 		goto allocate_blocks;
 
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic()
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (5 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 06/10] xfs: iomap " John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 19:55   ` Darrick J. Wong
  2025-02-10 16:59   ` John Garry
  2025-02-04 12:01 ` [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically John Garry
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
in CoW-based atomic write mode.

In the CoW-based atomic write mode, first unshare blocks so that we don't
have a cow fork for the data in the range which we are writing.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index fd05b66aea3f..12af5cdc3094 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -619,6 +619,55 @@ xfs_file_dio_write_aligned(
 	return ret;
 }
 
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	bool			use_cow = false;
+	unsigned int		dio_flags;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock);
+	if (ret)
+		goto out_unlock;
+
+	if (use_cow) {
+		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
+			iov_iter_count(from));
+		if (ret)
+			goto out_unlock;
+	}
+
+	trace_xfs_file_direct_write(iocb, from);
+	if (use_cow)
+		dio_flags = IOMAP_DIO_ATOMIC_COW;
+	else
+		dio_flags = 0;
+
+	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
+			&xfs_dio_write_ops, dio_flags, NULL, 0);
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) && !use_cow) {
+		xfs_iunlock(ip, iolock);
+		iolock = XFS_IOLOCK_EXCL;
+		use_cow = true;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -723,6 +772,8 @@ xfs_file_dio_write(
 		return -EINVAL;
 	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (6 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 19:47   ` Darrick J. Wong
  2025-02-04 12:01 ` [PATCH RFC 09/10] xfs: Update atomic write max size John Garry
  2025-02-04 12:01 ` [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint John Garry
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c    |  5 ++++-
 fs/xfs/xfs_reflink.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |  3 +++
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 12af5cdc3094..170d7891f90d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -527,7 +527,10 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index dbce333b60eb..60c986300faa 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -990,6 +990,54 @@ xfs_reflink_end_cow(
 		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+	bool				commit = false;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
+
+	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
+				(unsigned int)(end_fsb - offset_fsb),
+				XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error)
+		error = xfs_reflink_end_cow_extent_locked(ip, &offset_fsb,
+						end_fsb, tp, &commit);
+
+	if (error || !commit)
+		goto out_cancel;
+
+	if (error)
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
 
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index ef5c8b2398d8..2c3b096c1386 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -45,6 +45,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+		int
+xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 09/10] xfs: Update atomic write max size
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (7 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 19:41   ` Darrick J. Wong
  2025-02-04 12:01 ` [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint John Garry
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

Now that CoW-based atomic writes are supported, update the max size of an
atomic write.

For simplicity, limit at the max of what the mounted bdev can support in
terms of atomic write limits. Maybe in future we will have a better way
to advertise this optimised limit.

In addition, the max atomic write size needs to be aligned to the agsize.
Currently when attempting to use HW offload, we  just check that the
mapping startblock is aligned. However, that is just the startblock within
the AG, and the AG may not be properly aligned to the underlying block
device atomic write limits.

As such, limit atomic writes to the greatest power-of-2 which fits in an
AG, so that aligning to the startblock will be mean that we are also
aligned to the disk block.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c  |  7 ++++++-
 fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h |  1 +
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ea79fb246e33..95681d6c2bcd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -606,12 +606,17 @@ xfs_get_atomic_write_attr(
 	unsigned int		*unit_min,
 	unsigned int		*unit_max)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+
 	if (!xfs_inode_can_atomicwrite(ip)) {
 		*unit_min = *unit_max = 0;
 		return;
 	}
 
-	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+	*unit_min = ip->i_mount->m_sb.sb_blocksize;
+	*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
+					target->bt_bdev_awu_max);
 }
 
 static void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 477c5262cf91..4e60347f6b7e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
 	levels = max(levels, mp->m_rmap_maxlevels);
 	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
 }
+static inline void
+xfs_mp_compute_awu_max(
+	struct xfs_mount	*mp)
+{
+	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
+	xfs_agblock_t		awu_max;
+
+	if (!xfs_has_reflink(mp)) {
+		mp->awu_max = 1;
+		return;
+	}
+
+	/*
+	 * Find highest power-of-2 evenly divisible into agsize and which
+	 * also fits into an unsigned int field.
+	 */
+	awu_max = 1;
+	while (1) {
+		if (agsize % (awu_max * 2))
+			break;
+		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
+			break;
+		awu_max *= 2;
+	}
+	mp->awu_max = awu_max;
+}
 
 /* Compute maximum possible height for realtime btree types for this fs. */
 static inline void
@@ -736,6 +762,8 @@ xfs_mountfs(
 	xfs_agbtree_compute_maxlevels(mp);
 	xfs_rtbtree_compute_maxlevels(mp);
 
+	xfs_mp_compute_awu_max(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fbed172d6770..34286c87ac4a 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -198,6 +198,7 @@ typedef struct xfs_mount {
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	bool			m_update_sb;	/* sb needs update in mount */
+	xfs_extlen_t		awu_max;	/* max atomic write */
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint
  2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
                   ` (8 preceding siblings ...)
  2025-02-04 12:01 ` [PATCH RFC 09/10] xfs: Update atomic write max size John Garry
@ 2025-02-04 12:01 ` John Garry
  2025-02-05 19:20   ` Darrick J. Wong
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-04 12:01 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, John Garry

When issuing an atomic write by the CoW method, give the block allocator a
hint to naturally align the data blocks.

This means that we have a better chance to issuing the atomic write via
HW offload next time.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
 fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
 fs/xfs/xfs_reflink.c     | 8 ++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 40ad22fb808b..7a3910018dee 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	if (align > 1 && ap->flags & XFS_BMAPI_NALIGN)
+		args->alignment = align;
+	else
+		args->alignment = 1;
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
@@ -3781,7 +3787,6 @@ xfs_bmap_btalloc(
 		.wasdel		= ap->wasdel,
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
-		.alignment	= 1,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 4b721d935994..d68b594c3fa2 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	(1u << 10)
 
+/* Try to naturally align allocations */
+#define XFS_BMAPI_NALIGN	(1u << 11)
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
+	{ XFS_BMAPI_NALIGN,	"NALIGN" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 60c986300faa..198fb5372f10 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
 	int			nimaps;
 	int			error;
 	bool			found;
+	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
+					XFS_BMAPI_PREALLOC;
+
+	if (atomic)
+		bmapi_flags |= XFS_BMAPI_NALIGN;
 
 	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
 		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
 	/* Allocate the entire reservation as unwritten blocks. */
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
-			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
-			&nimaps);
+			bmapi_flags, 0, cmap, &nimaps);
 	if (error)
 		goto out_trans_cancel;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint
  2025-02-04 12:01 ` [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-02-05 19:20   ` Darrick J. Wong
  2025-02-06  8:10     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 19:20 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:27PM +0000, John Garry wrote:
> When issuing an atomic write by the CoW method, give the block allocator a
> hint to naturally align the data blocks.
> 
> This means that we have a better chance to issuing the atomic write via
> HW offload next time.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>  fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>  fs/xfs/xfs_reflink.c     | 8 ++++++--
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 40ad22fb808b..7a3910018dee 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	if (align > 1 && ap->flags & XFS_BMAPI_NALIGN)
> +		args->alignment = align;
> +	else
> +		args->alignment = 1;
> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3781,7 +3787,6 @@ xfs_bmap_btalloc(
>  		.wasdel		= ap->wasdel,
>  		.resv		= XFS_AG_RESV_NONE,
>  		.datatype	= ap->datatype,
> -		.alignment	= 1,
>  		.minalignslop	= 0,
>  	};
>  	xfs_fileoff_t		orig_offset;
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 4b721d935994..d68b594c3fa2 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>  /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>  #define XFS_BMAPI_NORMAP	(1u << 10)
>  
> +/* Try to naturally align allocations */
> +#define XFS_BMAPI_NALIGN	(1u << 11)
> +
>  #define XFS_BMAPI_FLAGS \
>  	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>  	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> @@ -98,7 +101,8 @@ struct xfs_bmalloca {
>  	{ XFS_BMAPI_REMAP,	"REMAP" }, \
>  	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
>  	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
> -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
> +	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
> +	{ XFS_BMAPI_NALIGN,	"NALIGN" }

Tihs isn't really "naturally" aligned, is it?  It really means "try to
align allocations to the extent size hint", which isn't required to be a
power of two.

--D

>  
>  
>  static inline int xfs_bmapi_aflag(int w)
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 60c986300faa..198fb5372f10 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
>  	int			nimaps;
>  	int			error;
>  	bool			found;
> +	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
> +					XFS_BMAPI_PREALLOC;
> +
> +	if (atomic)
> +		bmapi_flags |= XFS_BMAPI_NALIGN;
>  
>  	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
>  		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
> @@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
>  	/* Allocate the entire reservation as unwritten blocks. */
>  	nimaps = 1;
>  	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
> -			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
> -			&nimaps);
> +			bmapi_flags, 0, cmap, &nimaps);
>  	if (error)
>  		goto out_trans_cancel;
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 09/10] xfs: Update atomic write max size
  2025-02-04 12:01 ` [PATCH RFC 09/10] xfs: Update atomic write max size John Garry
@ 2025-02-05 19:41   ` Darrick J. Wong
  2025-02-06  9:15     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 19:41 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:26PM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write.
> 
> For simplicity, limit at the max of what the mounted bdev can support in
> terms of atomic write limits. Maybe in future we will have a better way
> to advertise this optimised limit.
> 
> In addition, the max atomic write size needs to be aligned to the agsize.
> Currently when attempting to use HW offload, we  just check that the
> mapping startblock is aligned. However, that is just the startblock within
> the AG, and the AG may not be properly aligned to the underlying block
> device atomic write limits.
> 
> As such, limit atomic writes to the greatest power-of-2 which fits in an
> AG, so that aligning to the startblock will be mean that we are also
> aligned to the disk block.

I don't understand this sentence -- what are we "aligning to the
startblock"?  I think you're saying that you want to limit the size of
untorn writes to the greatest power-of-two factor of the agsize so that
allocations for an untorn write will always be aligned compatibly with
the alignment requirements of the storage for an untorn write?

> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c  |  7 ++++++-
>  fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h |  1 +
>  3 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ea79fb246e33..95681d6c2bcd 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -606,12 +606,17 @@ xfs_get_atomic_write_attr(
>  	unsigned int		*unit_min,
>  	unsigned int		*unit_max)
>  {
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct xfs_mount	*mp = ip->i_mount;
> +
>  	if (!xfs_inode_can_atomicwrite(ip)) {
>  		*unit_min = *unit_max = 0;
>  		return;
>  	}
>  
> -	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
> +	*unit_min = ip->i_mount->m_sb.sb_blocksize;
> +	*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
> +					target->bt_bdev_awu_max);
>  }
>  
>  static void
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 477c5262cf91..4e60347f6b7e 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
>  	levels = max(levels, mp->m_rmap_maxlevels);
>  	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>  }
> +static inline void
> +xfs_mp_compute_awu_max(

xfs_compute_awu_max() ?

> +	struct xfs_mount	*mp)
> +{
> +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
> +	xfs_agblock_t		awu_max;
> +
> +	if (!xfs_has_reflink(mp)) {
> +		mp->awu_max = 1;
> +		return;
> +	}
> +
> +	/*
> +	 * Find highest power-of-2 evenly divisible into agsize and which
> +	 * also fits into an unsigned int field.
> +	 */
> +	awu_max = 1;
> +	while (1) {
> +		if (agsize % (awu_max * 2))
> +			break;
> +		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
> +			break;
> +		awu_max *= 2;
> +	}
> +	mp->awu_max = awu_max;

I think you need two awu_maxes here -- one for the data device, and
another for the realtime device.  The rt computation is probably more
complex since I think it's the greatest power of two that fits in the rt
extent size if it isn't a power of two; or the greatest power of two
that fits in the rtgroup if rtgroups are enabled; or probably just no
limit otherwise.

--D

> +}
>  
>  /* Compute maximum possible height for realtime btree types for this fs. */
>  static inline void
> @@ -736,6 +762,8 @@ xfs_mountfs(
>  	xfs_agbtree_compute_maxlevels(mp);
>  	xfs_rtbtree_compute_maxlevels(mp);
>  
> +	xfs_mp_compute_awu_max(mp);
> +
>  	/*
>  	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
>  	 * is NOT aligned turn off m_dalign since allocator alignment is within
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index fbed172d6770..34286c87ac4a 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>  	bool			m_fail_unmount;
>  	bool			m_finobt_nores; /* no per-AG finobt resv. */
>  	bool			m_update_sb;	/* sb needs update in mount */
> +	xfs_extlen_t		awu_max;	/* max atomic write */
>  
>  	/*
>  	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically
  2025-02-04 12:01 ` [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-02-05 19:47   ` Darrick J. Wong
  2025-02-06 10:27     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 19:47 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:25PM +0000, John Garry wrote:
> When completing a CoW-based write, each extent range mapping update is
> covered by a separate transaction.
> 
> For a CoW-based atomic write, all mappings must be changed at once, so
> change to use a single transaction.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c    |  5 ++++-
>  fs/xfs/xfs_reflink.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_reflink.h |  3 +++
>  3 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 12af5cdc3094..170d7891f90d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
>  	nofs_flag = memalloc_nofs_save();
>  
>  	if (flags & IOMAP_DIO_COW) {
> -		error = xfs_reflink_end_cow(ip, offset, size);
> +		if (iocb->ki_flags & IOCB_ATOMIC)
> +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
> +		else
> +			error = xfs_reflink_end_cow(ip, offset, size);
>  		if (error)
>  			goto out;
>  	}
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index dbce333b60eb..60c986300faa 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -990,6 +990,54 @@ xfs_reflink_end_cow(
>  		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>  	return error;
>  }
> +int
> +xfs_reflink_end_atomic_cow(
> +	struct xfs_inode		*ip,
> +	xfs_off_t			offset,
> +	xfs_off_t			count)
> +{
> +	xfs_fileoff_t			offset_fsb;
> +	xfs_fileoff_t			end_fsb;
> +	int				error = 0;
> +	struct xfs_mount		*mp = ip->i_mount;
> +	struct xfs_trans		*tp;
> +	unsigned int			resblks;
> +	bool				commit = false;
> +
> +	trace_xfs_reflink_end_cow(ip, offset, count);
> +
> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> +
> +	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
> +				(unsigned int)(end_fsb - offset_fsb),
> +				XFS_DATA_FORK);
> +
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,

xfs gained reflink support for realtime volumes in 6.14-rc1, so you now
have to calculate for that in here too.

> +			XFS_TRANS_RESERVE, &tp);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	while (end_fsb > offset_fsb && !error)
> +		error = xfs_reflink_end_cow_extent_locked(ip, &offset_fsb,
> +						end_fsb, tp, &commit);

Hmm.  Attaching intent items to a transaction consumes space in that
transaction, so we probably ought to limit the amount that we try to do
here.  Do you know what that limit is?  I don't, but it's roughly
tr_logres divided by the average size of a log intent item.

This means we need to restrict the size of an untorn write to a
double-digit number of fsblocks for safety.

The logic in here looks reasonable though.

--D

> +
> +	if (error || !commit)
> +		goto out_cancel;
> +
> +	if (error)
> +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> +	error = xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +out_cancel:
> +	xfs_trans_cancel(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +}
>  
>  /*
>   * Free all CoW staging blocks that are still referenced by the ondisk refcount
> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> index ef5c8b2398d8..2c3b096c1386 100644
> --- a/fs/xfs/xfs_reflink.h
> +++ b/fs/xfs/xfs_reflink.h
> @@ -45,6 +45,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count, bool cancel_real);
>  extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
>  		xfs_off_t count);
> +		int
> +xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
> +		xfs_off_t count);
>  extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
>  extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
>  		struct file *file_out, loff_t pos_out, loff_t len,
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-04 12:01 ` [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-02-05 19:50   ` Darrick J. Wong
  2025-02-06 10:35     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 19:50 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:19PM +0000, John Garry wrote:
> Refactor xfs_reflink_end_cow_extent() into separate parts which process
> the CoW range and commit the transaction.
> 
> This refactoring will be used in future for when it is required to commit
> a range of extents as a single transaction, similar to how it was done
> pre-commit d6f215f359637.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_reflink.c | 79 +++++++++++++++++++++++++++-----------------
>  1 file changed, 48 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 59f7fc16eb80..580469668334 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -786,35 +786,20 @@ xfs_reflink_update_quota(
>   * requirements as low as possible.
>   */
>  STATIC int
> -xfs_reflink_end_cow_extent(
> +xfs_reflink_end_cow_extent_locked(
>  	struct xfs_inode	*ip,
>  	xfs_fileoff_t		*offset_fsb,
> -	xfs_fileoff_t		end_fsb)
> +	xfs_fileoff_t		end_fsb,
> +	struct xfs_trans	*tp,
> +	bool			*commit)

Transactions usually come before the inode in the parameter list.

You don't need to pass out a @commit flag -- if the function returns
nonzero then the caller has to cancel the transaction; otherwise it will
return zero and the caller should commit it.  There's no penalty for
committing a non-dirty transaction.

--D

>  {
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_irec	got, del, data;
> -	struct xfs_mount	*mp = ip->i_mount;
> -	struct xfs_trans	*tp;
>  	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
> -	unsigned int		resblks;
>  	int			nmaps;
>  	bool			isrt = XFS_IS_REALTIME_INODE(ip);
>  	int			error;
>  
> -	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> -			XFS_TRANS_RESERVE, &tp);
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * Lock the inode.  We have to ijoin without automatic unlock because
> -	 * the lead transaction is the refcountbt record deletion; the data
> -	 * fork update follows as a deferred log item.
> -	 */
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -	xfs_trans_ijoin(tp, ip, 0);
> -
>  	/*
>  	 * In case of racing, overlapping AIO writes no COW extents might be
>  	 * left by the time I/O completes for the loser of the race.  In that
> @@ -823,7 +808,7 @@ xfs_reflink_end_cow_extent(
>  	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
>  	    got.br_startoff >= end_fsb) {
>  		*offset_fsb = end_fsb;
> -		goto out_cancel;
> +		return 0;
>  	}
>  
>  	/*
> @@ -837,7 +822,7 @@ xfs_reflink_end_cow_extent(
>  		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
>  		    got.br_startoff >= end_fsb) {
>  			*offset_fsb = end_fsb;
> -			goto out_cancel;
> +			return 0;
>  		}
>  	}
>  	del = got;
> @@ -846,14 +831,14 @@ xfs_reflink_end_cow_extent(
>  	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
>  			XFS_IEXT_REFLINK_END_COW_CNT);
>  	if (error)
> -		goto out_cancel;
> +		return error;
>  
>  	/* Grab the corresponding mapping in the data fork. */
>  	nmaps = 1;
>  	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
>  			&nmaps, 0);
>  	if (error)
> -		goto out_cancel;
> +		return error;
>  
>  	/* We can only remap the smaller of the two extent sizes. */
>  	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
> @@ -882,7 +867,7 @@ xfs_reflink_end_cow_extent(
>  		error = xfs_bunmapi(NULL, ip, data.br_startoff,
>  				data.br_blockcount, 0, 1, &done);
>  		if (error)
> -			goto out_cancel;
> +			return error;
>  		ASSERT(done);
>  	}
>  
> @@ -899,17 +884,49 @@ xfs_reflink_end_cow_extent(
>  	/* Remove the mapping from the CoW fork. */
>  	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
>  
> -	error = xfs_trans_commit(tp);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	if (error)
> -		return error;
> -
>  	/* Update the caller about how much progress we made. */
>  	*offset_fsb = del.br_startoff + del.br_blockcount;
> +	*commit = true;
>  	return 0;
> +}
>  
> -out_cancel:
> -	xfs_trans_cancel(tp);
> +
> +/*
> + * Remap part of the CoW fork into the data fork.
> + *
> + * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
> + * into the data fork; this function will remap what it can (at the end of the
> + * range) and update @end_fsb appropriately.  Each remap gets its own
> + * transaction because we can end up merging and splitting bmbt blocks for
> + * every remap operation and we'd like to keep the block reservation
> + * requirements as low as possible.
> + */
> +STATIC int
> +xfs_reflink_end_cow_extent(
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		*offset_fsb,
> +	xfs_fileoff_t		end_fsb)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_trans	*tp;
> +	unsigned int		resblks;
> +	int			error;
> +	bool			commit = false;
> +
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> +			XFS_TRANS_RESERVE, &tp);
> +	if (error)
> +		return error;
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	xfs_trans_ijoin(tp, ip, 0);
> +
> +	error = xfs_reflink_end_cow_extent_locked(ip, offset_fsb,
> +					end_fsb, tp, &commit);
> +	if (commit)
> +		error = xfs_trans_commit(tp);
> +	else
> +		xfs_trans_cancel(tp);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	return error;
>  }
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic()
  2025-02-04 12:01 ` [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-02-05 19:55   ` Darrick J. Wong
  2025-02-06 10:43     ` John Garry
  2025-02-10 16:59   ` John Garry
  1 sibling, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 19:55 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:24PM +0000, John Garry wrote:
> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> 
> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
> in CoW-based atomic write mode.
> 
> In the CoW-based atomic write mode, first unshare blocks so that we don't
> have a cow fork for the data in the range which we are writing.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index fd05b66aea3f..12af5cdc3094 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -619,6 +619,55 @@ xfs_file_dio_write_aligned(
>  	return ret;
>  }
>  
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	bool			use_cow = false;
> +	unsigned int		dio_flags;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (use_cow) {
> +		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
> +			iov_iter_count(from));

Nit: continuation lines should be indented two tabs:

		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
				iov_iter_count(from));

> +		if (ret)
> +			goto out_unlock;
> +	}
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	if (use_cow)
> +		dio_flags = IOMAP_DIO_ATOMIC_COW;
> +	else
> +		dio_flags = 0;

I also think you could eliminate use_cow by initializing dio_flags to
zero at the top, OR'ing in IOMAP_DIO_ATOMIC_COW in the retry clause
below, and using (dio_flags & IOMAP_DIO_ATOMIC_COW) to determine if you
should call unshare above.

Note: This serializes all the software untorn direct writes.  I think
a more performant solution would allocate the cow staging blocks ondisk,
attach them to the directio ioend context, and alter ->iomap_begin and
the ioend remap to use the attached blocks, but that's a lot more
surgery.

--D

> +
> +	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
> +
> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) && !use_cow) {
> +		xfs_iunlock(ip, iolock);
> +		iolock = XFS_IOLOCK_EXCL;
> +		use_cow = true;
> +		goto retry;
> +	}
> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +
>  /*
>   * Handle block unaligned direct I/O writes
>   *
> @@ -723,6 +772,8 @@ xfs_file_dio_write(
>  		return -EINVAL;
>  	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 06/10] xfs: iomap CoW-based atomic write support
  2025-02-04 12:01 ` [PATCH RFC 06/10] xfs: iomap " John Garry
@ 2025-02-05 20:05   ` Darrick J. Wong
  2025-02-06 11:10     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 20:05 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:23PM +0000, John Garry wrote:
> In cases of an atomic write occurs for misaligned or discontiguous disk
> blocks, we will use a CoW-based method to issue the atomic write.
> 
> So, for that case, return -EAGAIN to request that the write be issued in
> CoW atomic write mode. The dio write path should detect this, similar to
> how misaligned regalar DIO writes are handled.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iomap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ae3755ed00e6..2c2867d728e4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -809,9 +809,12 @@ xfs_direct_write_iomap_begin(
>  	struct xfs_bmbt_irec	imap, cmap;
>  	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>  	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> +	bool			atomic = flags & IOMAP_ATOMIC;
>  	int			nimaps = 1, error = 0;
>  	bool			shared = false;
> +	bool			found = false;
>  	u16			iomap_flags = 0;
> +	bool			need_alloc;
>  	unsigned int		lockmode;
>  	u64			seq;
>  
> @@ -832,7 +835,7 @@ xfs_direct_write_iomap_begin(
>  	 * COW writes may allocate delalloc space or convert unwritten COW
>  	 * extents, so we need to make sure to take the lock exclusively here.
>  	 */
> -	if (xfs_is_cow_inode(ip))
> +	if (xfs_is_cow_inode(ip) || atomic)
>  		lockmode = XFS_ILOCK_EXCL;
>  	else
>  		lockmode = XFS_ILOCK_SHARED;
> @@ -857,12 +860,73 @@ xfs_direct_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +
> +	if (flags & IOMAP_ATOMIC_COW) {
> +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> +				&lockmode,
> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);

Weird nit not relate to this patch: Is there ever a case where
IS_DAX(inode) and (flags & IOMAP_DAX) disagree?  I wonder if this odd
construction could be simplified to:

	(flags & (IOMAP_DIRECT | IOMAP_DAX))

?

> +		if (error)
> +			goto out_unlock;
> +
> +		end_fsb = imap.br_startoff + imap.br_blockcount;
> +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
> +
> +		if (imap.br_startblock != HOLESTARTBLOCK) {
> +			seq = xfs_iomap_inode_sequence(ip, 0);
> +
> +			error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags,
> +				iomap_flags | IOMAP_F_ATOMIC_COW, seq);
> +			if (error)
> +				goto out_unlock;
> +		}
> +		seq = xfs_iomap_inode_sequence(ip, 0);
> +		xfs_iunlock(ip, lockmode);
> +		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> +					iomap_flags | IOMAP_F_ATOMIC_COW, seq);
> +	}

/me wonders if this should be a separate helper so that the main
xfs_direct_write_iomap_begin doesn't get even longer... but otherwise
the logic in here looks sane.

> +
> +	need_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> +
> +	if (atomic) {
> +		/* Use CoW-based method if any of the following fail */
> +		error = -EAGAIN;
> +
> +		/*
> +		 * Lazily use CoW-based method for initial alloc of data.
> +		 * Check br_blockcount for FSes which do not support atomic
> +		 * writes > 1x block.
> +		 */
> +		if (need_alloc && imap.br_blockcount > 1)
> +			goto out_unlock;
> +
> +		/* Misaligned start block wrt size */
> +		if (!IS_ALIGNED(imap.br_startblock, imap.br_blockcount))
> +			goto out_unlock;
> +
> +		/* Discontiguous or mixed extents */
> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
> +			goto out_unlock;
> +	}

(Same two comments here.)

> +
>  	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>  		error = -EAGAIN;
>  		if (flags & IOMAP_NOWAIT)
>  			goto out_unlock;
>  
> +		if (atomic) {
> +			/* Detect whether we're already covered in a cow fork */
> +			error  = xfs_find_trim_cow_extent(ip, &imap, &cmap, &shared, &found);
> +			if (error)
> +				goto out_unlock;
> +
> +			if (shared) {
> +				error = -EAGAIN;
> +				goto out_unlock;

What is this checking?  That something else already created a mapping in
the COW fork, so we want to bail out to get rid of it?

--D

> +			}
> +		}
> +
>  		/* may drop and re-acquire the ilock */
> +		shared = false;
>  		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>  				&lockmode,
>  				(flags & IOMAP_DIRECT) || IS_DAX(inode), false);
> @@ -874,7 +938,7 @@ xfs_direct_write_iomap_begin(
>  		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>  	}
>  
> -	if (imap_needs_alloc(inode, flags, &imap, nimaps))
> +	if (need_alloc)
>  		goto allocate_blocks;
>  
>  	/*
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/10] iomap: Support CoW-based atomic writes
  2025-02-04 12:01 ` [PATCH RFC 03/10] iomap: Support CoW-based atomic writes John Garry
@ 2025-02-05 20:11   ` Darrick J. Wong
  2025-02-06 11:21     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-05 20:11 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Tue, Feb 04, 2025 at 12:01:20PM +0000, John Garry wrote:
> Currently atomic write support requires dedicated HW support. This imposes
> a restriction on the filesystem that disk blocks need to be aligned and
> contiguously mapped to FS blocks to issue atomic writes.
> 
> XFS has no method to guarantee FS block alignment. As such, atomic writes
> are currently limited to 1x FS block there.
> 
> To allow deal with the scenario that we are issuing an atomic write over
> misaligned or discontiguous data blocks larger atomic writes - and raise
> the atomic write limit - support a CoW-based software emulated atomic
> write mode.
> 
> For this special mode, the FS will reserve blocks for that data to be
> written and then atomically map that data in once the data has been
> commited to disk.
> 
> It is the responsibility of the FS to detect discontiguous atomic writes
> and switch to IOMAP_DIO_ATOMIC_COW mode and retry the write.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 23 ++++++++++++++++-------
>  include/linux/iomap.h |  9 +++++++++
>  2 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 3dd883dd77d2..e63b5096bcd8 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -271,7 +271,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   * clearing the WRITE_THROUGH flag in the dio request.
>   */
>  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> -		const struct iomap *iomap, bool use_fua, bool atomic)
> +		const struct iomap *iomap, bool use_fua, bool atomic_bio)
>  {
>  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
>  
> @@ -283,7 +283,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  		opflags |= REQ_FUA;
>  	else
>  		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> -	if (atomic)
> +	if (atomic_bio)
>  		opflags |= REQ_ATOMIC;
>  
>  	return opflags;
> @@ -301,13 +301,19 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
>  	bool need_zeroout = false;
> +	bool atomic_bio = false;
>  	bool use_fua = false;
>  	int nr_pages, ret = 0;
>  	size_t copied = 0;
>  	size_t orig_count;
>  
> -	if (atomic && length != iter->len)
> -		return -EINVAL;
> +	if (atomic) {
> +		if (!(iomap->flags & IOMAP_F_ATOMIC_COW)) {
> +			if (length != iter->len)
> +				return -EINVAL;
> +			atomic_bio = true;
> +		}
> +	}
>  
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> @@ -318,7 +324,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		need_zeroout = true;
>  	}
>  
> -	if (iomap->flags & IOMAP_F_SHARED)
> +	if (iomap->flags & (IOMAP_F_SHARED | IOMAP_F_ATOMIC_COW))
>  		dio->flags |= IOMAP_DIO_COW;
>  
>  	if (iomap->flags & IOMAP_F_NEW) {
> @@ -383,7 +389,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  			goto out;
>  	}
>  
> -	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_bio);
>  
>  	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>  	do {
> @@ -416,7 +422,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> -		if (WARN_ON_ONCE(atomic && n != length)) {
> +		if (WARN_ON_ONCE(atomic_bio && n != length)) {
>  			/*
>  			 * This bio should have covered the complete length,
>  			 * which it doesn't, so error. We may need to zero out
> @@ -639,6 +645,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
>  			dio->flags |= IOMAP_DIO_CALLER_COMP;
>  
> +		if (dio_flags & IOMAP_DIO_ATOMIC_COW)
> +			iomi.flags |= IOMAP_ATOMIC_COW;
> +
>  		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
>  			ret = -EAGAIN;
>  			if (iomi.pos >= dio->i_size ||
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 75bf54e76f3b..0a0b6798f517 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -56,6 +56,8 @@ struct vm_fault;
>   *
>   * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
>   * never be merged with the mapping before it.
> + *
> + * IOMAP_F_ATOMIC_COW indicates that we require atomic CoW end IO handling.

It more indicates that the filesystem is using copy on write to handle
an untorn write, and will provide the ioend support necessary to commit
the remapping atomically, right?

>   */
>  #define IOMAP_F_NEW		(1U << 0)
>  #define IOMAP_F_DIRTY		(1U << 1)
> @@ -68,6 +70,7 @@ struct vm_fault;
>  #endif /* CONFIG_BUFFER_HEAD */
>  #define IOMAP_F_XATTR		(1U << 5)
>  #define IOMAP_F_BOUNDARY	(1U << 6)
> +#define IOMAP_F_ATOMIC_COW	(1U << 7)
>  
>  /*
>   * Flags set by the core iomap code during operations:
> @@ -183,6 +186,7 @@ struct iomap_folio_ops {
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
>  #define IOMAP_ATOMIC		(1 << 9)
> +#define IOMAP_ATOMIC_COW	(1 << 10)

What does IOMAP_ATOMIC_COW do?  There's no description for it (or for
IOMAP_ATOMIC).  Can you have IOMAP_ATOMIC and IOMAP_ATOMIC_COW both set?
Or are they mutually exclusive?

I'm guessing from the code that ATOMIC_COW requires ATOMIC to be set,
but I wonder why because there's no documentation update in the header
files or in Documentation/filesystems/iomap/.

--D

>  struct iomap_ops {
>  	/*
> @@ -434,6 +438,11 @@ struct iomap_dio_ops {
>   */
>  #define IOMAP_DIO_PARTIAL		(1 << 2)
>  
> +/*
> + * Use CoW-based software emulated atomic write.
> + */
> +#define IOMAP_DIO_ATOMIC_COW		(1 << 3)
> +
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
>  		unsigned int dio_flags, void *private, size_t done_before);
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint
  2025-02-05 19:20   ` Darrick J. Wong
@ 2025-02-06  8:10     ` John Garry
  2025-02-06 21:54       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06  8:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 19:20, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 12:01:27PM +0000, John Garry wrote:
>> When issuing an atomic write by the CoW method, give the block allocator a
>> hint to naturally align the data blocks.
>>
>> This means that we have a better chance to issuing the atomic write via
>> HW offload next time.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>>   fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>>   fs/xfs/xfs_reflink.c     | 8 ++++++--
>>   3 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 40ad22fb808b..7a3910018dee 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	if (align > 1 && ap->flags & XFS_BMAPI_NALIGN)
>> +		args->alignment = align;
>> +	else
>> +		args->alignment = 1;
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3781,7 +3787,6 @@ xfs_bmap_btalloc(
>>   		.wasdel		= ap->wasdel,
>>   		.resv		= XFS_AG_RESV_NONE,
>>   		.datatype	= ap->datatype,
>> -		.alignment	= 1,
>>   		.minalignslop	= 0,
>>   	};
>>   	xfs_fileoff_t		orig_offset;
>> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
>> index 4b721d935994..d68b594c3fa2 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.h
>> +++ b/fs/xfs/libxfs/xfs_bmap.h
>> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>>   /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>>   #define XFS_BMAPI_NORMAP	(1u << 10)
>>   
>> +/* Try to naturally align allocations */
>> +#define XFS_BMAPI_NALIGN	(1u << 11)
>> +
>>   #define XFS_BMAPI_FLAGS \
>>   	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
>>   	{ XFS_BMAPI_METADATA,	"METADATA" }, \
>> @@ -98,7 +101,8 @@ struct xfs_bmalloca {
>>   	{ XFS_BMAPI_REMAP,	"REMAP" }, \
>>   	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
>>   	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
>> -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
>> +	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
>> +	{ XFS_BMAPI_NALIGN,	"NALIGN" }
> 
> Tihs isn't really "naturally" aligned, is it?  It really means "try to
> align allocations to the extent size hint", which isn't required to be a
> power of two.

Sure, so I would expect that the user will set extsize/cowextsize 
according to the size what we want to do atomics for, and we can align 
to that. I don't think that it makes a difference that either extsize 
isn't mandated to be a power-of-2.

So then I should rename to XFS_BMAPI_EXTSZALIGN or something like that - ok?

Thanks,
John


> 
> --D
> 
>>   
>>   
>>   static inline int xfs_bmapi_aflag(int w)
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 60c986300faa..198fb5372f10 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
>>   	int			nimaps;
>>   	int			error;
>>   	bool			found;
>> +	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
>> +					XFS_BMAPI_PREALLOC;
>> +
>> +	if (atomic)
>> +		bmapi_flags |= XFS_BMAPI_NALIGN;
>>   
>>   	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
>>   		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
>> @@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
>>   	/* Allocate the entire reservation as unwritten blocks. */
>>   	nimaps = 1;
>>   	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
>> -			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
>> -			&nimaps);
>> +			bmapi_flags, 0, cmap, &nimaps);
>>   	if (error)
>>   		goto out_trans_cancel;
>>   
>> -- 
>> 2.31.1
>>
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 09/10] xfs: Update atomic write max size
  2025-02-05 19:41   ` Darrick J. Wong
@ 2025-02-06  9:15     ` John Garry
  2025-02-06 21:54       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06  9:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 19:41, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 12:01:26PM +0000, John Garry wrote:
>> Now that CoW-based atomic writes are supported, update the max size of an
>> atomic write.
>>
>> For simplicity, limit at the max of what the mounted bdev can support in
>> terms of atomic write limits. Maybe in future we will have a better way
>> to advertise this optimised limit.
>>
>> In addition, the max atomic write size needs to be aligned to the agsize.
>> Currently when attempting to use HW offload, we  just check that the
>> mapping startblock is aligned. However, that is just the startblock within
>> the AG, and the AG may not be properly aligned to the underlying block
>> device atomic write limits.
>>
>> As such, limit atomic writes to the greatest power-of-2 which fits in an
>> AG, so that aligning to the startblock will be mean that we are also
>> aligned to the disk block.

Right, "startblock" is a bit vague

> 
> I don't understand this sentence -- what are we "aligning to the
> startblock"?  I think you're saying that you want to limit the size of
> untorn writes to the greatest power-of-two factor of the agsize so that
> allocations for an untorn write will always be aligned compatibly with
> the alignment requirements of the storage for an untorn write?

Yes, that's it. I'll borrow your wording :)

> 
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iops.c  |  7 ++++++-
>>   fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
>>   fs/xfs/xfs_mount.h |  1 +
>>   3 files changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index ea79fb246e33..95681d6c2bcd 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -606,12 +606,17 @@ xfs_get_atomic_write_attr(
>>   	unsigned int		*unit_min,
>>   	unsigned int		*unit_max)
>>   {
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +
>>   	if (!xfs_inode_can_atomicwrite(ip)) {
>>   		*unit_min = *unit_max = 0;
>>   		return;
>>   	}
>>   
>> -	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
>> +	*unit_min = ip->i_mount->m_sb.sb_blocksize;
>> +	*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
>> +					target->bt_bdev_awu_max);
>>   }
>>   
>>   static void
>> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
>> index 477c5262cf91..4e60347f6b7e 100644
>> --- a/fs/xfs/xfs_mount.c
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
>>   	levels = max(levels, mp->m_rmap_maxlevels);
>>   	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
>>   }
>> +static inline void
>> +xfs_mp_compute_awu_max(
> 
> xfs_compute_awu_max() ?

ok

> 
>> +	struct xfs_mount	*mp)
>> +{
>> +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
>> +	xfs_agblock_t		awu_max;
>> +
>> +	if (!xfs_has_reflink(mp)) {
>> +		mp->awu_max = 1;
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Find highest power-of-2 evenly divisible into agsize and which
>> +	 * also fits into an unsigned int field.
>> +	 */
>> +	awu_max = 1;
>> +	while (1) {
>> +		if (agsize % (awu_max * 2))
>> +			break;
>> +		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
>> +			break;
>> +		awu_max *= 2;
>> +	}
>> +	mp->awu_max = awu_max;
> 
> I think you need two awu_maxes here -- one for the data device, and
> another for the realtime device.
How about we just don't support rtdev initially for this CoW-based 
method, i.e. stick at 1x FSB awu max?

 >  The rt computation is probably more
 > complex since I think it's the greatest power of two that fits in the rt
 > extent size if it isn't a power of two;> or the greatest power of 
two> that fits in the rtgroup if rtgroups are enabled; or probably just no
 > limit otherwise.
 >

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically
  2025-02-05 19:47   ` Darrick J. Wong
@ 2025-02-06 10:27     ` John Garry
  2025-02-06 21:50       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06 10:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 19:47, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 12:01:25PM +0000, John Garry wrote:
>> When completing a CoW-based write, each extent range mapping update is
>> covered by a separate transaction.
>>
>> For a CoW-based atomic write, all mappings must be changed at once, so
>> change to use a single transaction.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c    |  5 ++++-
>>   fs/xfs/xfs_reflink.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_reflink.h |  3 +++
>>   3 files changed, 55 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 12af5cdc3094..170d7891f90d 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
>>   	nofs_flag = memalloc_nofs_save();
>>   
>>   	if (flags & IOMAP_DIO_COW) {
>> -		error = xfs_reflink_end_cow(ip, offset, size);
>> +		if (iocb->ki_flags & IOCB_ATOMIC)
>> +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
>> +		else
>> +			error = xfs_reflink_end_cow(ip, offset, size);
>>   		if (error)
>>   			goto out;
>>   	}
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index dbce333b60eb..60c986300faa 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -990,6 +990,54 @@ xfs_reflink_end_cow(
>>   		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>>   	return error;
>>   }
>> +int
>> +xfs_reflink_end_atomic_cow(
>> +	struct xfs_inode		*ip,
>> +	xfs_off_t			offset,
>> +	xfs_off_t			count)
>> +{
>> +	xfs_fileoff_t			offset_fsb;
>> +	xfs_fileoff_t			end_fsb;
>> +	int				error = 0;
>> +	struct xfs_mount		*mp = ip->i_mount;
>> +	struct xfs_trans		*tp;
>> +	unsigned int			resblks;
>> +	bool				commit = false;
>> +
>> +	trace_xfs_reflink_end_cow(ip, offset, count);
>> +
>> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
>> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
>> +
>> +	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
>> +				(unsigned int)(end_fsb - offset_fsb),
>> +				XFS_DATA_FORK);
>> +
>> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> 
> xfs gained reflink support for realtime volumes in 6.14-rc1, so you now
> have to calculate for that in here too.
> 
>> +			XFS_TRANS_RESERVE, &tp);
>> +	if (error)
>> +		return error;
>> +
>> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +	xfs_trans_ijoin(tp, ip, 0);
>> +
>> +	while (end_fsb > offset_fsb && !error)
>> +		error = xfs_reflink_end_cow_extent_locked(ip, &offset_fsb,
>> +						end_fsb, tp, &commit);
> 
> Hmm.  Attaching intent items to a transaction consumes space in that
> transaction, so we probably ought to limit the amount that we try to do
> here.  Do you know what that limit is?  I don't, 

nor do I ...

> but it's roughly
> tr_logres divided by the average size of a log intent item.

So you have a ballpark figure on the average size of a log intent item, 
or an idea on how to get it?

> 
> This means we need to restrict the size of an untorn write to a
> double-digit number of fsblocks for safety.

Sure, but won't we also still be liable to suffer the same issue which 
was fixed in commit d6f215f359637?

> 
> The logic in here looks reasonable though.
> 

Thanks,
John

> --D
> 
>> +
>> +	if (error || !commit)
>> +		goto out_cancel;
>> +
>> +	if (error)
>> +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>> +	error = xfs_trans_commit(tp);
>> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	return error;
>> +out_cancel:
>> +	xfs_trans_cancel(tp);
>> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +	return error;
>> +}
>>   
>>   /*
>>    * Free all CoW staging blocks that are still referenced by the ondisk refcount
>> diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
>> index ef5c8b2398d8..2c3b096c1386 100644
>> --- a/fs/xfs/xfs_reflink.h
>> +++ b/fs/xfs/xfs_reflink.h
>> @@ -45,6 +45,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
>>   		xfs_off_t count, bool cancel_real);
>>   extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
>>   		xfs_off_t count);
>> +		int
>> +xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
>> +		xfs_off_t count);
>>   extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
>>   extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
>>   		struct file *file_out, loff_t pos_out, loff_t len,
>> -- 
>> 2.31.1
>>
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-05 19:50   ` Darrick J. Wong
@ 2025-02-06 10:35     ` John Garry
  2025-02-06 21:38       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06 10:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 19:50, Darrick J. Wong wrote:
>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>> index 59f7fc16eb80..580469668334 100644
>> --- a/fs/xfs/xfs_reflink.c
>> +++ b/fs/xfs/xfs_reflink.c
>> @@ -786,35 +786,20 @@ xfs_reflink_update_quota(
>>    * requirements as low as possible.
>>    */
>>   STATIC int
>> -xfs_reflink_end_cow_extent(
>> +xfs_reflink_end_cow_extent_locked(
>>   	struct xfs_inode	*ip,
>>   	xfs_fileoff_t		*offset_fsb,
>> -	xfs_fileoff_t		end_fsb)
>> +	xfs_fileoff_t		end_fsb,
>> +	struct xfs_trans	*tp,
>> +	bool			*commit)
> Transactions usually come before the inode in the parameter list.

ok

> 
> You don't need to pass out a @commit flag -- if the function returns
> nonzero then the caller has to cancel the transaction; otherwise it will
> return zero and the caller should commit it.>  There's no penalty for
> committing a non-dirty transaction.

If there is no penalty, then fine. But I don't feel totally comfortable 
with this and would prefer to keep the same behavior.

Thanks,
John




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic()
  2025-02-05 19:55   ` Darrick J. Wong
@ 2025-02-06 10:43     ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-06 10:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 19:55, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 12:01:24PM +0000, John Garry wrote:
>> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
>>
>> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
>> in CoW-based atomic write mode.
>>
>> In the CoW-based atomic write mode, first unshare blocks so that we don't
>> have a cow fork for the data in the range which we are writing.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 51 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index fd05b66aea3f..12af5cdc3094 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -619,6 +619,55 @@ xfs_file_dio_write_aligned(
>>   	return ret;
>>   }
>>   
>> +static noinline ssize_t
>> +xfs_file_dio_write_atomic(
>> +	struct xfs_inode	*ip,
>> +	struct kiocb		*iocb,
>> +	struct iov_iter		*from)
>> +{
>> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
>> +	bool			use_cow = false;
>> +	unsigned int		dio_flags;
>> +	ssize_t			ret;
>> +
>> +retry:
>> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = xfs_file_write_checks(iocb, from, &iolock);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	if (use_cow) {
>> +		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
>> +			iov_iter_count(from));
> 
> Nit: continuation lines should be indented two tabs:
> 
> 		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
> 				iov_iter_count(from));

ok

> 
>> +		if (ret)
>> +			goto out_unlock;
>> +	}
>> +
>> +	trace_xfs_file_direct_write(iocb, from);
>> +	if (use_cow)
>> +		dio_flags = IOMAP_DIO_ATOMIC_COW;
>> +	else
>> +		dio_flags = 0;
> 
> I also think you could eliminate use_cow by initializing dio_flags to
> zero at the top, OR'ing in IOMAP_DIO_ATOMIC_COW in the retry clause
> below, and using (dio_flags & IOMAP_DIO_ATOMIC_COW) to determine if you
> should call unshare above.

ok, fine, if you think that it is better

> 
> Note: This serializes all the software untorn direct writes.  I think
> a more performant solution would allocate the cow staging blocks ondisk,
> attach them to the directio ioend context, and alter ->iomap_begin and
> the ioend remap to use the attached blocks, but that's a lot more
> surgery.

sure, that does sound like it's quite intrusive. But whatever we do I 
would like to keep the behaviour that racing reads and atomic writes 
mean that a read sees all old or all new data. That is how SCSI and NVMe 
behaves, even though it is not an advertised atomic write feature.

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 06/10] xfs: iomap CoW-based atomic write support
  2025-02-05 20:05   ` Darrick J. Wong
@ 2025-02-06 11:10     ` John Garry
  2025-02-06 21:44       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06 11:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 05/02/2025 20:05, Darrick J. Wong wrote:
> On Tue, Feb 04, 2025 at 12:01:23PM +0000, John Garry wrote:
>> In cases of an atomic write occurs for misaligned or discontiguous disk
>> blocks, we will use a CoW-based method to issue the atomic write.
>>
>> So, for that case, return -EAGAIN to request that the write be issued in
>> CoW atomic write mode. The dio write path should detect this, similar to
>> how misaligned regalar DIO writes are handled.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iomap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 66 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index ae3755ed00e6..2c2867d728e4 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -809,9 +809,12 @@ xfs_direct_write_iomap_begin(
>>   	struct xfs_bmbt_irec	imap, cmap;
>>   	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
>>   	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
>> +	bool			atomic = flags & IOMAP_ATOMIC;
>>   	int			nimaps = 1, error = 0;
>>   	bool			shared = false;
>> +	bool			found = false;
>>   	u16			iomap_flags = 0;
>> +	bool			need_alloc;
>>   	unsigned int		lockmode;
>>   	u64			seq;
>>   
>> @@ -832,7 +835,7 @@ xfs_direct_write_iomap_begin(
>>   	 * COW writes may allocate delalloc space or convert unwritten COW
>>   	 * extents, so we need to make sure to take the lock exclusively here.
>>   	 */
>> -	if (xfs_is_cow_inode(ip))
>> +	if (xfs_is_cow_inode(ip) || atomic)
>>   		lockmode = XFS_ILOCK_EXCL;
>>   	else
>>   		lockmode = XFS_ILOCK_SHARED;
>> @@ -857,12 +860,73 @@ xfs_direct_write_iomap_begin(
>>   	if (error)
>>   		goto out_unlock;
>>   
>> +
>> +	if (flags & IOMAP_ATOMIC_COW) {
>> +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
>> +				&lockmode,
>> +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
> 
> Weird nit not relate to this patch: Is there ever a case where
> IS_DAX(inode) and (flags & IOMAP_DAX) disagree?  I wonder if this odd
> construction could be simplified to:
> 
> 	(flags & (IOMAP_DIRECT | IOMAP_DAX))

I'm not sure. I assume that we always want to convert for DAX, and 
IOMAP_DAX may not be set always for DIO path - but I only see 
xfs_file_write_iter() -> xfs_file_dax_write() 
->dax_iomap_rw(xfs_dax_write_iomap_ops), which sets IOMAP_DAX in 
iomap_iter.flags

> 
> ?
> 
>> +		if (error)
>> +			goto out_unlock;
>> +
>> +		end_fsb = imap.br_startoff + imap.br_blockcount;
>> +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
>> +
>> +		if (imap.br_startblock != HOLESTARTBLOCK) {
>> +			seq = xfs_iomap_inode_sequence(ip, 0);
>> +
>> +			error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags,
>> +				iomap_flags | IOMAP_F_ATOMIC_COW, seq);
>> +			if (error)
>> +				goto out_unlock;
>> +		}
>> +		seq = xfs_iomap_inode_sequence(ip, 0);
>> +		xfs_iunlock(ip, lockmode);
>> +		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
>> +					iomap_flags | IOMAP_F_ATOMIC_COW, seq);
>> +	}
> 
> /me wonders if this should be a separate helper so that the main
> xfs_direct_write_iomap_begin doesn't get even longer... but otherwise
> the logic in here looks sane.

I can do that. Maybe some code can be factored out for regular "found 
cow path".

> 
>> +
>> +	need_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
>> +
>> +	if (atomic) {
>> +		/* Use CoW-based method if any of the following fail */
>> +		error = -EAGAIN;
>> +
>> +		/*
>> +		 * Lazily use CoW-based method for initial alloc of data.
>> +		 * Check br_blockcount for FSes which do not support atomic
>> +		 * writes > 1x block.
>> +		 */
>> +		if (need_alloc && imap.br_blockcount > 1)
>> +			goto out_unlock;
>> +
>> +		/* Misaligned start block wrt size */
>> +		if (!IS_ALIGNED(imap.br_startblock, imap.br_blockcount))
>> +			goto out_unlock;
>> +
>> +		/* Discontiguous or mixed extents */
>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
>> +			goto out_unlock;
>> +	}
> 
> (Same two comments here.)

ok

> 
>> +
>>   	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>>   		error = -EAGAIN;
>>   		if (flags & IOMAP_NOWAIT)
>>   			goto out_unlock;
>>   
>> +		if (atomic) {
>> +			/* Detect whether we're already covered in a cow fork */
>> +			error  = xfs_find_trim_cow_extent(ip, &imap, &cmap, &shared, &found);
>> +			if (error)
>> +				goto out_unlock;
>> +
>> +			if (shared) {
>> +				error = -EAGAIN;
>> +				goto out_unlock;
> 
> What is this checking?  That something else already created a mapping in
> the COW fork, so we want to bail out to get rid of it?

I want to check if some data is shared. In that case, we should unshare.

And I am not sure if that check is sufficient.

On the buffered write path, we may have something in a CoW fork - in 
that case it should be flushed, right?

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/10] iomap: Support CoW-based atomic writes
  2025-02-05 20:11   ` Darrick J. Wong
@ 2025-02-06 11:21     ` John Garry
  2025-02-06 21:40       ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2025-02-06 11:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen


>>   			if (iomi.pos >= dio->i_size ||
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 75bf54e76f3b..0a0b6798f517 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -56,6 +56,8 @@ struct vm_fault;
>>    *
>>    * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
>>    * never be merged with the mapping before it.
>> + *
>> + * IOMAP_F_ATOMIC_COW indicates that we require atomic CoW end IO handling.
> 
> It more indicates that the filesystem is using copy on write to handle
> an untorn write, and will provide the ioend support necessary to commit
> the remapping atomically, right?

yes, correct

> 
>>    */
>>   #define IOMAP_F_NEW		(1U << 0)
>>   #define IOMAP_F_DIRTY		(1U << 1)
>> @@ -68,6 +70,7 @@ struct vm_fault;
>>   #endif /* CONFIG_BUFFER_HEAD */
>>   #define IOMAP_F_XATTR		(1U << 5)
>>   #define IOMAP_F_BOUNDARY	(1U << 6)
>> +#define IOMAP_F_ATOMIC_COW	(1U << 7)
>>   
>>   /*
>>    * Flags set by the core iomap code during operations:
>> @@ -183,6 +186,7 @@ struct iomap_folio_ops {
>>   #define IOMAP_DAX		0
>>   #endif /* CONFIG_FS_DAX */
>>   #define IOMAP_ATOMIC		(1 << 9)
>> +#define IOMAP_ATOMIC_COW	(1 << 10)
> 
> What does IOMAP_ATOMIC_COW do?  There's no description for it (or for
> IOMAP_ATOMIC). 

I'll add a description for both.

> Can you have IOMAP_ATOMIC and IOMAP_ATOMIC_COW both set?

Yes

> Or are they mutually exclusive?

I am not thinking that it might be neater to have a distinct flag for 
IOMAP_ATOMIC when we want to try an atomic bio - maybe IOMAP_ATOMIC_HW 
or IOMAP_ATOMIC_BIO? And then also IOMAP_DIO_ATOMIC_BIO (in addition to 
IOMAP_DIO_ATOMIC_COW).

> 
> I'm guessing from the code that ATOMIC_COW requires ATOMIC to be set,
> but I wonder why because there's no documentation update in the header
> files or in Documentation/filesystems/iomap/.

Will do.

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-02-06 10:35     ` John Garry
@ 2025-02-06 21:38       ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:38 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 10:35:28AM +0000, John Garry wrote:
> On 05/02/2025 19:50, Darrick J. Wong wrote:
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 59f7fc16eb80..580469668334 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -786,35 +786,20 @@ xfs_reflink_update_quota(
> > >    * requirements as low as possible.
> > >    */
> > >   STATIC int
> > > -xfs_reflink_end_cow_extent(
> > > +xfs_reflink_end_cow_extent_locked(
> > >   	struct xfs_inode	*ip,
> > >   	xfs_fileoff_t		*offset_fsb,
> > > -	xfs_fileoff_t		end_fsb)
> > > +	xfs_fileoff_t		end_fsb,
> > > +	struct xfs_trans	*tp,
> > > +	bool			*commit)
> > Transactions usually come before the inode in the parameter list.
> 
> ok
> 
> > 
> > You don't need to pass out a @commit flag -- if the function returns
> > nonzero then the caller has to cancel the transaction; otherwise it will
> > return zero and the caller should commit it.>  There's no penalty for
> > committing a non-dirty transaction.
> 
> If there is no penalty, then fine. But I don't feel totally comfortable with
> this and would prefer to keep the same behavior.

Right now this is the only place in XFS that behaves this way, which
means you're adding a new code idiom that isn't present anywhere else in
the code base.

--D

> Thanks,
> John
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 03/10] iomap: Support CoW-based atomic writes
  2025-02-06 11:21     ` John Garry
@ 2025-02-06 21:40       ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:40 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 11:21:13AM +0000, John Garry wrote:
> 
> > >   			if (iomi.pos >= dio->i_size ||
> > > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > > index 75bf54e76f3b..0a0b6798f517 100644
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -56,6 +56,8 @@ struct vm_fault;
> > >    *
> > >    * IOMAP_F_BOUNDARY indicates that I/O and I/O completions for this iomap must
> > >    * never be merged with the mapping before it.
> > > + *
> > > + * IOMAP_F_ATOMIC_COW indicates that we require atomic CoW end IO handling.
> > 
> > It more indicates that the filesystem is using copy on write to handle
> > an untorn write, and will provide the ioend support necessary to commit
> > the remapping atomically, right?
> 
> yes, correct
> 
> > 
> > >    */
> > >   #define IOMAP_F_NEW		(1U << 0)
> > >   #define IOMAP_F_DIRTY		(1U << 1)
> > > @@ -68,6 +70,7 @@ struct vm_fault;
> > >   #endif /* CONFIG_BUFFER_HEAD */
> > >   #define IOMAP_F_XATTR		(1U << 5)
> > >   #define IOMAP_F_BOUNDARY	(1U << 6)
> > > +#define IOMAP_F_ATOMIC_COW	(1U << 7)
> > >   /*
> > >    * Flags set by the core iomap code during operations:
> > > @@ -183,6 +186,7 @@ struct iomap_folio_ops {
> > >   #define IOMAP_DAX		0
> > >   #endif /* CONFIG_FS_DAX */
> > >   #define IOMAP_ATOMIC		(1 << 9)
> > > +#define IOMAP_ATOMIC_COW	(1 << 10)
> > 
> > What does IOMAP_ATOMIC_COW do?  There's no description for it (or for
> > IOMAP_ATOMIC).
> 
> I'll add a description for both.
> 
> > Can you have IOMAP_ATOMIC and IOMAP_ATOMIC_COW both set?
> 
> Yes
> 
> > Or are they mutually exclusive?
> 
> I am not thinking that it might be neater to have a distinct flag for

"not"?

> IOMAP_ATOMIC when we want to try an atomic bio - maybe IOMAP_ATOMIC_HW or
> IOMAP_ATOMIC_BIO? And then also IOMAP_DIO_ATOMIC_BIO (in addition to
> IOMAP_DIO_ATOMIC_COW).

Yeah, ATOMIC_BIO/ATOMIC_COW is probably clearer both for IOMAP_ and
IOMAP_DIO_ .

--D

> > 
> > I'm guessing from the code that ATOMIC_COW requires ATOMIC to be set,
> > but I wonder why because there's no documentation update in the header
> > files or in Documentation/filesystems/iomap/.
> 
> Will do.
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 06/10] xfs: iomap CoW-based atomic write support
  2025-02-06 11:10     ` John Garry
@ 2025-02-06 21:44       ` Darrick J. Wong
  2025-02-07 11:48         ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:44 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 11:10:40AM +0000, John Garry wrote:
> On 05/02/2025 20:05, Darrick J. Wong wrote:
> > On Tue, Feb 04, 2025 at 12:01:23PM +0000, John Garry wrote:
> > > In cases of an atomic write occurs for misaligned or discontiguous disk
> > > blocks, we will use a CoW-based method to issue the atomic write.
> > > 
> > > So, for that case, return -EAGAIN to request that the write be issued in
> > > CoW atomic write mode. The dio write path should detect this, similar to
> > > how misaligned regalar DIO writes are handled.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iomap.c | 68 ++++++++++++++++++++++++++++++++++++++++++++--
> > >   1 file changed, 66 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index ae3755ed00e6..2c2867d728e4 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -809,9 +809,12 @@ xfs_direct_write_iomap_begin(
> > >   	struct xfs_bmbt_irec	imap, cmap;
> > >   	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
> > >   	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
> > > +	bool			atomic = flags & IOMAP_ATOMIC;
> > >   	int			nimaps = 1, error = 0;
> > >   	bool			shared = false;
> > > +	bool			found = false;
> > >   	u16			iomap_flags = 0;
> > > +	bool			need_alloc;
> > >   	unsigned int		lockmode;
> > >   	u64			seq;
> > > @@ -832,7 +835,7 @@ xfs_direct_write_iomap_begin(
> > >   	 * COW writes may allocate delalloc space or convert unwritten COW
> > >   	 * extents, so we need to make sure to take the lock exclusively here.
> > >   	 */
> > > -	if (xfs_is_cow_inode(ip))
> > > +	if (xfs_is_cow_inode(ip) || atomic)
> > >   		lockmode = XFS_ILOCK_EXCL;
> > >   	else
> > >   		lockmode = XFS_ILOCK_SHARED;
> > > @@ -857,12 +860,73 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +
> > > +	if (flags & IOMAP_ATOMIC_COW) {
> > > +		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> > > +				&lockmode,
> > > +				(flags & IOMAP_DIRECT) || IS_DAX(inode), true);
> > 
> > Weird nit not relate to this patch: Is there ever a case where
> > IS_DAX(inode) and (flags & IOMAP_DAX) disagree?  I wonder if this odd
> > construction could be simplified to:
> > 
> > 	(flags & (IOMAP_DIRECT | IOMAP_DAX))
> 
> I'm not sure. I assume that we always want to convert for DAX, and IOMAP_DAX
> may not be set always for DIO path - but I only see xfs_file_write_iter() ->
> xfs_file_dax_write() ->dax_iomap_rw(xfs_dax_write_iomap_ops), which sets
> IOMAP_DAX in iomap_iter.flags
> 
> > 
> > ?
> > 
> > > +		if (error)
> > > +			goto out_unlock;
> > > +
> > > +		end_fsb = imap.br_startoff + imap.br_blockcount;
> > > +		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
> > > +
> > > +		if (imap.br_startblock != HOLESTARTBLOCK) {
> > > +			seq = xfs_iomap_inode_sequence(ip, 0);
> > > +
> > > +			error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags,
> > > +				iomap_flags | IOMAP_F_ATOMIC_COW, seq);
> > > +			if (error)
> > > +				goto out_unlock;
> > > +		}
> > > +		seq = xfs_iomap_inode_sequence(ip, 0);
> > > +		xfs_iunlock(ip, lockmode);
> > > +		return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> > > +					iomap_flags | IOMAP_F_ATOMIC_COW, seq);
> > > +	}
> > 
> > /me wonders if this should be a separate helper so that the main
> > xfs_direct_write_iomap_begin doesn't get even longer... but otherwise
> > the logic in here looks sane.
> 
> I can do that. Maybe some code can be factored out for regular "found cow
> path".
> 
> > 
> > > +
> > > +	need_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
> > > +
> > > +	if (atomic) {
> > > +		/* Use CoW-based method if any of the following fail */
> > > +		error = -EAGAIN;
> > > +
> > > +		/*
> > > +		 * Lazily use CoW-based method for initial alloc of data.
> > > +		 * Check br_blockcount for FSes which do not support atomic
> > > +		 * writes > 1x block.
> > > +		 */
> > > +		if (need_alloc && imap.br_blockcount > 1)
> > > +			goto out_unlock;
> > > +
> > > +		/* Misaligned start block wrt size */
> > > +		if (!IS_ALIGNED(imap.br_startblock, imap.br_blockcount))
> > > +			goto out_unlock;
> > > +
> > > +		/* Discontiguous or mixed extents */
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb))
> > > +			goto out_unlock;
> > > +	}
> > 
> > (Same two comments here.)
> 
> ok
> 
> > 
> > > +
> > >   	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
> > >   		error = -EAGAIN;
> > >   		if (flags & IOMAP_NOWAIT)
> > >   			goto out_unlock;
> > > +		if (atomic) {
> > > +			/* Detect whether we're already covered in a cow fork */
> > > +			error  = xfs_find_trim_cow_extent(ip, &imap, &cmap, &shared, &found);
> > > +			if (error)
> > > +				goto out_unlock;
> > > +
> > > +			if (shared) {
> > > +				error = -EAGAIN;
> > > +				goto out_unlock;
> > 
> > What is this checking?  That something else already created a mapping in
> > the COW fork, so we want to bail out to get rid of it?
> 
> I want to check if some data is shared. In that case, we should unshare.

Why is it necessary to unshare?  Userspace gave us a buffer of new
contents, and we're already prepared to write that out of place and
remap it.

> And I am not sure if that check is sufficient.
> 
> On the buffered write path, we may have something in a CoW fork - in that
> case it should be flushed, right?

Flushed against what?  Concurrent writeback or something?  The directio
setup should have flushed dirty pagecache, so the only things left in
the COW fork are speculative preallocations.  (IOWs, I don't understand
what needs to be flushed or why.)

--D

> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically
  2025-02-06 10:27     ` John Garry
@ 2025-02-06 21:50       ` Darrick J. Wong
  2025-02-07 11:52         ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:50 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 10:27:45AM +0000, John Garry wrote:
> On 05/02/2025 19:47, Darrick J. Wong wrote:
> > On Tue, Feb 04, 2025 at 12:01:25PM +0000, John Garry wrote:
> > > When completing a CoW-based write, each extent range mapping update is
> > > covered by a separate transaction.
> > > 
> > > For a CoW-based atomic write, all mappings must be changed at once, so
> > > change to use a single transaction.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_file.c    |  5 ++++-
> > >   fs/xfs/xfs_reflink.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > >   fs/xfs/xfs_reflink.h |  3 +++
> > >   3 files changed, 55 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 12af5cdc3094..170d7891f90d 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
> > >   	nofs_flag = memalloc_nofs_save();
> > >   	if (flags & IOMAP_DIO_COW) {
> > > -		error = xfs_reflink_end_cow(ip, offset, size);
> > > +		if (iocb->ki_flags & IOCB_ATOMIC)
> > > +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
> > > +		else
> > > +			error = xfs_reflink_end_cow(ip, offset, size);
> > >   		if (error)
> > >   			goto out;
> > >   	}
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index dbce333b60eb..60c986300faa 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -990,6 +990,54 @@ xfs_reflink_end_cow(
> > >   		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > >   	return error;
> > >   }
> > > +int
> > > +xfs_reflink_end_atomic_cow(
> > > +	struct xfs_inode		*ip,
> > > +	xfs_off_t			offset,
> > > +	xfs_off_t			count)
> > > +{
> > > +	xfs_fileoff_t			offset_fsb;
> > > +	xfs_fileoff_t			end_fsb;
> > > +	int				error = 0;
> > > +	struct xfs_mount		*mp = ip->i_mount;
> > > +	struct xfs_trans		*tp;
> > > +	unsigned int			resblks;
> > > +	bool				commit = false;
> > > +
> > > +	trace_xfs_reflink_end_cow(ip, offset, count);
> > > +
> > > +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
> > > +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
> > > +
> > > +	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
> > > +				(unsigned int)(end_fsb - offset_fsb),
> > > +				XFS_DATA_FORK);
> > > +
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> > 
> > xfs gained reflink support for realtime volumes in 6.14-rc1, so you now
> > have to calculate for that in here too.
> > 
> > > +			XFS_TRANS_RESERVE, &tp);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > +	xfs_trans_ijoin(tp, ip, 0);
> > > +
> > > +	while (end_fsb > offset_fsb && !error)
> > > +		error = xfs_reflink_end_cow_extent_locked(ip, &offset_fsb,
> > > +						end_fsb, tp, &commit);
> > 
> > Hmm.  Attaching intent items to a transaction consumes space in that
> > transaction, so we probably ought to limit the amount that we try to do
> > here.  Do you know what that limit is?  I don't,
> 
> nor do I ...
> 
> > but it's roughly
> > tr_logres divided by the average size of a log intent item.
> 
> So you have a ballpark figure on the average size of a log intent item, or
> an idea on how to get it?

You could add up the size of struct
xfs_{bui,rmap,refcount,efi}_log_format structures and add 20%, that will
give you a ballpark figure of the worst case per-block requirements.

My guess is that 64 blocks is ok provided resblks is big enough.  But I
guess we could estimate it (very conservatively) dynamically too.

(also note tr_itruncate declares more logres)

> > This means we need to restrict the size of an untorn write to a
> > double-digit number of fsblocks for safety.
> 
> Sure, but won't we also still be liable to suffer the same issue which was
> fixed in commit d6f215f359637?

Yeah, come to think of it, you need to reserve the worst case space
reservation, i.e. each of the blocks between offset_fsb and end_fsb
becomes a separate btree update.

	resblks = (end_fsb - offset_fsb) *
			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);

--D

> > 
> > The logic in here looks reasonable though.
> > 
> 
> Thanks,
> John
> 
> > --D
> > 
> > > +
> > > +	if (error || !commit)
> > > +		goto out_cancel;
> > > +
> > > +	if (error)
> > > +		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
> > > +	error = xfs_trans_commit(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return error;
> > > +out_cancel:
> > > +	xfs_trans_cancel(tp);
> > > +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > +	return error;
> > > +}
> > >   /*
> > >    * Free all CoW staging blocks that are still referenced by the ondisk refcount
> > > diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
> > > index ef5c8b2398d8..2c3b096c1386 100644
> > > --- a/fs/xfs/xfs_reflink.h
> > > +++ b/fs/xfs/xfs_reflink.h
> > > @@ -45,6 +45,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
> > >   		xfs_off_t count, bool cancel_real);
> > >   extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
> > >   		xfs_off_t count);
> > > +		int
> > > +xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
> > > +		xfs_off_t count);
> > >   extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
> > >   extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
> > >   		struct file *file_out, loff_t pos_out, loff_t len,
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 09/10] xfs: Update atomic write max size
  2025-02-06  9:15     ` John Garry
@ 2025-02-06 21:54       ` Darrick J. Wong
  2025-02-07 11:53         ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:54 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 09:15:16AM +0000, John Garry wrote:
> On 05/02/2025 19:41, Darrick J. Wong wrote:
> > On Tue, Feb 04, 2025 at 12:01:26PM +0000, John Garry wrote:
> > > Now that CoW-based atomic writes are supported, update the max size of an
> > > atomic write.
> > > 
> > > For simplicity, limit at the max of what the mounted bdev can support in
> > > terms of atomic write limits. Maybe in future we will have a better way
> > > to advertise this optimised limit.
> > > 
> > > In addition, the max atomic write size needs to be aligned to the agsize.
> > > Currently when attempting to use HW offload, we  just check that the
> > > mapping startblock is aligned. However, that is just the startblock within
> > > the AG, and the AG may not be properly aligned to the underlying block
> > > device atomic write limits.
> > > 
> > > As such, limit atomic writes to the greatest power-of-2 which fits in an
> > > AG, so that aligning to the startblock will be mean that we are also
> > > aligned to the disk block.
> 
> Right, "startblock" is a bit vague
> 
> > 
> > I don't understand this sentence -- what are we "aligning to the
> > startblock"?  I think you're saying that you want to limit the size of
> > untorn writes to the greatest power-of-two factor of the agsize so that
> > allocations for an untorn write will always be aligned compatibly with
> > the alignment requirements of the storage for an untorn write?
> 
> Yes, that's it. I'll borrow your wording :)
> 
> > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iops.c  |  7 ++++++-
> > >   fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
> > >   fs/xfs/xfs_mount.h |  1 +
> > >   3 files changed, 35 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index ea79fb246e33..95681d6c2bcd 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -606,12 +606,17 @@ xfs_get_atomic_write_attr(
> > >   	unsigned int		*unit_min,
> > >   	unsigned int		*unit_max)
> > >   {
> > > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +
> > >   	if (!xfs_inode_can_atomicwrite(ip)) {
> > >   		*unit_min = *unit_max = 0;
> > >   		return;
> > >   	}
> > > -	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
> > > +	*unit_min = ip->i_mount->m_sb.sb_blocksize;
> > > +	*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
> > > +					target->bt_bdev_awu_max);
> > >   }
> > >   static void
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index 477c5262cf91..4e60347f6b7e 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -651,6 +651,32 @@ xfs_agbtree_compute_maxlevels(
> > >   	levels = max(levels, mp->m_rmap_maxlevels);
> > >   	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
> > >   }
> > > +static inline void
> > > +xfs_mp_compute_awu_max(
> > 
> > xfs_compute_awu_max() ?
> 
> ok
> 
> > 
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
> > > +	xfs_agblock_t		awu_max;
> > > +
> > > +	if (!xfs_has_reflink(mp)) {
> > > +		mp->awu_max = 1;
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Find highest power-of-2 evenly divisible into agsize and which
> > > +	 * also fits into an unsigned int field.
> > > +	 */
> > > +	awu_max = 1;
> > > +	while (1) {
> > > +		if (agsize % (awu_max * 2))
> > > +			break;
> > > +		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
> > > +			break;
> > > +		awu_max *= 2;
> > > +	}
> > > +	mp->awu_max = awu_max;
> > 
> > I think you need two awu_maxes here -- one for the data device, and
> > another for the realtime device.
> How about we just don't support rtdev initially for this CoW-based method,
> i.e. stick at 1x FSB awu max?

I guess, but that's more unfinished business.

--D

> >  The rt computation is probably more
> > complex since I think it's the greatest power of two that fits in the rt
> > extent size if it isn't a power of two;> or the greatest power of two>
> that fits in the rtgroup if rtgroups are enabled; or probably just no
> > limit otherwise.
> >
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint
  2025-02-06  8:10     ` John Garry
@ 2025-02-06 21:54       ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2025-02-06 21:54 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On Thu, Feb 06, 2025 at 08:10:24AM +0000, John Garry wrote:
> On 05/02/2025 19:20, Darrick J. Wong wrote:
> > On Tue, Feb 04, 2025 at 12:01:27PM +0000, John Garry wrote:
> > > When issuing an atomic write by the CoW method, give the block allocator a
> > > hint to naturally align the data blocks.
> > > 
> > > This means that we have a better chance to issuing the atomic write via
> > > HW offload next time.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
> > >   fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
> > >   fs/xfs/xfs_reflink.c     | 8 ++++++--
> > >   3 files changed, 17 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 40ad22fb808b..7a3910018dee 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
> > >   		align = xfs_get_cowextsz_hint(ap->ip);
> > >   	else if (ap->datatype & XFS_ALLOC_USERDATA)
> > >   		align = xfs_get_extsz_hint(ap->ip);
> > > +
> > > +	if (align > 1 && ap->flags & XFS_BMAPI_NALIGN)
> > > +		args->alignment = align;
> > > +	else
> > > +		args->alignment = 1;
> > > +
> > >   	if (align) {
> > >   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
> > >   					ap->eof, 0, ap->conv, &ap->offset,
> > > @@ -3781,7 +3787,6 @@ xfs_bmap_btalloc(
> > >   		.wasdel		= ap->wasdel,
> > >   		.resv		= XFS_AG_RESV_NONE,
> > >   		.datatype	= ap->datatype,
> > > -		.alignment	= 1,
> > >   		.minalignslop	= 0,
> > >   	};
> > >   	xfs_fileoff_t		orig_offset;
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 4b721d935994..d68b594c3fa2 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -87,6 +87,9 @@ struct xfs_bmalloca {
> > >   /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
> > >   #define XFS_BMAPI_NORMAP	(1u << 10)
> > > +/* Try to naturally align allocations */
> > > +#define XFS_BMAPI_NALIGN	(1u << 11)
> > > +
> > >   #define XFS_BMAPI_FLAGS \
> > >   	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
> > >   	{ XFS_BMAPI_METADATA,	"METADATA" }, \
> > > @@ -98,7 +101,8 @@ struct xfs_bmalloca {
> > >   	{ XFS_BMAPI_REMAP,	"REMAP" }, \
> > >   	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
> > >   	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
> > > -	{ XFS_BMAPI_NORMAP,	"NORMAP" }
> > > +	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
> > > +	{ XFS_BMAPI_NALIGN,	"NALIGN" }
> > 
> > Tihs isn't really "naturally" aligned, is it?  It really means "try to
> > align allocations to the extent size hint", which isn't required to be a
> > power of two.
> 
> Sure, so I would expect that the user will set extsize/cowextsize according
> to the size what we want to do atomics for, and we can align to that. I
> don't think that it makes a difference that either extsize isn't mandated to
> be a power-of-2.
> 
> So then I should rename to XFS_BMAPI_EXTSZALIGN or something like that - ok?

Yep.

--D

> Thanks,
> John
> 
> 
> > 
> > --D
> > 
> > >   static inline int xfs_bmapi_aflag(int w)
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index 60c986300faa..198fb5372f10 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
> > >   	int			nimaps;
> > >   	int			error;
> > >   	bool			found;
> > > +	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
> > > +					XFS_BMAPI_PREALLOC;
> > > +
> > > +	if (atomic)
> > > +		bmapi_flags |= XFS_BMAPI_NALIGN;
> > >   	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
> > >   		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
> > > @@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
> > >   	/* Allocate the entire reservation as unwritten blocks. */
> > >   	nimaps = 1;
> > >   	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
> > > -			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
> > > -			&nimaps);
> > > +			bmapi_flags, 0, cmap, &nimaps);
> > >   	if (error)
> > >   		goto out_trans_cancel;
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 06/10] xfs: iomap CoW-based atomic write support
  2025-02-06 21:44       ` Darrick J. Wong
@ 2025-02-07 11:48         ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-07 11:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 06/02/2025 21:44, Darrick J. Wong wrote:
>>> What is this checking?  That something else already created a mapping in
>>> the COW fork, so we want to bail out to get rid of it?
>> I want to check if some data is shared. In that case, we should unshare.
> Why is it necessary to unshare?  Userspace gave us a buffer of new
> contents, and we're already prepared to write that out of place and
> remap it.

fine, as long as the remap does what we need, then I won't bother with 
this explicit unshare.

> 
>> And I am not sure if that check is sufficient.
>>
>> On the buffered write path, we may have something in a CoW fork - in that
>> case it should be flushed, right?
> Flushed against what?  Concurrent writeback or something?  The directio
> setup should have flushed dirty pagecache, so the only things left in
> the COW fork are speculative preallocations.  (IOWs, I don't understand
> what needs to be flushed or why.)

ah, ok, as long as DIO would have flushed dirty relevant pagecache, then 
we should be good.

Cheers,
John


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically
  2025-02-06 21:50       ` Darrick J. Wong
@ 2025-02-07 11:52         ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-07 11:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 06/02/2025 21:50, Darrick J. Wong wrote:
> On Thu, Feb 06, 2025 at 10:27:45AM +0000, John Garry wrote:
>> On 05/02/2025 19:47, Darrick J. Wong wrote:
>>> On Tue, Feb 04, 2025 at 12:01:25PM +0000, John Garry wrote:
>>>> When completing a CoW-based write, each extent range mapping update is
>>>> covered by a separate transaction.
>>>>
>>>> For a CoW-based atomic write, all mappings must be changed at once, so
>>>> change to use a single transaction.
>>>>
>>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>>> ---
>>>>    fs/xfs/xfs_file.c    |  5 ++++-
>>>>    fs/xfs/xfs_reflink.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
>>>>    fs/xfs/xfs_reflink.h |  3 +++
>>>>    3 files changed, 55 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>>>> index 12af5cdc3094..170d7891f90d 100644
>>>> --- a/fs/xfs/xfs_file.c
>>>> +++ b/fs/xfs/xfs_file.c
>>>> @@ -527,7 +527,10 @@ xfs_dio_write_end_io(
>>>>    	nofs_flag = memalloc_nofs_save();
>>>>    	if (flags & IOMAP_DIO_COW) {
>>>> -		error = xfs_reflink_end_cow(ip, offset, size);
>>>> +		if (iocb->ki_flags & IOCB_ATOMIC)
>>>> +			error = xfs_reflink_end_atomic_cow(ip, offset, size);
>>>> +		else
>>>> +			error = xfs_reflink_end_cow(ip, offset, size);
>>>>    		if (error)
>>>>    			goto out;
>>>>    	}
>>>> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
>>>> index dbce333b60eb..60c986300faa 100644
>>>> --- a/fs/xfs/xfs_reflink.c
>>>> +++ b/fs/xfs/xfs_reflink.c
>>>> @@ -990,6 +990,54 @@ xfs_reflink_end_cow(
>>>>    		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
>>>>    	return error;
>>>>    }
>>>> +int
>>>> +xfs_reflink_end_atomic_cow(
>>>> +	struct xfs_inode		*ip,
>>>> +	xfs_off_t			offset,
>>>> +	xfs_off_t			count)
>>>> +{
>>>> +	xfs_fileoff_t			offset_fsb;
>>>> +	xfs_fileoff_t			end_fsb;
>>>> +	int				error = 0;
>>>> +	struct xfs_mount		*mp = ip->i_mount;
>>>> +	struct xfs_trans		*tp;
>>>> +	unsigned int			resblks;
>>>> +	bool				commit = false;
>>>> +
>>>> +	trace_xfs_reflink_end_cow(ip, offset, count);
>>>> +
>>>> +	offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
>>>> +	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
>>>> +
>>>> +	resblks = XFS_NEXTENTADD_SPACE_RES(ip->i_mount,
>>>> +				(unsigned int)(end_fsb - offset_fsb),
>>>> +				XFS_DATA_FORK);
>>>> +
>>>> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
>>>
>>> xfs gained reflink support for realtime volumes in 6.14-rc1, so you now
>>> have to calculate for that in here too.
>>>
>>>> +			XFS_TRANS_RESERVE, &tp);
>>>> +	if (error)
>>>> +		return error;
>>>> +
>>>> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
>>>> +	xfs_trans_ijoin(tp, ip, 0);
>>>> +
>>>> +	while (end_fsb > offset_fsb && !error)
>>>> +		error = xfs_reflink_end_cow_extent_locked(ip, &offset_fsb,
>>>> +						end_fsb, tp, &commit);
>>>
>>> Hmm.  Attaching intent items to a transaction consumes space in that
>>> transaction, so we probably ought to limit the amount that we try to do
>>> here.  Do you know what that limit is?  I don't,
>>
>> nor do I ...
>>
>>> but it's roughly
>>> tr_logres divided by the average size of a log intent item.
>>
>> So you have a ballpark figure on the average size of a log intent item, or
>> an idea on how to get it?
> 
> You could add up the size of struct
> xfs_{bui,rmap,refcount,efi}_log_format structures and add 20%, that will
> give you a ballpark figure of the worst case per-block requirements.
> 
> My guess is that 64 blocks is ok provided resblks is big enough.  But I
> guess we could estimate it (very conservatively) dynamically too.
> 
> (also note tr_itruncate declares more logres)

64 blocks gives quite a large awu max, so that would be ok

> 
>>> This means we need to restrict the size of an untorn write to a
>>> double-digit number of fsblocks for safety.
>>
>> Sure, but won't we also still be liable to suffer the same issue which was
>> fixed in commit d6f215f359637?
> 
> Yeah, come to think of it, you need to reserve the worst case space
> reservation, i.e. each of the blocks between offset_fsb and end_fsb
> becomes a separate btree update.
> 
> 	resblks = (end_fsb - offset_fsb) *
> 			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
> 
>

ok

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 09/10] xfs: Update atomic write max size
  2025-02-06 21:54       ` Darrick J. Wong
@ 2025-02-07 11:53         ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-07 11:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen

On 06/02/2025 21:54, Darrick J. Wong wrote:
>>> I think you need two awu_maxes here -- one for the data device, and
>>> another for the realtime device.
>> How about we just don't support rtdev initially for this CoW-based method,
>> i.e. stick at 1x FSB awu max?
> I guess, but that's more unfinished business.

Understood. Let me see how the changes look for RT and then reconsider.

Cheers,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic()
  2025-02-04 12:01 ` [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic() John Garry
  2025-02-05 19:55   ` Darrick J. Wong
@ 2025-02-10 16:59   ` John Garry
  1 sibling, 0 replies; 35+ messages in thread
From: John Garry @ 2025-02-10 16:59 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen

On 04/02/2025 12:01, John Garry wrote:
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	bool			use_cow = false;
> +	unsigned int		dio_flags;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (use_cow) {
> +		ret = xfs_reflink_unshare(ip, iocb->ki_pos,
> +			iov_iter_count(from));
> +		if (ret)
> +			goto out_unlock;
> +	}
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	if (use_cow)
> +		dio_flags = IOMAP_DIO_ATOMIC_COW;

note to self:

We should also set IOMAP_DIO_FORCE_WAIT and call inode_dio_wait() here, 
similar to xfs_file_dio_write_unaligned()

> +	else
> +		dio_flags = 0;
> +
> +	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
> +
> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) && !use_cow) {
> +		xfs_iunlock(ip, iolock);
> +		iolock = XFS_IOLOCK_EXCL;
> +		use_cow = true;
> +		goto retry;
> +	}
> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-02-10 16:59 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-04 12:01 [PATCH RFC 00/10] large atomic writes for xfs with CoW John Garry
2025-02-04 12:01 ` [PATCH RFC 01/10] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
2025-02-04 12:01 ` [PATCH RFC 02/10] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
2025-02-05 19:50   ` Darrick J. Wong
2025-02-06 10:35     ` John Garry
2025-02-06 21:38       ` Darrick J. Wong
2025-02-04 12:01 ` [PATCH RFC 03/10] iomap: Support CoW-based atomic writes John Garry
2025-02-05 20:11   ` Darrick J. Wong
2025-02-06 11:21     ` John Garry
2025-02-06 21:40       ` Darrick J. Wong
2025-02-04 12:01 ` [PATCH RFC 04/10] xfs: Make xfs_find_trim_cow_extent() public John Garry
2025-02-04 12:01 ` [PATCH RFC 05/10] xfs: Reflink CoW-based atomic write support John Garry
2025-02-04 12:01 ` [PATCH RFC 06/10] xfs: iomap " John Garry
2025-02-05 20:05   ` Darrick J. Wong
2025-02-06 11:10     ` John Garry
2025-02-06 21:44       ` Darrick J. Wong
2025-02-07 11:48         ` John Garry
2025-02-04 12:01 ` [PATCH RFC 07/10] xfs: Add xfs_file_dio_write_atomic() John Garry
2025-02-05 19:55   ` Darrick J. Wong
2025-02-06 10:43     ` John Garry
2025-02-10 16:59   ` John Garry
2025-02-04 12:01 ` [PATCH RFC 08/10] xfs: Commit CoW-based atomic writes atomically John Garry
2025-02-05 19:47   ` Darrick J. Wong
2025-02-06 10:27     ` John Garry
2025-02-06 21:50       ` Darrick J. Wong
2025-02-07 11:52         ` John Garry
2025-02-04 12:01 ` [PATCH RFC 09/10] xfs: Update atomic write max size John Garry
2025-02-05 19:41   ` Darrick J. Wong
2025-02-06  9:15     ` John Garry
2025-02-06 21:54       ` Darrick J. Wong
2025-02-07 11:53         ` John Garry
2025-02-04 12:01 ` [PATCH RFC 10/10] xfs: Allow block allocator to take an alignment hint John Garry
2025-02-05 19:20   ` Darrick J. Wong
2025-02-06  8:10     ` John Garry
2025-02-06 21:54       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox