[PATCH v4 00/12] large atomic writes for xfs with CoW

* [PATCH v4 00/12] large atomic writes for xfs with CoW
@ 2025-03-03 17:11 John Garry
  2025-03-03 17:11 ` [PATCH v4 01/12] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
                   ` (12 more replies)
  0 siblings, 13 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently atomic write support for xfs is limited to writing a single
block as we have no way to guarantee alignment and that the write covers
a single extent.

This series introduces a software-emulated method for issuing atomic
writes.

The software-emulated method is used as a fallback when attempting to
issue an atomic write over misaligned or multiple extents.

For XFS, this support is based on CoW.

The basic idea of this CoW method is to alloc a range in the CoW fork,
write the data, and atomically update the mapping.
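
As a rough illustration of that idea only, the following userspace toy
stages the new data in freshly allocated blocks and then switches the
file mapping in a single step. Everything here (struct toy_file,
toy_cow_atomic_write(), the bump allocator) is hypothetical and is not
code from this series; old blocks are never freed and there is no
journalling, so it only shows the ordering of the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBLOCKS   8	/* blocks in the toy "disk" */
#define BLKSZ     4	/* bytes per toy block */
#define FILE_BLKS 4	/* file blocks tracked by the toy mapping */

struct toy_file {
	char disk[NBLOCKS][BLKSZ];	/* backing storage */
	int  data_map[FILE_BLKS];	/* file block -> disk block ("data fork") */
	int  next_free;			/* trivial bump allocator */
};

/*
 * Write @nblks file blocks starting at file block @off from @buf, out of
 * place. Returns 0 on success, or -1 if there is not enough free space,
 * in which case neither the old data nor the mapping is touched.
 */
static int toy_cow_atomic_write(struct toy_file *f, int off, int nblks,
				const char *buf)
{
	int staged[FILE_BLKS];	/* staging map, standing in for the CoW fork */
	int i;

	assert(off >= 0 && nblks > 0 && off + nblks <= FILE_BLKS);
	if (f->next_free + nblks > NBLOCKS)
		return -1;

	/* Step 1: allocate new blocks and write the data there. */
	for (i = 0; i < nblks; i++) {
		staged[i] = f->next_free++;
		memcpy(f->disk[staged[i]], buf + (size_t)i * BLKSZ, BLKSZ);
	}

	/* Step 2: switch the mapping; a real FS does this in one transaction. */
	for (i = 0; i < nblks; i++)
		f->data_map[off + i] = staged[i];
	return 0;
}
```

A reader following data_map[] sees either all of the old blocks or all
of the new ones, provided step 2 is made atomic, which is what the
transaction commit is assumed to provide in the real implementation.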

Initial MySQL performance testing has shown this method to perform ok.
However, that testing only uses 16K atomic writes (with a 4K block
size), so typically - and thankfully - this software fallback method
won't be used often.

For other FSes which want large atomic writes and don't support CoW, I
think that they can follow the example in [0].

Based on 0a1fd78080c8 (xfs/xfs-6.15-merge) Merge branch
'vfs-6.15.iomap' of
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into
xfs-6.15-merge

[0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/

Differences to v3:
- Error !reflink in xfs_atomic_write_sw_iomap_begin() (Darrick)
- Fix unused variable (kbuild bot)
- Add RB tags from Darrick (Thanks!)

Differences to v2:
(all from Darrick)
- Add dedicated function for xfs iomap sw-based atomic write
- Don't ignore xfs_reflink_end_atomic_cow() -> xfs_trans_commit() return
  value
- Pass flags for reflink alloc functions
- Rename IOMAP_ATOMIC_COW -> IOMAP_ATOMIC_SW
- Coding style corrections and comment improvements
- Add RB tags (thanks!)

Differences to RFC:
- Rework CoW alloc method
- Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
- Rework transaction commit func args
- Change resblks size for transaction commit
- Rename BMAPI extszhint align flag

John Garry (11):
  xfs: Pass flags to xfs_reflink_allocate_cow()
  iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  xfs: Switch atomic write size check in xfs_file_write_iter()
  xfs: Refactor xfs_reflink_end_cow_extent()
  iomap: Support SW-based atomic writes
  xfs: Reflink CoW-based atomic write support
  xfs: Iomap SW-based atomic write support
  xfs: Add xfs_file_dio_write_atomic()
  xfs: Commit CoW-based atomic writes atomically
  xfs: Update atomic write max size
  xfs: Allow block allocator to take an alignment hint

Ritesh Harjani (IBM) (1):
  iomap: Lift blocksize restriction on atomic writes

 .../filesystems/iomap/operations.rst          |  20 ++-
 fs/ext4/inode.c                               |   2 +-
 fs/iomap/direct-io.c                          |  20 +--
 fs/iomap/trace.h                              |   2 +-
 fs/xfs/libxfs/xfs_bmap.c                      |   7 +-
 fs/xfs/libxfs/xfs_bmap.h                      |   6 +-
 fs/xfs/xfs_file.c                             |  59 ++++++-
 fs/xfs/xfs_iomap.c                            | 144 ++++++++++++++++-
 fs/xfs/xfs_iomap.h                            |   1 +
 fs/xfs/xfs_iops.c                             |  31 +++-
 fs/xfs/xfs_iops.h                             |   2 +
 fs/xfs/xfs_mount.c                            |  28 ++++
 fs/xfs/xfs_mount.h                            |   1 +
 fs/xfs/xfs_reflink.c                          | 145 +++++++++++++-----
 fs/xfs/xfs_reflink.h                          |  11 +-
 include/linux/iomap.h                         |   8 +-
 16 files changed, 415 insertions(+), 72 deletions(-)

-- 
2.31.1


* [PATCH v4 01/12] xfs: Pass flags to xfs_reflink_allocate_cow()
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

In future we will want more boolean options for xfs_reflink_allocate_cow(),
so prepare for this by replacing the @convert_now arg with a flags arg.

Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c   |  7 +++++--
 fs/xfs/xfs_reflink.c | 10 ++++++----
 fs/xfs/xfs_reflink.h |  7 ++++++-
 3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 46acf727cbe7..2e9230fa1140 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -810,6 +810,7 @@ xfs_direct_write_iomap_begin(
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
 	int			nimaps = 1, error = 0;
+	unsigned int		reflink_flags = 0;
 	bool			shared = false;
 	u16			iomap_flags = 0;
 	unsigned int		lockmode;
@@ -820,6 +821,9 @@ xfs_direct_write_iomap_begin(
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
+	if (flags & IOMAP_DIRECT || IS_DAX(inode))
+		reflink_flags |= XFS_REFLINK_CONVERT;
+
 	/*
 	 * Writes that span EOF might trigger an IO size update on completion,
 	 * so consider them to be dirty for the purposes of O_DSYNC even if
@@ -864,8 +868,7 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode,
-				(flags & IOMAP_DIRECT) || IS_DAX(inode));
+				&lockmode, reflink_flags);
 		if (error)
 			goto out_unlock;
 		if (shared)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 59f7fc16eb80..0eb2670fc6fb 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -435,7 +435,7 @@ xfs_reflink_fill_cow_hole(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	unsigned int		flags)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
@@ -488,7 +488,8 @@ xfs_reflink_fill_cow_hole(
 		return error;
 
 convert:
-	return xfs_reflink_convert_unwritten(ip, imap, cmap, convert_now);
+	return xfs_reflink_convert_unwritten(ip, imap, cmap,
+			flags & XFS_REFLINK_CONVERT);
 
 out_trans_cancel:
 	xfs_trans_cancel(tp);
@@ -566,10 +567,11 @@ xfs_reflink_allocate_cow(
 	struct xfs_bmbt_irec	*cmap,
 	bool			*shared,
 	uint			*lockmode,
-	bool			convert_now)
+	unsigned int		flags)
 {
 	int			error;
 	bool			found;
+	bool			convert_now = flags & XFS_REFLINK_CONVERT;
 
 	xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 	if (!ip->i_cowfp) {
@@ -592,7 +594,7 @@ xfs_reflink_allocate_cow(
 	 */
 	if (cmap->br_startoff > imap->br_startoff)
 		return xfs_reflink_fill_cow_hole(ip, imap, cmap, shared,
-				lockmode, convert_now);
+				lockmode, flags);
 
 	/*
 	 * CoW fork has a delalloc reservation. Replace it with a real extent.
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cc4e92278279..cdbd73d58822 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -6,6 +6,11 @@
 #ifndef __XFS_REFLINK_H
 #define __XFS_REFLINK_H 1
 
+/*
+ * Flags for xfs_reflink_allocate_cow()
+ */
+#define XFS_REFLINK_CONVERT	(1u << 0) /* convert unwritten extents now */
+
 /*
  * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
  * to do so when an inode has dirty cache or I/O in-flight, even if no shared
@@ -32,7 +37,7 @@ int xfs_bmap_trim_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 
 int xfs_reflink_allocate_cow(struct xfs_inode *ip, struct xfs_bmbt_irec *imap,
 		struct xfs_bmbt_irec *cmap, bool *shared, uint *lockmode,
-		bool convert_now);
+		unsigned int flags);
 extern int xfs_reflink_convert_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
 
-- 
2.31.1


* [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
  2025-03-03 17:11 ` [PATCH v4 01/12] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-05 12:57   ` Carlos Maiolino
  2025-03-03 17:11 ` [PATCH v4 03/12] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

In future xfs will support a SW-based atomic write, so rename
IOMAP_ATOMIC -> IOMAP_ATOMIC_HW to be clear which mode is being used.

Also relocate setting of IOMAP_ATOMIC_HW to the write path in
__iomap_dio_rw(), to be clear that this flag is only relevant to writes.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/filesystems/iomap/operations.rst |  4 ++--
 fs/ext4/inode.c                                |  2 +-
 fs/iomap/direct-io.c                           | 18 +++++++++---------
 fs/iomap/trace.h                               |  2 +-
 include/linux/iomap.h                          |  2 +-
 5 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index d1535109587a..0b9d7be23bce 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -514,8 +514,8 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    if the mapping is unwritten and the filesystem cannot handle zeroing
    the unaligned regions without exposing stale contents.
 
- * ``IOMAP_ATOMIC``: This write is being issued with torn-write
-   protection.
+ * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
+   protection based on HW-offload support.
    Only a single bio can be created for the write, and the write must
    not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
    set.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7c54ae5fcbd4..ba2f1e3db7c7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
 		return false;
 
 	/* atomic writes are all-or-nothing */
-	if (flags & IOMAP_ATOMIC)
+	if (flags & IOMAP_ATOMIC_HW)
 		return false;
 
 	/* can only try again if we wrote nothing */
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index e1e32e2bb0bf..c696ce980796 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -317,7 +317,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
  * clearing the WRITE_THROUGH flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
-		const struct iomap *iomap, bool use_fua, bool atomic)
+		const struct iomap *iomap, bool use_fua, bool atomic_hw)
 {
 	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
 
@@ -329,7 +329,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		opflags |= REQ_FUA;
 	else
 		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
-	if (atomic)
+	if (atomic_hw)
 		opflags |= REQ_ATOMIC;
 
 	return opflags;
@@ -340,8 +340,8 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
+	bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
 	const loff_t length = iomap_length(iter);
-	bool atomic = iter->flags & IOMAP_ATOMIC;
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
@@ -351,7 +351,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	u64 copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic_hw && length != fs_block_size)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
@@ -428,7 +428,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 			goto out;
 	}
 
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
 
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
@@ -461,7 +461,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 		}
 
 		n = bio->bi_iter.bi_size;
-		if (WARN_ON_ONCE(atomic && n != length)) {
+		if (WARN_ON_ONCE(atomic_hw && n != length)) {
 			/*
 			 * This bio should have covered the complete length,
 			 * which it doesn't, so error. We may need to zero out
@@ -652,9 +652,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
-	if (iocb->ki_flags & IOCB_ATOMIC)
-		iomi.flags |= IOMAP_ATOMIC;
-
 	if (iov_iter_rw(iter) == READ) {
 		/* reads can always complete inline */
 		dio->flags |= IOMAP_DIO_INLINE_COMP;
@@ -689,6 +686,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			iomi.flags |= IOMAP_ATOMIC_HW;
+
 		/* for data sync or sync, we need sync completion processing */
 		if (iocb_is_dsync(iocb)) {
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 9eab2c8ac3c5..69af89044ebd 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -99,7 +99,7 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
 	{ IOMAP_NOWAIT,		"NOWAIT" }, \
-	{ IOMAP_ATOMIC,		"ATOMIC" }
+	{ IOMAP_ATOMIC_HW,	"ATOMIC_HW" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index ea29388b2fba..87cd7079aaf3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -189,7 +189,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC		(1 << 9)
+#define IOMAP_ATOMIC_HW		(1 << 9)
 #define IOMAP_DONTCACHE		(1 << 10)
 
 struct iomap_ops {
-- 
2.31.1


* [PATCH v4 03/12] xfs: Switch atomic write size check in xfs_file_write_iter()
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
  2025-03-03 17:11 ` [PATCH v4 01/12] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
  2025-03-03 17:11 ` [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 04/12] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently the size of atomic write allowed is fixed at the blocksize.

To start to lift this restriction, refactor xfs_report_atomic_write()
into a helper - xfs_get_atomic_write_attr() - and use that helper to
find the per-inode atomic write limits and check according to that.
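
The new length check can be sketched in userspace as below. This is only
a model of the comparison in xfs_file_write_iter(); struct
toy_atomic_limits and toy_atomic_len_ok() are hypothetical names, and the
other validation done by generic_atomic_write_valid() is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the per-inode limits reported by
 * xfs_get_atomic_write_attr(). Both fields are 0 when the inode cannot
 * do atomic writes; with this patch both still equal the FS block size.
 */
struct toy_atomic_limits {
	unsigned int unit_min;
	unsigned int unit_max;
};

/*
 * Mirror of the new check: reject any length outside
 * [unit_min, unit_max]. A zero-length write is assumed to have been
 * handled by the caller before this point.
 */
static bool toy_atomic_len_ok(const struct toy_atomic_limits *lim,
			      size_t count)
{
	if (count < lim->unit_min || count > lim->unit_max)
		return false;
	return true;
}
```

With limits of {4096, 4096} only a 4096 byte write passes, which matches
the old behaviour; raising unit_max later widens the accepted range
without touching the caller.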

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 +++++-------
 fs/xfs/xfs_iops.c | 20 +++++++++++++++++---
 fs/xfs/xfs_iops.h |  2 ++
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a81c3e943f20..51b4a43d15f3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -853,14 +853,12 @@ xfs_file_write_iter(
 		return xfs_file_dax_write(iocb, from);
 
 	if (iocb->ki_flags & IOCB_ATOMIC) {
-		/*
-		 * Currently only atomic writing of a single FS block is
-		 * supported. It would be possible to atomic write smaller than
-		 * a FS block, but there is no requirement to support this.
-		 * Note that iomap also does not support this yet.
-		 */
-		if (ocount != ip->i_mount->m_sb.sb_blocksize)
+		unsigned int	unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		if (ocount < unit_min || ocount > unit_max)
 			return -EINVAL;
+
 		ret = generic_atomic_write_valid(iocb, from);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 40289fe6f5b2..ea79fb246e33 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -600,15 +600,29 @@ xfs_report_dioalign(
 		stat->dio_offset_align = stat->dio_read_offset_align;
 }
 
+void
+xfs_get_atomic_write_attr(
+	struct xfs_inode	*ip,
+	unsigned int		*unit_min,
+	unsigned int		*unit_max)
+{
+	if (!xfs_inode_can_atomicwrite(ip)) {
+		*unit_min = *unit_max = 0;
+		return;
+	}
+
+	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+}
+
 static void
 xfs_report_atomic_write(
 	struct xfs_inode	*ip,
 	struct kstat		*stat)
 {
-	unsigned int		unit_min = 0, unit_max = 0;
+	unsigned int		unit_min, unit_max;
+
+	xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
 
-	if (xfs_inode_can_atomicwrite(ip))
-		unit_min = unit_max = ip->i_mount->m_sb.sb_blocksize;
 	generic_fill_statx_atomic_writes(stat, unit_min, unit_max);
 }
 
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..d95a543f3ab0 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,7 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+void xfs_get_atomic_write_attr(struct xfs_inode *ip,
+		unsigned int *unit_min, unsigned int *unit_max);
 
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1


* [PATCH v4 04/12] xfs: Refactor xfs_reflink_end_cow_extent()
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (2 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 03/12] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 05/12] iomap: Support SW-based atomic writes John Garry
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.

This refactoring will be used in future when it is required to commit a
range of extents as a single transaction, similar to how it was done
before commit d6f215f359637.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 73 ++++++++++++++++++++++++++------------------
 1 file changed, 43 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 0eb2670fc6fb..3b1b7a56af34 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -788,35 +788,19 @@ xfs_reflink_update_quota(
  * requirements as low as possible.
  */
 STATIC int
-xfs_reflink_end_cow_extent(
+xfs_reflink_end_cow_extent_locked(
+	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	xfs_fileoff_t		*offset_fsb,
 	xfs_fileoff_t		end_fsb)
 {
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_irec	got, del, data;
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, XFS_COW_FORK);
-	unsigned int		resblks;
 	int			nmaps;
 	bool			isrt = XFS_IS_REALTIME_INODE(ip);
 	int			error;
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
-			XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	/*
-	 * Lock the inode.  We have to ijoin without automatic unlock because
-	 * the lead transaction is the refcountbt record deletion; the data
-	 * fork update follows as a deferred log item.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_trans_ijoin(tp, ip, 0);
-
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
@@ -825,7 +809,7 @@ xfs_reflink_end_cow_extent(
 	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
 	    got.br_startoff >= end_fsb) {
 		*offset_fsb = end_fsb;
-		goto out_cancel;
+		return 0;
 	}
 
 	/*
@@ -839,7 +823,7 @@ xfs_reflink_end_cow_extent(
 		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
 		    got.br_startoff >= end_fsb) {
 			*offset_fsb = end_fsb;
-			goto out_cancel;
+			return 0;
 		}
 	}
 	del = got;
@@ -848,14 +832,14 @@ xfs_reflink_end_cow_extent(
 	error = xfs_iext_count_extend(tp, ip, XFS_DATA_FORK,
 			XFS_IEXT_REFLINK_END_COW_CNT);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;
 	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
 			&nmaps, 0);
 	if (error)
-		goto out_cancel;
+		return error;
 
 	/* We can only remap the smaller of the two extent sizes. */
 	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
@@ -884,7 +868,7 @@ xfs_reflink_end_cow_extent(
 		error = xfs_bunmapi(NULL, ip, data.br_startoff,
 				data.br_blockcount, 0, 1, &done);
 		if (error)
-			goto out_cancel;
+			return error;
 		ASSERT(done);
 	}
 
@@ -901,17 +885,46 @@ xfs_reflink_end_cow_extent(
 	/* Remove the mapping from the CoW fork. */
 	xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
 
-	error = xfs_trans_commit(tp);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	if (error)
-		return error;
-
 	/* Update the caller about how much progress we made. */
 	*offset_fsb = del.br_startoff + del.br_blockcount;
 	return 0;
+}
 
-out_cancel:
-	xfs_trans_cancel(tp);
+
+/*
+ * Remap part of the CoW fork into the data fork.
+ *
+ * We aim to remap the range starting at @offset_fsb and ending at @end_fsb
+ * into the data fork; this function will remap what it can (at the end of the
+ * range) and update @end_fsb appropriately.  Each remap gets its own
+ * transaction because we can end up merging and splitting bmbt blocks for
+ * every remap operation and we'd like to keep the block reservation
+ * requirements as low as possible.
+ */
+STATIC int
+xfs_reflink_end_cow_extent(
+	struct xfs_inode	*ip,
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	unsigned int		resblks;
+	int			error;
+
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	error = xfs_reflink_end_cow_extent_locked(tp, ip, offset_fsb, end_fsb);
+	if (error)
+		xfs_trans_cancel(tp);
+	else
+		error = xfs_trans_commit(tp);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-- 
2.31.1


* [PATCH v4 05/12] iomap: Support SW-based atomic writes
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (3 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 04/12] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-09 21:51   ` Dave Chinner
  2025-03-03 17:11 ` [PATCH v4 06/12] iomap: Lift blocksize restriction on " John Garry
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Currently atomic write support requires dedicated HW support. This imposes
a restriction on the filesystem that disk blocks need to be aligned and
contiguously mapped to FS blocks to issue atomic writes.

XFS has no method to guarantee FS block alignment for regular,
non-RT files. As such, atomic writes are currently limited to 1x FS block
there.

To deal with the scenario that we are issuing an atomic write over
misaligned or discontiguous data blocks - and raise the atomic write size
limit - support a SW-based atomic write mode. For XFS, SW-based atomic
writes would use CoW support to issue emulated untorn writes.

It is the responsibility of the FS to detect discontiguous atomic writes
and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
SW-based atomic writes could be used always when the mounted bdev does
not support HW offload, but this strategy is not initially expected to be
used.
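
The mode selection added to __iomap_dio_rw() can be modelled in
userspace as below. The three IOMAP bit values are the ones defined in
this patch; TOY_IOCB_ATOMIC is a placeholder value chosen for the
example, and toy_atomic_iomap_flags() is a hypothetical helper. The
retry itself is left to the filesystem and is not modelled.

```c
/* Bit values as defined in this patch. */
#define IOMAP_ATOMIC_HW		(1u << 9)  /* HW-based torn-write protection */
#define IOMAP_ATOMIC_SW		(1u << 11) /* SW-based torn-write protection */
#define IOMAP_DIO_ATOMIC_SW	(1u << 3)  /* caller asks for the SW mode */

/* Placeholder: the real IOCB_ATOMIC value is not taken from this patch. */
#define TOY_IOCB_ATOMIC		(1u << 0)

/*
 * Return the atomic iomap flag for a direct write. The SW mode wins when
 * the filesystem requested it; otherwise an atomic iocb uses the HW mode.
 */
static unsigned int toy_atomic_iomap_flags(unsigned int ki_flags,
					   unsigned int dio_flags)
{
	if (dio_flags & IOMAP_DIO_ATOMIC_SW)
		return IOMAP_ATOMIC_SW;
	if (ki_flags & TOY_IOCB_ATOMIC)
		return IOMAP_ATOMIC_HW;
	return 0;
}
```

The two modes are mutually exclusive in this model: a filesystem that
retries with IOMAP_DIO_ATOMIC_SW never sees IOMAP_ATOMIC_HW set for that
retry.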

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/filesystems/iomap/operations.rst | 16 ++++++++++++++--
 fs/iomap/direct-io.c                           |  4 +++-
 include/linux/iomap.h                          |  8 +++++++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 0b9d7be23bce..b08a79d11d9f 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -526,8 +526,20 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    conversion or copy on write), all updates for the entire file range
    must be committed atomically as well.
    Only one space mapping is allowed per untorn write.
-   Untorn writes must be aligned to, and must not be longer than, a
-   single file block.
+   Untorn writes may be longer than a single file block. In all cases,
+   the mapping start disk block must have at least the same alignment as
+   the write offset.
+
+ * ``IOMAP_ATOMIC_SW``: This write is being issued with torn-write
+   protection via a software mechanism provided by the filesystem.
+   All the disk block alignment and single bio restrictions which apply
+   to IOMAP_ATOMIC_HW do not apply here.
+   SW-based untorn writes would typically be used as a fallback when
+   HW-based untorn writes may not be issued, e.g. the range of the write
+   covers multiple extents, meaning that it is not possible to issue
+   a single bio.
+   All filesystem metadata updates for the entire file range must be
+   committed atomically as well.
 
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c696ce980796..c594f2cf3ab4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -686,7 +686,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
-		if (iocb->ki_flags & IOCB_ATOMIC)
+		if (dio_flags & IOMAP_DIO_ATOMIC_SW)
+			iomi.flags |= IOMAP_ATOMIC_SW;
+		else if (iocb->ki_flags & IOCB_ATOMIC)
 			iomi.flags |= IOMAP_ATOMIC_HW;
 
 		/* for data sync or sync, we need sync completion processing */
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 87cd7079aaf3..9cd93530013c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -189,8 +189,9 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
-#define IOMAP_ATOMIC_HW		(1 << 9)
+#define IOMAP_ATOMIC_HW		(1 << 9) /* HW-based torn-write protection */
 #define IOMAP_DONTCACHE		(1 << 10)
+#define IOMAP_ATOMIC_SW		(1 << 11)/* SW-based torn-write protection */
 
 struct iomap_ops {
 	/*
@@ -502,6 +503,11 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * Use software-based torn-write protection.
+ */
+#define IOMAP_DIO_ATOMIC_SW		(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
-- 
2.31.1


* [PATCH v4 06/12] iomap: Lift blocksize restriction on atomic writes
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (4 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 05/12] iomap: Support SW-based atomic writes John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 07/12] xfs: Reflink CoW-based atomic write support John Garry
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

Filesystems like ext4 can submit writes in multiples of the block size,
but we still can't allow those writes to be split. Hence check whether
iomap_length() is the same as iter->len.

It is the role of the FS to ensure that a single mapping may be created
for an atomic write. The FS will also continue to check size and alignment
legality.
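
A userspace sketch of why this check catches a split write follows. It
assumes a simplified iomap_length() that returns the smaller of the
remaining request length and the bytes left in the current mapping, and
it ignores the source map; the toy_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of iomap_length(): bytes of the request covered by
 * the current mapping. Assumes @pos lies inside the mapping.
 */
static uint64_t toy_iomap_length(uint64_t pos, uint64_t len,
				 uint64_t map_off, uint64_t map_len)
{
	uint64_t avail = map_off + map_len - pos;

	return len < avail ? len : avail;
}

/*
 * Mirror of the new check: a HW atomic write must be covered by one
 * mapping, i.e. the mapped length must equal the full request length.
 */
static int toy_atomic_hw_check(bool atomic_hw, uint64_t pos, uint64_t len,
			       uint64_t map_off, uint64_t map_len)
{
	uint64_t length = toy_iomap_length(pos, len, map_off, map_len);

	if (atomic_hw && length != len)
		return -EINVAL;
	return 0;
}
```

A 16K write at offset 16K passes when one 64K mapping starting at 0
covers it, and fails when the mapping at 16K is only 4K long, since a
second mapping (and so a second bio) would be needed.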

Signed-off-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
jpg: Tweak commit message
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c594f2cf3ab4..5299f70428ef 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -351,7 +351,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
 	u64 copied = 0;
 	size_t orig_count;
 
-	if (atomic_hw && length != fs_block_size)
+	if (atomic_hw && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-- 
2.31.1


* [PATCH v4 07/12] xfs: Reflink CoW-based atomic write support
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (5 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 06/12] iomap: Lift blocksize restriction on " John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 08/12] xfs: Iomap SW-based " John Garry
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Base SW-based atomic writes on CoW.

For SW-based atomic write support, always allocate a cow hole in
xfs_reflink_allocate_cow() to write the new data.

The semantics are that if @atomic_sw is set, the caller is passed a CoW
fork extent mapping whenever no error is returned.
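
The changed early-return condition can be modelled as below. The flag
values are the ones defined in xfs_reflink.h by this series;
toy_wants_cow() is a hypothetical helper that only captures the
decision, not the allocation.

```c
#include <stdbool.h>

/* Flag values as defined in xfs_reflink.h by this series. */
#define XFS_REFLINK_CONVERT	(1u << 0) /* convert unwritten extents now */
#define XFS_REFLINK_ATOMIC_SW	(1u << 1) /* alloc for SW-based atomic write */

/*
 * Return true when the write should go through the CoW fork. Before this
 * patch only shared extents took that path; with XFS_REFLINK_ATOMIC_SW
 * set, unshared extents take it too.
 */
static bool toy_wants_cow(bool shared, unsigned int flags)
{
	if (!shared && !(flags & XFS_REFLINK_ATOMIC_SW))
		return false;	/* unshared, not atomic SW: write in place */
	return true;
}
```

So an unshared extent is written in place for an ordinary write, but is
redirected to the CoW fork when the SW-based atomic mode is requested.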

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_reflink.c | 5 +++--
 fs/xfs/xfs_reflink.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 3b1b7a56af34..97dc38841063 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -444,6 +444,7 @@ xfs_reflink_fill_cow_hole(
 	int			nimaps;
 	int			error;
 	bool			found;
+	bool			atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
 
 	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
 		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -466,7 +467,7 @@ xfs_reflink_fill_cow_hole(
 	*lockmode = XFS_ILOCK_EXCL;
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !atomic_sw))
 		goto out_trans_cancel;
 
 	if (found) {
@@ -580,7 +581,7 @@ xfs_reflink_allocate_cow(
 	}
 
 	error = xfs_find_trim_cow_extent(ip, imap, cmap, shared, &found);
-	if (error || !*shared)
+	if (error || (!*shared && !(flags & XFS_REFLINK_ATOMIC_SW)))
 		return error;
 
 	/* CoW fork has a real extent */
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index cdbd73d58822..dfd94e51e2b4 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -10,6 +10,7 @@
  * Flags for xfs_reflink_allocate_cow()
  */
 #define XFS_REFLINK_CONVERT	(1u << 0) /* convert unwritten extents now */
+#define XFS_REFLINK_ATOMIC_SW	(1u << 1) /* alloc for SW-based atomic write */
 
 /*
  * Check whether it is safe to free COW fork blocks from an inode. It is unsafe
-- 
2.31.1



* [PATCH v4 08/12] xfs: Iomap SW-based atomic write support
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (6 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 07/12] xfs: Reflink CoW-based atomic write support John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic() John Garry
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

In cases where an atomic write covers misaligned or discontiguous disk
blocks, we will use a CoW-based method to issue the atomic write.

So, for that case, return -EAGAIN to request that the write be issued in
CoW atomic write mode. The dio write path should detect this, similar to
how misaligned regular DIO writes are handled.

For normal HW-based mode, when the range which we are atomic writing to
covers a shared data extent, try to allocate a new CoW fork. However, if
we find that what we allocated does not meet atomic write requirements
in terms of length and alignment, then fall back on the CoW-based mode
for the atomic write.
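
As a rough userspace illustration of the validity check described above
(not part of this patch), the sketch below mirrors the two conditions:
the start block must be aligned to the mapping length, and a single
mapping must cover the whole write. The names struct extent_map,
map_spans_range() and map_valid_for_hw_atomic() are hypothetical
stand-ins. The sketch assumes that "spans" means full coverage of the
range, and it uses a plain modulo where the kernel uses IS_ALIGNED().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct xfs_bmbt_irec. */
struct extent_map {
	uint64_t startoff;	/* file offset of the mapping, in fs blocks */
	uint64_t startblock;	/* disk block the mapping starts at */
	uint64_t blockcount;	/* length of the mapping, in fs blocks */
};

/* True if the mapping covers all of [offset_fsb, end_fsb). */
static bool map_spans_range(const struct extent_map *m,
			    uint64_t offset_fsb, uint64_t end_fsb)
{
	if (m->startoff > offset_fsb)
		return false;
	if (m->startoff + m->blockcount < end_fsb)
		return false;
	return true;
}

/* True if one HW atomic write could be issued against this mapping. */
static bool map_valid_for_hw_atomic(const struct extent_map *m,
				    uint64_t offset_fsb, uint64_t end_fsb)
{
	/* Callers are assumed to pass a non-empty range. */
	assert(end_fsb > offset_fsb);

	/* Misaligned start block with respect to the mapping length. */
	if (m->blockcount == 0 || m->startblock % m->blockcount != 0)
		return false;

	/* Discontiguous or mixed extents. */
	if (!map_spans_range(m, offset_fsb, end_fsb))
		return false;

	return true;
}
```

When a check like this fails, the write is not rejected outright: the
caller sees -EAGAIN and is expected to retry in CoW-based mode.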

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 137 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h |   1 +
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2e9230fa1140..2228330deebe 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -795,6 +795,23 @@ imap_spans_range(
 	return true;
 }
 
+static bool
+xfs_bmap_valid_for_atomic_write(
+	struct xfs_bmbt_irec	*imap,
+	xfs_fileoff_t		offset_fsb,
+	xfs_fileoff_t		end_fsb)
+{
+	/* Misaligned start block wrt size */
+	if (!IS_ALIGNED(imap->br_startblock, imap->br_blockcount))
+		return false;
+
+	/* Discontiguous or mixed extents */
+	if (!imap_spans_range(imap, offset_fsb, end_fsb))
+		return false;
+
+	return true;
+}
+
 static int
 xfs_direct_write_iomap_begin(
 	struct inode		*inode,
@@ -809,10 +826,13 @@ xfs_direct_write_iomap_begin(
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	xfs_fileoff_t		orig_end_fsb = end_fsb;
+	bool			atomic_hw = flags & IOMAP_ATOMIC_HW;
 	int			nimaps = 1, error = 0;
 	unsigned int		reflink_flags = 0;
 	bool			shared = false;
 	u16			iomap_flags = 0;
+	bool			needs_alloc;
 	unsigned int		lockmode;
 	u64			seq;
 
@@ -871,13 +891,37 @@ xfs_direct_write_iomap_begin(
 				&lockmode, reflink_flags);
 		if (error)
 			goto out_unlock;
-		if (shared)
+		if (shared) {
+			if (atomic_hw &&
+			    !xfs_bmap_valid_for_atomic_write(&cmap,
+					offset_fsb, end_fsb)) {
+				error = -EAGAIN;
+				goto out_unlock;
+			}
 			goto out_found_cow;
+		}
 		end_fsb = imap.br_startoff + imap.br_blockcount;
 		length = XFS_FSB_TO_B(mp, end_fsb) - offset;
 	}
 
-	if (imap_needs_alloc(inode, flags, &imap, nimaps))
+	needs_alloc = imap_needs_alloc(inode, flags, &imap, nimaps);
+
+	if (atomic_hw) {
+		error = -EAGAIN;
+		/*
+		 * Use CoW method for when we need to alloc > 1 block,
+		 * otherwise we might allocate less than what we need here and
+		 * have multiple mappings.
+		 */
+		if (needs_alloc && orig_end_fsb - offset_fsb > 1)
+			goto out_unlock;
+
+		if (!xfs_bmap_valid_for_atomic_write(&imap, offset_fsb,
+				orig_end_fsb))
+			goto out_unlock;
+	}
+
+	if (needs_alloc)
 		goto allocate_blocks;
 
 	/*
@@ -965,6 +1009,95 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_atomic_write_sw_iomap_begin(
+	struct inode		*inode,
+	loff_t			offset,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	imap, cmap;
+	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
+	int			nimaps = 1, error;
+	bool			shared = false;
+	unsigned int		lockmode = XFS_ILOCK_EXCL;
+	u64			seq;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	if (!xfs_has_reflink(mp))
+		return -EINVAL;
+
+	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
+	if (error)
+		return error;
+
+	error = xfs_bmapi_read(ip, offset_fsb, end_fsb - offset_fsb, &imap,
+			&nimaps, 0);
+	if (error)
+		goto out_unlock;
+
+	error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
+			&lockmode, XFS_REFLINK_CONVERT |
+			XFS_REFLINK_ATOMIC_SW);
+	/*
+	 * Don't check @shared. For atomic writes, we should error when
+	 * we don't get a COW mapping
+	 */
+	if (error)
+		goto out_unlock;
+
+	end_fsb = imap.br_startoff + imap.br_blockcount;
+
+	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
+	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
+	if (imap.br_startblock != HOLESTARTBLOCK) {
+		seq = xfs_iomap_inode_sequence(ip, 0);
+		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
+		if (error)
+			goto out_unlock;
+	}
+	seq = xfs_iomap_inode_sequence(ip, IOMAP_F_SHARED);
+	xfs_iunlock(ip, lockmode);
+	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
+
+out_unlock:
+	if (lockmode)
+		xfs_iunlock(ip, lockmode);
+	return error;
+}
+
+static int
+xfs_atomic_write_iomap_begin(
+	struct inode		*inode,
+	loff_t			offset,
+	loff_t			length,
+	unsigned		flags,
+	struct iomap		*iomap,
+	struct iomap		*srcmap)
+{
+	ASSERT(flags & IOMAP_WRITE);
+	ASSERT(flags & IOMAP_DIRECT);
+
+	if (flags & IOMAP_ATOMIC_SW)
+		return xfs_atomic_write_sw_iomap_begin(inode, offset, length,
+				flags, iomap, srcmap);
+
+	ASSERT(flags & IOMAP_ATOMIC_HW);
+	return xfs_direct_write_iomap_begin(inode, offset, length, flags,
+			iomap, srcmap);
+}
+
+const struct iomap_ops xfs_atomic_write_iomap_ops = {
+	.iomap_begin		= xfs_atomic_write_iomap_begin,
+};
+
 static int
 xfs_dax_write_iomap_end(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 8347268af727..b7fbbc909943 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -53,5 +53,6 @@ extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
 extern const struct iomap_ops xfs_dax_write_iomap_ops;
+extern const struct iomap_ops xfs_atomic_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.31.1



* [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic()
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (7 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 08/12] xfs: Iomap SW-based " John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-10 13:39   ` Ritesh Harjani
  2025-03-03 17:11 ` [PATCH v4 10/12] xfs: Commit CoW-based atomic writes atomically John Garry
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.

In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
in CoW-based atomic write mode.

For CoW-based mode, ensure that we have no outstanding IOs which we
may trample on.
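
The retry flow can be sketched in userspace roughly as follows. This is
only an illustration under stated assumptions: issue_fn is a
hypothetical callback standing in for iomap_dio_rw(), mock_issue() is a
made-up example callback, and the locking steps of the real function
(dropping the shared iolock, retaking it exclusively and waiting for
in-flight DIO) are reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum write_mode { MODE_HW, MODE_SW_COW };

/* Hypothetical stand-in for iomap_dio_rw(): returns 0 or a negative errno. */
typedef int (*issue_fn)(enum write_mode mode, void *ctx);

/*
 * Try HW-based mode first. On -EAGAIN, retry once in CoW-based mode,
 * unless the caller asked for non-blocking behaviour.
 */
static int atomic_write_with_fallback(issue_fn issue, void *ctx, bool nowait)
{
	enum write_mode mode = MODE_HW;
	int ret;

	assert(issue != NULL);

	for (;;) {
		ret = issue(mode, ctx);
		if (ret != -EAGAIN || nowait || mode == MODE_SW_COW)
			return ret;
		/*
		 * The real code switches to an exclusive iolock here and
		 * waits for outstanding DIO before retrying.
		 */
		mode = MODE_SW_COW;
	}
}

/* Example callback: the HW path always asks for the fallback. */
struct attempt_stats {
	int hw_attempts;
	int sw_attempts;
};

static int mock_issue(enum write_mode mode, void *ctx)
{
	struct attempt_stats *stats = ctx;

	if (mode == MODE_HW) {
		stats->hw_attempts++;
		return -EAGAIN;
	}
	stats->sw_attempts++;
	return 0;
}
```

With mock_issue(), a blocking caller sees one HW attempt followed by one
CoW attempt, while a non-blocking caller gets -EAGAIN back after the
first attempt.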

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 51b4a43d15f3..70eb6928cf63 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
 	return ret;
 }
 
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	unsigned int		dio_flags = 0;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock);
+	if (ret)
+		goto out_unlock;
+
+	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
+		inode_dio_wait(VFS_I(ip));
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
+			&xfs_dio_write_ops, dio_flags, NULL, 0);
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
+	    !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
+		xfs_iunlock(ip, iolock);
+		dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
+		iolock = XFS_IOLOCK_EXCL;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -723,6 +763,8 @@ xfs_file_dio_write(
 		return -EINVAL;
 	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from);
 }
 
-- 
2.31.1



* [PATCH v4 10/12] xfs: Commit CoW-based atomic writes atomically
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (8 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-03 17:11 ` [PATCH v4 11/12] xfs: Update atomic write max size John Garry
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.

For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
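
To illustrate why a single transaction matters, here is a toy userspace
model of all-or-nothing remapping. It is only an analogy: XFS achieves
this through its logged transactions, not by copying a map, and
struct file_map, struct cow_remap and remap_all_or_nothing() are
hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_FILE_BLOCKS 8	/* hypothetical tiny file, for illustration */

/* Hypothetical data-fork mapping: file block index -> disk block. */
struct file_map {
	unsigned long disk_block[NR_FILE_BLOCKS];
};

/* One pending CoW remap: point @file_block at @new_disk_block. */
struct cow_remap {
	size_t file_block;
	unsigned long new_disk_block;
};

/*
 * Apply every remap or none of them. Updates are staged in a private
 * copy and published with a single assignment, so a failure part-way
 * through leaves @map untouched.
 */
static bool remap_all_or_nothing(struct file_map *map,
				 const struct cow_remap *remaps, size_t nr)
{
	struct file_map staged;
	size_t i;

	assert(map != NULL && (remaps != NULL || nr == 0));
	staged = *map;

	for (i = 0; i < nr; i++) {
		if (remaps[i].file_block >= NR_FILE_BLOCKS)
			return false;	/* analogous to cancelling the transaction */
		staged.disk_block[remaps[i].file_block] = remaps[i].new_disk_block;
	}

	*map = staged;	/* analogous to committing the transaction */
	return true;
}
```

With one transaction per extent, a failure after the first remap would
leave a mix of old and new data visible, which is exactly what an
atomic write must not allow.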

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c    |  5 ++++-
 fs/xfs/xfs_reflink.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_reflink.h |  3 +++
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 70eb6928cf63..74806c8c004e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -527,7 +527,10 @@ xfs_dio_write_end_io(
 	nofs_flag = memalloc_nofs_save();
 
 	if (flags & IOMAP_DIO_COW) {
-		error = xfs_reflink_end_cow(ip, offset, size);
+		if (iocb->ki_flags & IOCB_ATOMIC)
+			error = xfs_reflink_end_atomic_cow(ip, offset, size);
+		else
+			error = xfs_reflink_end_cow(ip, offset, size);
 		if (error)
 			goto out;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 97dc38841063..844e2b43357b 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -987,6 +987,55 @@ xfs_reflink_end_cow(
 		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
 	return error;
 }
+int
+xfs_reflink_end_atomic_cow(
+	struct xfs_inode		*ip,
+	xfs_off_t			offset,
+	xfs_off_t			count)
+{
+	xfs_fileoff_t			offset_fsb;
+	xfs_fileoff_t			end_fsb;
+	int				error = 0;
+	struct xfs_mount		*mp = ip->i_mount;
+	struct xfs_trans		*tp;
+	unsigned int			resblks;
+
+	trace_xfs_reflink_end_cow(ip, offset, count);
+
+	offset_fsb = XFS_B_TO_FSBT(mp, offset);
+	end_fsb = XFS_B_TO_FSB(mp, offset + count);
+
+	/*
+	 * Each remapping operation could cause a btree split, so in the worst
+	 * case that's one for each block.
+	 */
+	resblks = (end_fsb - offset_fsb) *
+			XFS_NEXTENTADD_SPACE_RES(mp, 1, XFS_DATA_FORK);
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
+			XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, ip, 0);
+
+	while (end_fsb > offset_fsb && !error) {
+		error = xfs_reflink_end_cow_extent_locked(tp, ip, &offset_fsb,
+				end_fsb);
+	}
+	if (error) {
+		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
+		goto out_cancel;
+	}
+	error = xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+out_cancel:
+	xfs_trans_cancel(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	return error;
+}
 
 /*
  * Free all CoW staging blocks that are still referenced by the ondisk refcount
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index dfd94e51e2b4..4cb2ee53cd8d 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -49,6 +49,9 @@ extern int xfs_reflink_cancel_cow_range(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count, bool cancel_real);
 extern int xfs_reflink_end_cow(struct xfs_inode *ip, xfs_off_t offset,
 		xfs_off_t count);
+		int
+xfs_reflink_end_atomic_cow(struct xfs_inode *ip, xfs_off_t offset,
+		xfs_off_t count);
 extern int xfs_reflink_recover_cow(struct xfs_mount *mp);
 extern loff_t xfs_reflink_remap_range(struct file *file_in, loff_t pos_in,
 		struct file *file_out, loff_t pos_out, loff_t len,
-- 
2.31.1



* [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (9 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 10/12] xfs: Commit CoW-based atomic writes atomically John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-10 10:06   ` Carlos Maiolino
  2025-03-03 17:11 ` [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint John Garry
  2025-03-06  8:47 ` (subset) [PATCH v4 00/12] large atomic writes for xfs with CoW Christian Brauner
  12 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

Now that CoW-based atomic writes are supported, update the max size of an
atomic write.

For simplicity, limit at the max of what the mounted bdev can support in
terms of atomic write limits. Maybe in future we will have a better way
to advertise this optimised limit.

In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.

For RT inodes, just limit to 1x block, even though larger sizes can be
supported in future.
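
The limit calculation can be tried out in userspace with the sketch
below. It follows the same idea as the patch (largest power-of-two
factor of the AG size whose byte size still fits in an unsigned 32-bit
value, then capped by the bdev limit), but compute_awu_max_blocks() and
atomic_unit_max_bytes() are hypothetical names, and the reflink and RT
special cases are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest power-of-two block count that divides @agsize_blocks evenly
 * and whose size in bytes still fits in a 32-bit unsigned value.
 */
static uint32_t compute_awu_max_blocks(uint32_t agsize_blocks,
				       uint32_t block_size)
{
	uint64_t awu_max = 1;

	assert(agsize_blocks > 0 && block_size > 0);

	for (;;) {
		uint64_t next = awu_max * 2;

		if (agsize_blocks % next)
			break;
		if (next * block_size > UINT32_MAX)
			break;
		awu_max = next;
	}
	return (uint32_t)awu_max;
}

/* Advertised max: the smaller of the fs limit and the bdev limit, in bytes. */
static uint32_t atomic_unit_max_bytes(uint32_t awu_max_blocks,
				      uint32_t block_size,
				      uint32_t bdev_awu_max_bytes)
{
	uint64_t fs_limit = (uint64_t)awu_max_blocks * block_size;

	return fs_limit < bdev_awu_max_bytes ? (uint32_t)fs_limit
					     : bdev_awu_max_bytes;
}
```

For example, an AG of 48 blocks (16 x 3) gives a limit of 16 blocks,
while an AG with an odd block count gives 1, i.e. no benefit over the
single-block case.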

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c  | 13 ++++++++++++-
 fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h |  1 +
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ea79fb246e33..d0a537696514 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -606,12 +606,23 @@ xfs_get_atomic_write_attr(
 	unsigned int		*unit_min,
 	unsigned int		*unit_max)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct xfs_mount	*mp = ip->i_mount;
+
 	if (!xfs_inode_can_atomicwrite(ip)) {
 		*unit_min = *unit_max = 0;
 		return;
 	}
 
-	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+	*unit_min = ip->i_mount->m_sb.sb_blocksize;
+
+	if (XFS_IS_REALTIME_INODE(ip)) {
+		/* For now, set limit at 1x block */
+		*unit_max = ip->i_mount->m_sb.sb_blocksize;
+	} else {
+		*unit_max =  min_t(unsigned int, XFS_FSB_TO_B(mp, mp->awu_max),
+					target->bt_bdev_awu_max);
+	}
 }
 
 static void
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b69356582b86..c6fabf7ac506 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -648,6 +648,32 @@ xfs_agbtree_compute_maxlevels(
 	levels = max(levels, mp->m_rmap_maxlevels);
 	mp->m_agbtree_maxlevels = max(levels, mp->m_refc_maxlevels);
 }
+static inline void
+xfs_compute_awu_max(
+	struct xfs_mount	*mp)
+{
+	xfs_agblock_t		agsize = mp->m_sb.sb_agblocks;
+	xfs_agblock_t		awu_max;
+
+	if (!xfs_has_reflink(mp)) {
+		mp->awu_max = 1;
+		return;
+	}
+
+	/*
+	 * Find highest power-of-2 evenly divisible into agsize and which
+	 * also fits into an unsigned int field.
+	 */
+	awu_max = 1;
+	while (1) {
+		if (agsize % (awu_max * 2))
+			break;
+		if (XFS_FSB_TO_B(mp, awu_max * 2) > UINT_MAX)
+			break;
+		awu_max *= 2;
+	}
+	mp->awu_max = awu_max;
+}
 
 /* Compute maximum possible height for realtime btree types for this fs. */
 static inline void
@@ -733,6 +759,8 @@ xfs_mountfs(
 	xfs_agbtree_compute_maxlevels(mp);
 	xfs_rtbtree_compute_maxlevels(mp);
 
+	xfs_compute_awu_max(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index fbed172d6770..bc96b8214173 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -198,6 +198,7 @@ typedef struct xfs_mount {
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	bool			m_update_sb;	/* sb needs update in mount */
+	xfs_extlen_t		awu_max;	/* data device max atomic write */
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
-- 
2.31.1



* [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (10 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 11/12] xfs: Update atomic write max size John Garry
@ 2025-03-03 17:11 ` John Garry
  2025-03-09 22:03   ` Dave Chinner
  2025-03-06  8:47 ` (subset) [PATCH v4 00/12] large atomic writes for xfs with CoW Christian Brauner
  12 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-03 17:11 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, ritesh.list,
	martin.petersen, tytso, linux-ext4, John Garry

When issuing an atomic write by the CoW method, give the block allocator a
hint to align to the extszhint.

This means that we have a better chance of issuing the atomic write via
HW offload next time.

It does mean that the inode extszhint should be set appropriately for the
expected atomic write size.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
 fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
 fs/xfs/xfs_reflink.c     | 8 ++++++--
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0ef19f1469ec..9bfdfb7cdcae 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)
+		args->alignment = align;
+	else
+		args->alignment = 1;
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
@@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
 		.wasdel		= ap->wasdel,
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
-		.alignment	= 1,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 4b721d935994..e6baa81e20d8 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -87,6 +87,9 @@ struct xfs_bmalloca {
 /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
 #define XFS_BMAPI_NORMAP	(1u << 10)
 
+/* Try to align allocations to the extent size hint */
+#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
+
 #define XFS_BMAPI_FLAGS \
 	{ XFS_BMAPI_ENTIRE,	"ENTIRE" }, \
 	{ XFS_BMAPI_METADATA,	"METADATA" }, \
@@ -98,7 +101,8 @@ struct xfs_bmalloca {
 	{ XFS_BMAPI_REMAP,	"REMAP" }, \
 	{ XFS_BMAPI_COWFORK,	"COWFORK" }, \
 	{ XFS_BMAPI_NODISCARD,	"NODISCARD" }, \
-	{ XFS_BMAPI_NORMAP,	"NORMAP" }
+	{ XFS_BMAPI_NORMAP,	"NORMAP" },\
+	{ XFS_BMAPI_EXTSZALIGN,	"EXTSZALIGN" }
 
 
 static inline int xfs_bmapi_aflag(int w)
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 844e2b43357b..72fb60df9a53 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -445,6 +445,11 @@ xfs_reflink_fill_cow_hole(
 	int			error;
 	bool			found;
 	bool			atomic_sw = flags & XFS_REFLINK_ATOMIC_SW;
+	uint32_t		bmapi_flags = XFS_BMAPI_COWFORK |
+					      XFS_BMAPI_PREALLOC;
+
+	if (atomic_sw)
+		bmapi_flags |= XFS_BMAPI_EXTSZALIGN;
 
 	resaligned = xfs_aligned_fsb_count(imap->br_startoff,
 		imap->br_blockcount, xfs_get_cowextsz_hint(ip));
@@ -478,8 +483,7 @@ xfs_reflink_fill_cow_hole(
 	/* Allocate the entire reservation as unwritten blocks. */
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
-			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
-			&nimaps);
+			bmapi_flags, 0, cmap, &nimaps);
 	if (error)
 		goto out_trans_cancel;
 
-- 
2.31.1



* Re: [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  2025-03-03 17:11 ` [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
@ 2025-03-05 12:57   ` Carlos Maiolino
  2025-03-06  8:50     ` Christian Brauner
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos Maiolino @ 2025-03-05 12:57 UTC (permalink / raw)
  To: brauner
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

Hi Christian,
On Mon, Mar 03, 2025 at 05:11:10PM +0000, John Garry wrote:
> In future xfs will support a SW-based atomic write, so rename
> IOMAP_ATOMIC -> IOMAP_ATOMIC_HW to be clear which mode is being used.
> 
> Also relocate setting of IOMAP_ATOMIC_HW to the write path in
> __iomap_dio_rw(), to be clear that this flag is only relevant to writes.
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>

I pushed the patches in this series into this branch:
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git xfs-6.15-atomicwrites

Do you plan to send the iomap patches in this series yourself or is it ok with
you if they go through xfs tree?

Cheers,
Carlos

> ---
>  Documentation/filesystems/iomap/operations.rst |  4 ++--
>  fs/ext4/inode.c                                |  2 +-
>  fs/iomap/direct-io.c                           | 18 +++++++++---------
>  fs/iomap/trace.h                               |  2 +-
>  include/linux/iomap.h                          |  2 +-
>  5 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> index d1535109587a..0b9d7be23bce 100644
> --- a/Documentation/filesystems/iomap/operations.rst
> +++ b/Documentation/filesystems/iomap/operations.rst
> @@ -514,8 +514,8 @@ IOMAP_WRITE`` with any combination of the following enhancements:
>     if the mapping is unwritten and the filesystem cannot handle zeroing
>     the unaligned regions without exposing stale contents.
> 
> - * ``IOMAP_ATOMIC``: This write is being issued with torn-write
> -   protection.
> + * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
> +   protection based on HW-offload support.
>     Only a single bio can be created for the write, and the write must
>     not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
>     set.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7c54ae5fcbd4..ba2f1e3db7c7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>  		return false;
> 
>  	/* atomic writes are all-or-nothing */
> -	if (flags & IOMAP_ATOMIC)
> +	if (flags & IOMAP_ATOMIC_HW)
>  		return false;
> 
>  	/* can only try again if we wrote nothing */
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index e1e32e2bb0bf..c696ce980796 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -317,7 +317,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   * clearing the WRITE_THROUGH flag in the dio request.
>   */
>  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> -		const struct iomap *iomap, bool use_fua, bool atomic)
> +		const struct iomap *iomap, bool use_fua, bool atomic_hw)
>  {
>  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
> 
> @@ -329,7 +329,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  		opflags |= REQ_FUA;
>  	else
>  		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> -	if (atomic)
> +	if (atomic_hw)
>  		opflags |= REQ_ATOMIC;
> 
>  	return opflags;
> @@ -340,8 +340,8 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
> +	bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
>  	const loff_t length = iomap_length(iter);
> -	bool atomic = iter->flags & IOMAP_ATOMIC;
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
> @@ -351,7 +351,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>  	u64 copied = 0;
>  	size_t orig_count;
> 
> -	if (atomic && length != fs_block_size)
> +	if (atomic_hw && length != fs_block_size)
>  		return -EINVAL;
> 
>  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> @@ -428,7 +428,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>  			goto out;
>  	}
> 
> -	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
> 
>  	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
>  	do {
> @@ -461,7 +461,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>  		}
> 
>  		n = bio->bi_iter.bi_size;
> -		if (WARN_ON_ONCE(atomic && n != length)) {
> +		if (WARN_ON_ONCE(atomic_hw && n != length)) {
>  			/*
>  			 * This bio should have covered the complete length,
>  			 * which it doesn't, so error. We may need to zero out
> @@ -652,9 +652,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
> 
> -	if (iocb->ki_flags & IOCB_ATOMIC)
> -		iomi.flags |= IOMAP_ATOMIC;
> -
>  	if (iov_iter_rw(iter) == READ) {
>  		/* reads can always complete inline */
>  		dio->flags |= IOMAP_DIO_INLINE_COMP;
> @@ -689,6 +686,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iomi.flags |= IOMAP_OVERWRITE_ONLY;
>  		}
> 
> +		if (iocb->ki_flags & IOCB_ATOMIC)
> +			iomi.flags |= IOMAP_ATOMIC_HW;
> +
>  		/* for data sync or sync, we need sync completion processing */
>  		if (iocb_is_dsync(iocb)) {
>  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index 9eab2c8ac3c5..69af89044ebd 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -99,7 +99,7 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
>  	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> -	{ IOMAP_ATOMIC,		"ATOMIC" }
> +	{ IOMAP_ATOMIC_HW,	"ATOMIC_HW" }
> 
>  #define IOMAP_F_FLAGS_STRINGS \
>  	{ IOMAP_F_NEW,		"NEW" }, \
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index ea29388b2fba..87cd7079aaf3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -189,7 +189,7 @@ struct iomap_folio_ops {
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> -#define IOMAP_ATOMIC		(1 << 9)
> +#define IOMAP_ATOMIC_HW		(1 << 9)
>  #define IOMAP_DONTCACHE		(1 << 10)
> 
>  struct iomap_ops {
> --
> 2.31.1
> 


* Re: (subset) [PATCH v4 00/12] large atomic writes for xfs with CoW
  2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
                   ` (11 preceding siblings ...)
  2025-03-03 17:11 ` [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-03-06  8:47 ` Christian Brauner
  12 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-06  8:47 UTC (permalink / raw)
  To: djwong, cem, John Garry
  Cc: Christian Brauner, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, 03 Mar 2025 17:11:08 +0000, John Garry wrote:
> Currently atomic write support for xfs is limited to writing a single
> block as we have no way to guarantee alignment and that the write covers
> a single extent.
> 
> This series introduces a method to issue atomic writes via a software
> emulated method.
> 
> [...]

Applied to the vfs-6.15.iomap branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.iomap branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.iomap

[02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
        https://git.kernel.org/vfs/vfs/c/af97c9498b28
[05/12] iomap: Support SW-based atomic writes
        https://git.kernel.org/vfs/vfs/c/e5708b92d9bf
[06/12] iomap: Lift blocksize restriction on atomic writes
        https://git.kernel.org/vfs/vfs/c/2ebcf55ea0c6


* Re: [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW
  2025-03-05 12:57   ` Carlos Maiolino
@ 2025-03-06  8:50     ` Christian Brauner
  0 siblings, 0 replies; 32+ messages in thread
From: Christian Brauner @ 2025-03-06  8:50 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On Wed, Mar 05, 2025 at 01:57:26PM +0100, Carlos Maiolino wrote:
> Hi Christian,
> On Mon, Mar 03, 2025 at 05:11:10PM +0000, John Garry wrote:
> > In future xfs will support a SW-based atomic write, so rename
> > IOMAP_ATOMIC -> IOMAP_ATOMIC_HW to be clear which mode is being used.
> > 
> > Also relocate setting of IOMAP_ATOMIC_HW to the write path in
> > __iomap_dio_rw(), to be clear that this flag is only relevant to writes.
> > 
> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> 
> I pushed the patches in this series into this branch:
> git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git xfs-6.15-atomicwrites
> 
> Do you plan to send the iomap patches in this series yourself or is it ok with
> you if they go through xfs tree?

Ok, so this will have merge conflicts with vfs-6.15.iomap. I put the
preliminary iomap patches of this series onto vfs-6.15.iomap.

Please simply pull vfs-6.15.iomap instead of vfs-6.15.shared.iomap.
Nothing changes for you as everything has been kept stable. There'll be
no merge conflicts for us afterwards.

Thanks for the ping!

> 
> Cheers,
> Carlos
> 
> > ---
> >  Documentation/filesystems/iomap/operations.rst |  4 ++--
> >  fs/ext4/inode.c                                |  2 +-
> >  fs/iomap/direct-io.c                           | 18 +++++++++---------
> >  fs/iomap/trace.h                               |  2 +-
> >  include/linux/iomap.h                          |  2 +-
> >  5 files changed, 14 insertions(+), 14 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
> > index d1535109587a..0b9d7be23bce 100644
> > --- a/Documentation/filesystems/iomap/operations.rst
> > +++ b/Documentation/filesystems/iomap/operations.rst
> > @@ -514,8 +514,8 @@ IOMAP_WRITE`` with any combination of the following enhancements:
> >     if the mapping is unwritten and the filesystem cannot handle zeroing
> >     the unaligned regions without exposing stale contents.
> > 
> > - * ``IOMAP_ATOMIC``: This write is being issued with torn-write
> > -   protection.
> > + * ``IOMAP_ATOMIC_HW``: This write is being issued with torn-write
> > +   protection based on HW-offload support.
> >     Only a single bio can be created for the write, and the write must
> >     not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
> >     set.
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 7c54ae5fcbd4..ba2f1e3db7c7 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3467,7 +3467,7 @@ static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> >  		return false;
> > 
> >  	/* atomic writes are all-or-nothing */
> > -	if (flags & IOMAP_ATOMIC)
> > +	if (flags & IOMAP_ATOMIC_HW)
> >  		return false;
> > 
> >  	/* can only try again if we wrote nothing */
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index e1e32e2bb0bf..c696ce980796 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -317,7 +317,7 @@ static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> >   * clearing the WRITE_THROUGH flag in the dio request.
> >   */
> >  static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > -		const struct iomap *iomap, bool use_fua, bool atomic)
> > +		const struct iomap *iomap, bool use_fua, bool atomic_hw)
> >  {
> >  	blk_opf_t opflags = REQ_SYNC | REQ_IDLE;
> > 
> > @@ -329,7 +329,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> >  		opflags |= REQ_FUA;
> >  	else
> >  		dio->flags &= ~IOMAP_DIO_WRITE_THROUGH;
> > -	if (atomic)
> > +	if (atomic_hw)
> >  		opflags |= REQ_ATOMIC;
> > 
> >  	return opflags;
> > @@ -340,8 +340,8 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> >  	const struct iomap *iomap = &iter->iomap;
> >  	struct inode *inode = iter->inode;
> >  	unsigned int fs_block_size = i_blocksize(inode), pad;
> > +	bool atomic_hw = iter->flags & IOMAP_ATOMIC_HW;
> >  	const loff_t length = iomap_length(iter);
> > -	bool atomic = iter->flags & IOMAP_ATOMIC;
> >  	loff_t pos = iter->pos;
> >  	blk_opf_t bio_opf;
> >  	struct bio *bio;
> > @@ -351,7 +351,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> >  	u64 copied = 0;
> >  	size_t orig_count;
> > 
> > -	if (atomic && length != fs_block_size)
> > +	if (atomic_hw && length != fs_block_size)
> >  		return -EINVAL;
> > 
> >  	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > @@ -428,7 +428,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> >  			goto out;
> >  	}
> > 
> > -	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> > +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic_hw);
> > 
> >  	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
> >  	do {
> > @@ -461,7 +461,7 @@ static int iomap_dio_bio_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> >  		}
> > 
> >  		n = bio->bi_iter.bi_size;
> > -		if (WARN_ON_ONCE(atomic && n != length)) {
> > +		if (WARN_ON_ONCE(atomic_hw && n != length)) {
> >  			/*
> >  			 * This bio should have covered the complete length,
> >  			 * which it doesn't, so error. We may need to zero out
> > @@ -652,9 +652,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (iocb->ki_flags & IOCB_NOWAIT)
> >  		iomi.flags |= IOMAP_NOWAIT;
> > 
> > -	if (iocb->ki_flags & IOCB_ATOMIC)
> > -		iomi.flags |= IOMAP_ATOMIC;
> > -
> >  	if (iov_iter_rw(iter) == READ) {
> >  		/* reads can always complete inline */
> >  		dio->flags |= IOMAP_DIO_INLINE_COMP;
> > @@ -689,6 +686,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  			iomi.flags |= IOMAP_OVERWRITE_ONLY;
> >  		}
> > 
> > +		if (iocb->ki_flags & IOCB_ATOMIC)
> > +			iomi.flags |= IOMAP_ATOMIC_HW;
> > +
> >  		/* for data sync or sync, we need sync completion processing */
> >  		if (iocb_is_dsync(iocb)) {
> >  			dio->flags |= IOMAP_DIO_NEED_SYNC;
> > diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> > index 9eab2c8ac3c5..69af89044ebd 100644
> > --- a/fs/iomap/trace.h
> > +++ b/fs/iomap/trace.h
> > @@ -99,7 +99,7 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
> >  	{ IOMAP_FAULT,		"FAULT" }, \
> >  	{ IOMAP_DIRECT,		"DIRECT" }, \
> >  	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> > -	{ IOMAP_ATOMIC,		"ATOMIC" }
> > +	{ IOMAP_ATOMIC_HW,	"ATOMIC_HW" }
> > 
> >  #define IOMAP_F_FLAGS_STRINGS \
> >  	{ IOMAP_F_NEW,		"NEW" }, \
> > diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> > index ea29388b2fba..87cd7079aaf3 100644
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -189,7 +189,7 @@ struct iomap_folio_ops {
> >  #else
> >  #define IOMAP_DAX		0
> >  #endif /* CONFIG_FS_DAX */
> > -#define IOMAP_ATOMIC		(1 << 9)
> > +#define IOMAP_ATOMIC_HW		(1 << 9)
> >  #define IOMAP_DONTCACHE		(1 << 10)
> > 
> >  struct iomap_ops {
> > --
> > 2.31.1
> > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 05/12] iomap: Support SW-based atomic writes
  2025-03-03 17:11 ` [PATCH v4 05/12] iomap: Support SW-based atomic writes John Garry
@ 2025-03-09 21:51   ` Dave Chinner
  2025-03-10 10:44     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2025-03-09 21:51 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
> Currently atomic write support requires dedicated HW support. This imposes
> a restriction on the filesystem that disk blocks need to be aligned and
> contiguously mapped to FS blocks to issue atomic writes.
> 
> XFS has no method to guarantee FS block alignment for regular,
> non-RT files. As such, atomic writes are currently limited to 1x FS block
> there.
> 
> To deal with the scenario that we are issuing an atomic write over
> misaligned or discontiguous data blocks - and raise the atomic write size
> limit - support a SW-based software emulated atomic write mode. For XFS,
> this SW-based atomic writes would use CoW support to issue emulated untorn
> writes.
> 
> It is the responsibility of the FS to detect discontiguous atomic writes
> and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
> SW-based atomic writes could be used always when the mounted bdev does
> not support HW offload, but this strategy is not initially expected to be
> used.

So now seeing how these are to be used, these aren't "hardware"
and "software" atomic IOs. They are block layer vs filesystem atomic
IOs.

We can do atomic IOs in software in the block layer drivers (think
loop or dm-thinp) rather than off-loading to storage hardware.

Hence I think these really need to be named after the layer that
will provide the atomic IO guarantees, because "hw" and "sw" as they
are currently used are not correct, e.g. something like
IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
should be providing the atomic IO constraints and guarantees.

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
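
The layering point raised above can be sketched in a few lines of C. This is
only an illustration under stated assumptions: the enum, the function and its
parameters are hypothetical and do not exist in iomap or XFS, and the
alignment test (start is a multiple of the length) is a simplification. It
shows the decision the commit message describes, where the filesystem falls
back to its own mechanism when the mapping is misaligned or discontiguous.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the layer that provides the untorn-write guarantee. */
enum atomic_provider {
	ATOMIC_BY_BDEV,	/* single REQ_ATOMIC bio; block layer and below guarantee it */
	ATOMIC_BY_FS,	/* filesystem guarantees it, e.g. CoW followed by a remap */
};

/*
 * Hypothetical helper: pick the layer for one atomic write.
 * All parameters are illustrative, not taken from the kernel.
 */
static enum atomic_provider
pick_atomic_provider(bool bdev_atomic_ok, unsigned int nr_extents,
		     uint64_t start_block, uint64_t len_blocks)
{
	bool single_extent = (nr_extents == 1);
	/* Simplified alignment rule: start must be a multiple of the length. */
	bool aligned = len_blocks != 0 && (start_block % len_blocks) == 0;

	if (bdev_atomic_ok && single_extent && aligned)
		return ATOMIC_BY_BDEV;
	return ATOMIC_BY_FS;
}
```

Naming the result after the layer, rather than "hw" or "sw", keeps the sketch
valid even when the block device implements the guarantee in software.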

* Re: [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint
  2025-03-03 17:11 ` [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint John Garry
@ 2025-03-09 22:03   ` Dave Chinner
  2025-03-10 12:10     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2025-03-09 22:03 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, Mar 03, 2025 at 05:11:20PM +0000, John Garry wrote:
> When issuing an atomic write by the CoW method, give the block allocator a
> hint to align to the extszhint.
> 
> This means that we have a better chance of issuing the atomic write via
> HW offload next time.
> 
> It does mean that the inode extszhint should be set appropriately for the
> expected atomic write size.
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>  fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>  fs/xfs/xfs_reflink.c     | 8 ++++++--
>  3 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 0ef19f1469ec..9bfdfb7cdcae 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)

needs () around the & logic.

	if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))

> +		args->alignment = align;
> +	else
> +		args->alignment = 1;

When is args->alignment not already initialised to 1?

> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
>  		.wasdel		= ap->wasdel,
>  		.resv		= XFS_AG_RESV_NONE,
>  		.datatype	= ap->datatype,
> -		.alignment	= 1,
>  		.minalignslop	= 0,
>  	};

Oh, you removed the initialisation to 1, so now we have the
possibility of getting args->alignment = 0 anywhere in the
allocation stack?

FWIW, we've been trying to get rid of that case - args->alignment should
always be 1 if no alignment is necessary so we don't have to special
case alignment of 0 (meaning no alignment) anywhere. This seems
like a step backwards from that perspective...



>  	xfs_fileoff_t		orig_offset;
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 4b721d935994..e6baa81e20d8 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>  /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>  #define XFS_BMAPI_NORMAP	(1u << 10)
>  
> +/* Try to align allocations to the extent size hint */
> +#define XFS_BMAPI_EXTSZALIGN	(1u << 11)

Don't we already do that?

Or is this doing something subtle and non-obvious like overriding
stripe width alignment for large atomic writes?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread
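
A small standalone sketch of the two review points above, assuming a
hypothetical flag macro and helper names (none of these are kernel
identifiers; only the bit value mirrors the posted XFS_BMAPI_EXTSZALIGN).
In C, bitwise & binds tighter than &&, so the posted and the parenthesised
conditions evaluate identically and the parentheses mainly help the reader.
The helpers also never return 0, matching the rule that an alignment of 1
means "no alignment".

```c
#include <stdint.h>

/* Hypothetical stand-in for XFS_BMAPI_EXTSZALIGN (same bit as in the patch). */
#define EXTSZALIGN_FLAG	(1u << 11)

/* Condition as posted: parsed as (align > 1) && (flags & EXTSZALIGN_FLAG). */
static uint32_t pick_alignment_posted(uint32_t align, uint32_t flags)
{
	if (align > 1 && flags & EXTSZALIGN_FLAG)
		return align;
	return 1;	/* 1, never 0, means "no alignment required" */
}

/* Condition as requested in review: explicit parentheses, same result. */
static uint32_t pick_alignment_reviewed(uint32_t align, uint32_t flags)
{
	if (align > 1 && (flags & EXTSZALIGN_FLAG))
		return align;
	return 1;
}
```

Both helpers return the hint only when it is larger than 1 and the flag is
set; every other input yields 1.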

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-03 17:11 ` [PATCH v4 11/12] xfs: Update atomic write max size John Garry
@ 2025-03-10 10:06   ` Carlos Maiolino
  2025-03-10 10:54     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos Maiolino @ 2025-03-10 10:06 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

Hi John.

On Mon, Mar 03, 2025 at 05:11:19PM +0000, John Garry wrote:
> Now that CoW-based atomic writes are supported, update the max size of an
> atomic write.
> 
> For simplicity, limit at the max of what the mounted bdev can support in
> terms of atomic write limits. Maybe in future we will have a better way
> to advertise this optimised limit.
> 
> In addition, the max atomic write size needs to be aligned to the agsize.
> Limit the size of atomic writes to the greatest power-of-two factor of the
> agsize so that allocations for an atomic write will always be aligned
> compatibly with the alignment requirements of the storage.
> 
> For RT inode, just limit to 1x block, even though larger can be supported
> in future.
> 
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c  | 13 ++++++++++++-
>  fs/xfs/xfs_mount.c | 28 ++++++++++++++++++++++++++++
>  fs/xfs/xfs_mount.h |  1 +
>  3 files changed, 41 insertions(+), 1 deletion(-)
> 

> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index fbed172d6770..bc96b8214173 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>  	bool			m_fail_unmount;
>  	bool			m_finobt_nores; /* no per-AG finobt resv. */
>  	bool			m_update_sb;	/* sb needs update in mount */
> +	xfs_extlen_t		awu_max;	/* data device max atomic write */

Could you please rename this to something else? All fields within xfs_mount
follow the same pattern m_<name>. Perhaps m_awu_max?

I was going to send a patch replacing it once I had this merged, but given
Dave's new comments, and the conflicts with zoned devices, you'll need to send a
V5, so, please include this change if nobody else has any objections on keeping
the xfs_mount naming convention.

Carlos.

> 
>  	/*
>  	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> --
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread
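
The "greatest power-of-two factor of the agsize" rule from the commit message
above can be worked through with a short sketch. The function names and the
way the two limits are combined (taking the smaller one) are assumptions made
for illustration and are not taken from the posted patch.

```c
#include <stdint.h>

/*
 * Largest power of two that divides n, i.e. the lowest set bit of n.
 * Returns 0 for n == 0.
 */
static uint32_t largest_pow2_factor(uint32_t n)
{
	return n & (~n + 1u);
}

/*
 * Hypothetical combination of the two limits, in filesystem blocks:
 * the AG-derived limit and whatever the block device advertises.
 */
static uint32_t atomic_write_max_blocks(uint32_t agsize_blocks,
					uint32_t bdev_max_blocks)
{
	uint32_t fs_limit = largest_pow2_factor(agsize_blocks);

	return fs_limit < bdev_max_blocks ? fs_limit : bdev_max_blocks;
}
```

For example, an agsize of 48 blocks factors as 16 * 3, so the AG-derived
limit is 16 blocks; an agsize that is itself a power of two is its own limit.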

* Re: [PATCH v4 05/12] iomap: Support SW-based atomic writes
  2025-03-09 21:51   ` Dave Chinner
@ 2025-03-10 10:44     ` John Garry
  2025-03-10 17:21       ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-10 10:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On 09/03/2025 21:51, Dave Chinner wrote:
> On Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
>> Currently atomic write support requires dedicated HW support. This imposes
>> a restriction on the filesystem that disk blocks need to be aligned and
>> contiguously mapped to FS blocks to issue atomic writes.
>>
>> XFS has no method to guarantee FS block alignment for regular,
>> non-RT files. As such, atomic writes are currently limited to 1x FS block
>> there.
>>
>> To deal with the scenario that we are issuing an atomic write over
>> misaligned or discontiguous data blocks - and raise the atomic write size
>> limit - support a SW-based software emulated atomic write mode. For XFS,
>> this SW-based atomic writes would use CoW support to issue emulated untorn
>> writes.
>>
>> It is the responsibility of the FS to detect discontiguous atomic writes
>> and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
>> SW-based atomic writes could be used always when the mounted bdev does
>> not support HW offload, but this strategy is not initially expected to be
>> used.
> So now seeing how these are to be used, these aren't "hardware"
> and "software" atomic IOs. They are block layer vs filesystem atomic
> IOs.
> 
> We can do atomic IOs in software in the block layer drivers (think
> loop or dm-thinp) rather than off-loading to storage hardware.
> 
> Hence I think these really need to be named after the layer that
> will provide the atomic IO guarantees, because "hw" and "sw" as they
> are currently used are not correct, e.g. something like
> IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
> should be providing the atomic IO constraints and guarantees.

I'd prefer IOMAP_REQ_ATOMIC instead (of IOMAP_BDEV_ATOMIC), as we are 
using REQ_ATOMIC for those BIOs only. Anything which the block layer and 
below does with REQ_ATOMIC is its business, as long as it guarantees 
atomic submission. But I am not overly keen on that name, as it clashes 
with block layer names (naturally).

And IOMAP_FS_ATOMIC seems a bit vague, but I can't think of anything else.

Darrick, any opinion on this?

Cheers,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-10 10:06   ` Carlos Maiolino
@ 2025-03-10 10:54     ` John Garry
  2025-03-10 11:11       ` Carlos Maiolino
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-10 10:54 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On 10/03/2025 10:06, Carlos Maiolino wrote:
>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>> index fbed172d6770..bc96b8214173 100644
>> --- a/fs/xfs/xfs_mount.h
>> +++ b/fs/xfs/xfs_mount.h
>> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>>   	bool			m_fail_unmount;
>>   	bool			m_finobt_nores; /* no per-AG finobt resv. */
>>   	bool			m_update_sb;	/* sb needs update in mount */
>> +	xfs_extlen_t		awu_max;	/* data device max atomic write */
> Could you please rename this to something else? All fields within xfs_mount
> follow the same pattern m_<name>. Perhaps m_awu_max?

Fine, but I think I then need to deal with spilling multiple lines to 
accommodate a proper comment.

> 
> I was going to send a patch replacing it once I had this merged, but given
> Dave's new comments, and the conflicts with zoned devices, you'll need to send a
> V5, so, please include this change if nobody else has any objections on keeping
> the xfs_mount naming convention.

What branch do you want me to send this against?

Thanks,
John


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-10 10:54     ` John Garry
@ 2025-03-10 11:11       ` Carlos Maiolino
  2025-03-10 11:20         ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos Maiolino @ 2025-03-10 11:11 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, Mar 10, 2025 at 10:54:23AM +0000, John Garry wrote:
> On 10/03/2025 10:06, Carlos Maiolino wrote:
> >> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> >> index fbed172d6770..bc96b8214173 100644
> >> --- a/fs/xfs/xfs_mount.h
> >> +++ b/fs/xfs/xfs_mount.h
> >> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
> >>   	bool			m_fail_unmount;
> >>   	bool			m_finobt_nores; /* no per-AG finobt resv. */
> >>   	bool			m_update_sb;	/* sb needs update in mount */
> >> +	xfs_extlen_t		awu_max;	/* data device max atomic write */
> > Could you please rename this to something else? All fields within xfs_mount
> > follow the same pattern m_<name>. Perhaps m_awu_max?
> 
> Fine, but I think I then need to deal with spilling multiple lines to
> accommodate a proper comment.
> 
> >
> > I was going to send a patch replacing it once I had this merged, but given
> > Dave's new comments, and the conflicts with zoned devices, you'll need to send a
> > V5, so, please include this change if nobody else has any objections on keeping
> > the xfs_mount naming convention.
> 
> What branch do you want me to send this against?

I just pushed everything to for-next, so you can just rebase it against for-next.

Notice this includes the iomap patches you sent in this series which Christian
picked up. So if you need to re-work something on the iomap patches, you'll
probably need to take this into account.

Cheers.
Carlos

> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-10 11:11       ` Carlos Maiolino
@ 2025-03-10 11:20         ` John Garry
  2025-03-10 12:38           ` Carlos Maiolino
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-10 11:20 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On 10/03/2025 11:11, Carlos Maiolino wrote:
> On Mon, Mar 10, 2025 at 10:54:23AM +0000, John Garry wrote:
>> On 10/03/2025 10:06, Carlos Maiolino wrote:
>>>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>>>> index fbed172d6770..bc96b8214173 100644
>>>> --- a/fs/xfs/xfs_mount.h
>>>> +++ b/fs/xfs/xfs_mount.h
>>>> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>>>>    	bool			m_fail_unmount;
>>>>    	bool			m_finobt_nores; /* no per-AG finobt resv. */
>>>>    	bool			m_update_sb;	/* sb needs update in mount */
>>>> +	xfs_extlen_t		awu_max;	/* data device max atomic write */
>>> Could you please rename this to something else? All fields within xfs_mount
>>> follow the same pattern m_<name>. Perhaps m_awu_max?
>> Fine, but I think I then need to deal with spilling multiple lines to
>> accommodate a proper comment.
>>
>>> I was going to send a patch replacing it once I had this merged, but given
>>> Dave's new comments, and the conflicts with zoned devices, you'll need to send a
>>> V5, so, please include this change if nobody else has any objections on keeping
>>> the xfs_mount naming convention.
>> What branch do you want me to send this against?
> I just pushed everything to for-next, so you can just rebase it against for-next
> 
> Notice this includes the iomap patches you sent in this series which Christian
> picked up. So if you need to re-work something on the iomap patches, you'll
> probably need to take this into account.

Your branch includes the iomap changes, so hard to deal with.

For the iomap change, Dave was suggesting a name change only, so not a 
major issue.

So if we really want to go with a name change, then I could add a patch 
to change the name only and include in the v5.

Review comments are always welcome, but I wish that they did not come so 
late...

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint
  2025-03-09 22:03   ` Dave Chinner
@ 2025-03-10 12:10     ` John Garry
  2025-03-12 19:51       ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-10 12:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On 09/03/2025 22:03, Dave Chinner wrote:
> On Mon, Mar 03, 2025 at 05:11:20PM +0000, John Garry wrote:
>> When issuing an atomic write by the CoW method, give the block allocator a
>> hint to align to the extszhint.
>>
>> This means that we have a better chance of issuing the atomic write via
>> HW offload next time.
>>
>> It does mean that the inode extszhint should be set appropriately for the
>> expected atomic write size.
>>
>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 7 ++++++-
>>   fs/xfs/libxfs/xfs_bmap.h | 6 +++++-
>>   fs/xfs/xfs_reflink.c     | 8 ++++++--
>>   3 files changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 0ef19f1469ec..9bfdfb7cdcae 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3454,6 +3454,12 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	if (align > 1 && ap->flags & XFS_BMAPI_EXTSZALIGN)
> 
> needs () around the & logic.

ok

> 
> 	if (align > 1 && (ap->flags & XFS_BMAPI_EXTSZALIGN))
> 
>> +		args->alignment = align;
>> +	else
>> +		args->alignment = 1;
> 
> When is  args->alignment not already initialised to 1?
> 
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3782,7 +3788,6 @@ xfs_bmap_btalloc(
>>   		.wasdel		= ap->wasdel,
>>   		.resv		= XFS_AG_RESV_NONE,
>>   		.datatype	= ap->datatype,
>> -		.alignment	= 1,
>>   		.minalignslop	= 0,
>>   	};
> 
> Oh, you removed the initialisation to 1, so now we have the
> possibility of getting args->alignment = 0 anywhere in the
> allocation stack?
> 
> FWIW, we've been trying to get rid of that case - args->alignment should
> always be 1 if no alignment is necessary so we don't have to special
> case alignment of 0 (meaning no alignment) anywhere. This seems
> like a step backwards from that perspective...

As I recall, doing this was a suggestion when developing the forcealign 
support (as it had similar logic).

Anyway, I can leave the init to 1 in xfs_bmap_btalloc()

> 
> 
> 
>>   	xfs_fileoff_t		orig_offset;
>> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
>> index 4b721d935994..e6baa81e20d8 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.h
>> +++ b/fs/xfs/libxfs/xfs_bmap.h
>> @@ -87,6 +87,9 @@ struct xfs_bmalloca {
>>   /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
>>   #define XFS_BMAPI_NORMAP	(1u << 10)
>>   
>> +/* Try to align allocations to the extent size hint */
>> +#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
> 
> Don't we already do that?
> 
> Or is this doing something subtle and non-obvious like overriding
> stripe width alignment for large atomic writes?
>

stripe alignment only comes into play for eof allocation.

args->alignment is used in xfs_alloc_compute_aligned() to actually align 
the start bno.

If I don't have this, then we can get this ping-pong effect when 
atomically overwriting the same region:

# dd if=/dev/zero of=mnt/file bs=1M count=10 conv=fsync
# xfs_bmap -vp mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
# /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0525 sec (1.190 MiB/sec and 19.0425 ops/sec)
# xfs_bmap -vp mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000
# /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0524 sec (1.191 MiB/sec and 19.0581 ops/sec)
# xfs_bmap -vp mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
# /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0524 sec (1.191 MiB/sec and 19.0611 ops/sec)
# xfs_bmap -vp mnt/file
mnt/file:
EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
   0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000

We are never getting aligned extents wrt write length, and so have to 
fall back to the SW-based atomic write always. That is not what we want.

Thanks,
John


^ permalink raw reply	[flat|nested] 32+ messages in thread
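
The xfs_bmap output above can be checked by hand with a short sketch. It
assumes that xfs_bmap reports ranges in 512-byte units, so a 64 KiB write
spans 128 units; the helper names are hypothetical and only model the idea
of aligning the start block, not the real xfs_alloc_compute_aligned().

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an extent starting at 'start' is naturally aligned for 'len'. */
static bool start_is_aligned(uint64_t start, uint64_t len)
{
	return len != 0 && start % len == 0;
}

/* Round 'start' up to the next multiple of 'align'; align <= 1 is a no-op. */
static uint64_t round_up_to(uint64_t start, uint64_t align)
{
	if (align <= 1)
		return start;
	return ((start + align - 1) / align) * align;
}
```

Both starting blocks seen in the output, 192 and 20672, leave a remainder of
64 when divided by 128, so neither extent is aligned to the write length.
Rounding 20672 up to a multiple of 128 gives 20736, which would be aligned.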

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-10 11:20         ` John Garry
@ 2025-03-10 12:38           ` Carlos Maiolino
  2025-03-10 12:59             ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Carlos Maiolino @ 2025-03-10 12:38 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, Mar 10, 2025 at 11:20:23AM +0000, John Garry wrote:
> On 10/03/2025 11:11, Carlos Maiolino wrote:
> > On Mon, Mar 10, 2025 at 10:54:23AM +0000, John Garry wrote:
> >> On 10/03/2025 10:06, Carlos Maiolino wrote:
> >>>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> >>>> index fbed172d6770..bc96b8214173 100644
> >>>> --- a/fs/xfs/xfs_mount.h
> >>>> +++ b/fs/xfs/xfs_mount.h
> >>>> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
> >>>>    	bool			m_fail_unmount;
> >>>>    	bool			m_finobt_nores; /* no per-AG finobt resv. */
> >>>>    	bool			m_update_sb;	/* sb needs update in mount */
> >>>> +	xfs_extlen_t		awu_max;	/* data device max atomic write */
> >>> Could you please rename this to something else? All fields within xfs_mount
> >>> follow the same pattern m_<name>. Perhaps m_awu_max?
> >> Fine, but I think I then need to deal with spilling multiple lines to
> >> accommodate a proper comment.
> >>
> >>> I was going to send a patch replacing it once I had this merged, but given
> >>> Dave's new comments, and the conflicts with zoned devices, you'll need to send a
> >>> V5, so, please include this change if nobody else has any objections on keeping
> >>> the xfs_mount naming convention.
> >> What branch do you want me to send this against?
> > I just pushed everything to for-next, so you can just rebase it against for-next
> >
> > Notice this includes the iomap patches you sent in this series which Christian
> > picked up. So if you need to re-work something on the iomap patches, you'll
> > probably need to take this into account.
> 
> Your branch includes the iomap changes, so hard to deal with.

> For the iomap change, Dave was suggesting a name change only, so not a
> major issue.

If you don't plan to change anything related to iomap (depending on the path
the discussion on patch 5/12 takes), I believe all you need to do is remove the
iomap patches from your branch and send only the xfs patches.

> So if we really want to go with a name change, then I could add a patch
> to change the name only and include in the v5.
> 
> Review comments are always welcome, but I wish that they did not come so
> late...

That's why I didn't bother asking you to change xfs_mount until now; I'd do it
myself if you weren't going to send a V5.
But Dave's comments go beyond a mere naming convention; they also ask for a
logic adjustment due to operator precedence.

Carlos

> 
> Thanks,
> John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 11/12] xfs: Update atomic write max size
  2025-03-10 12:38           ` Carlos Maiolino
@ 2025-03-10 12:59             ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-10 12:59 UTC (permalink / raw)
  To: Carlos Maiolino
  Cc: brauner, djwong, linux-xfs, linux-fsdevel, linux-kernel, ojaswin,
	ritesh.list, martin.petersen, tytso, linux-ext4

On 10/03/2025 12:38, Carlos Maiolino wrote:
> On Mon, Mar 10, 2025 at 11:20:23AM +0000, John Garry wrote:
>> On 10/03/2025 11:11, Carlos Maiolino wrote:
>>> On Mon, Mar 10, 2025 at 10:54:23AM +0000, John Garry wrote:
>>>> On 10/03/2025 10:06, Carlos Maiolino wrote:
>>>>>> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
>>>>>> index fbed172d6770..bc96b8214173 100644
>>>>>> --- a/fs/xfs/xfs_mount.h
>>>>>> +++ b/fs/xfs/xfs_mount.h
>>>>>> @@ -198,6 +198,7 @@ typedef struct xfs_mount {
>>>>>>     	bool			m_fail_unmount;
>>>>>>     	bool			m_finobt_nores; /* no per-AG finobt resv. */
>>>>>>     	bool			m_update_sb;	/* sb needs update in mount */
>>>>>> +	xfs_extlen_t		awu_max;	/* data device max atomic write */
>>>>> Could you please rename this to something else? All fields within xfs_mount
>>>>> follows the same pattern m_<name>. Perhaps m_awu_max?
>>>> Fine, but I think I then need to deal with spilling multiple lines to
>>>> accommodate a proper comment.
>>>>
>>>>> I was going to send a patch replacing it once I had this merged, but giving
>>>>> Dave's new comments, and the conflicts with zoned devices, you'll need to send a
>>>>> V5, so, please include this change if nobody else has any objections on keeping
>>>>> the xfs_mount naming convention.
>>>> What branch do you want me to send this against?
>>> I just pushed everything to for-next, so you can just rebase it against for-next
>>>
>>> Notice this includes the iomap patches you sent in this series which Christian
>>> picked up. So if you need to re-work something on the iomap patches, you'll
>>> probably need to take this into account.
>>
>> Your branch includes the iomap changes, so hard to deal with.
> 
>> For the iomap change, Dave was suggesting a name change only, so not a
>> major issue.
> 
> If you don't plan to change anything related to the iomap (depending on the path
> the discussion on patch 5/12 takes), I believe all you need to do is remove the
> iomap patches from your branch, sending only the xfs patches.

Right

> 
>> So if we really want to go with a name change, then I could add a patch
>> to change the name only and include in the v5.
>>
>> Review comments are always welcome, but I wish that they did not come so
>> late...
> 
> That's why I didn't bother asking you to change xfs_mount until now; I'd do it
> myself if you weren't going to send a V5.
> But Dave's comments go beyond a mere naming convention: they call for logic
> adjustments due to operator precedence.
> 

ok, working on that now.

Cheers,
John



* Re: [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic()
  2025-03-03 17:11 ` [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic() John Garry
@ 2025-03-10 13:39   ` Ritesh Harjani
  2025-03-10 15:24     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ritesh Harjani @ 2025-03-10 13:39 UTC (permalink / raw)
  To: John Garry, brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, martin.petersen,
	tytso, linux-ext4, John Garry

John Garry <john.g.garry@oracle.com> writes:

> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
>
> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
> in CoW-based atomic write mode.
>
> For CoW-based mode, ensure that we have no outstanding IOs which we
> may trample on.
>
> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 51b4a43d15f3..70eb6928cf63 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
>  	return ret;
>  }
>  
> +static noinline ssize_t
> +xfs_file_dio_write_atomic(
> +	struct xfs_inode	*ip,
> +	struct kiocb		*iocb,
> +	struct iov_iter		*from)
> +{
> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> +	unsigned int		dio_flags = 0;
> +	ssize_t			ret;
> +
> +retry:
> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> +	if (ret)
> +		return ret;
> +
> +	ret = xfs_file_write_checks(iocb, from, &iolock);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
> +		inode_dio_wait(VFS_I(ip));
> +
> +	trace_xfs_file_direct_write(iocb, from);
> +	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
> +
> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
> +	    !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
> +		xfs_iunlock(ip, iolock);
> +		dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
> +		iolock = XFS_IOLOCK_EXCL;
> +		goto retry;
> +	}

IIUC typically filesystems can now implement support for IOMAP_ATOMIC_SW
as a fallback mechanism, by returning -EAGAIN error during
IOMAP_ATOMIC_HW handling from their ->iomap_begin() routine.  They can
then retry the entire DIO operation of iomap_dio_rw() by passing
IOMAP_DIO_ATOMIC_SW flag in their dio_flags argument and handle
IOMAP_ATOMIC_SW fallback differently in their ->iomap_begin() routine.

However, -EAGAIN can also be returned when there is a race with mmap
writes for the same range. We return the same -EAGAIN error during page
cache invalidation failure for IOCB_ATOMIC writes too.  However, current
code does not differentiate between these two types of failures. Therefore,
we always retry by falling back to SW CoW based atomic write even for
page cache invalidation failures.

__iomap_dio_rw()
{
<...>
		/*
		 * Try to invalidate cache pages for the range we are writing.
		 * If this invalidation fails, let the caller fall back to
		 * buffered I/O.
		 */
		ret = kiocb_invalidate_pages(iocb, iomi.len);
		if (ret) {
			if (ret != -EAGAIN) {
				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
								iomi.len);
				if (iocb->ki_flags & IOCB_ATOMIC) {
					/*
					 * folio invalidation failed, maybe
					 * this is transient, unlock and see if
					 * the caller tries again.
					 */
					ret = -EAGAIN;
				} else {
					/* fall back to buffered write */
					ret = -ENOTBLK;
				}
			}
			goto out_free_dio;
		}
<...>
}

It's easy to miss such error handling conditions. If this is something
which was already discussed earlier, then perhaps it is better if we
document this.  BTW - Is this something that we already know of and has
been kept as such intentionally?
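
To make the concern concrete, here is a small userspace model of the
retry decision. It is only a sketch, not kernel code: the model_* names
and the enum are hypothetical, and the only thing taken from the patch
is the shape of the retry check (retry once in SW mode on -EAGAIN
unless IOCB_NOWAIT is set). Both failure sources collapse into -EAGAIN,
so the caller cannot tell them apart.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reasons a single modelled attempt can fail. */
enum model_fail {
	MODEL_OK,		/* attempt succeeds */
	MODEL_NEED_SW,		/* ->iomap_begin() asks for the SW fallback */
	MODEL_INVALIDATE_FAIL,	/* page cache invalidation failed */
};

/* Both failure sources surface as -EAGAIN, as for IOCB_ATOMIC writes. */
static int model_dio_rw(enum model_fail why)
{
	return why == MODEL_OK ? 0 : -EAGAIN;
}

/*
 * Same retry shape as xfs_file_dio_write_atomic() in the patch:
 * on -EAGAIN, and if nowait is not set, retry once in SW (CoW) mode.
 * *used_sw reports whether the SW attempt was made.
 */
static int model_write_atomic(enum model_fail first, enum model_fail second,
			      bool nowait, bool *used_sw)
{
	int ret = model_dio_rw(first);

	*used_sw = false;
	if (ret == -EAGAIN && !nowait) {
		*used_sw = true;
		ret = model_dio_rw(second);
	}
	return ret;
}
```

In this model an invalidation failure on the first attempt takes the SW
path exactly as a request from ->iomap_begin() would.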


-ritesh


> +
> +out_unlock:
> +	if (iolock)
> +		xfs_iunlock(ip, iolock);
> +	return ret;
> +}
> +
>  /*
>   * Handle block unaligned direct I/O writes
>   *
> @@ -723,6 +763,8 @@ xfs_file_dio_write(
>  		return -EINVAL;
>  	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
>  		return xfs_file_dio_write_unaligned(ip, iocb, from);
> +	if (iocb->ki_flags & IOCB_ATOMIC)
> +		return xfs_file_dio_write_atomic(ip, iocb, from);
>  	return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
>  
> -- 
> 2.31.1


* Re: [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic()
  2025-03-10 13:39   ` Ritesh Harjani
@ 2025-03-10 15:24     ` John Garry
  2025-03-10 17:25       ` Darrick J. Wong
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2025-03-10 15:24 UTC (permalink / raw)
  To: Ritesh Harjani (IBM), brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, linux-kernel, ojaswin, martin.petersen,
	tytso, linux-ext4

On 10/03/2025 13:39, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
>> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
>>
>> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
>> in CoW-based atomic write mode.
>>
>> For CoW-based mode, ensure that we have no outstanding IOs which we
>> may trample on.
>>
>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 51b4a43d15f3..70eb6928cf63 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
>>   	return ret;
>>   }
>>   
>> +static noinline ssize_t
>> +xfs_file_dio_write_atomic(
>> +	struct xfs_inode	*ip,
>> +	struct kiocb		*iocb,
>> +	struct iov_iter		*from)
>> +{
>> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
>> +	unsigned int		dio_flags = 0;
>> +	ssize_t			ret;
>> +
>> +retry:
>> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = xfs_file_write_checks(iocb, from, &iolock);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
>> +		inode_dio_wait(VFS_I(ip));
>> +
>> +	trace_xfs_file_direct_write(iocb, from);
>> +	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
>> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
>> +
>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>> +	    !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
>> +		xfs_iunlock(ip, iolock);
>> +		dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
>> +		iolock = XFS_IOLOCK_EXCL;
>> +		goto retry;
>> +	}
> 
> IIUC typically filesystems can now implement support for IOMAP_ATOMIC_SW
> as a fallback mechanism, by returning -EAGAIN error during
> IOMAP_ATOMIC_HW handling from their ->iomap_begin() routine.  They can
> then retry the entire DIO operation of iomap_dio_rw() by passing
> IOMAP_DIO_ATOMIC_SW flag in their dio_flags argument and handle
> IOMAP_ATOMIC_SW fallback differently in their ->iomap_begin() routine.
> 
> However, -EAGAIN can also be returned when there is a race with mmap
> writes for the same range. We return the same -EAGAIN error during page
> cache invalidation failure for IOCB_ATOMIC writes too.  However, current
> code does not differentiate between these two types of failures. Therefore,
> we always retry by falling back to SW CoW based atomic write even for
> page cache invalidation failures.
> 
> __iomap_dio_rw()
> {
> <...>
> 		/*
> 		 * Try to invalidate cache pages for the range we are writing.
> 		 * If this invalidation fails, let the caller fall back to
> 		 * buffered I/O.
> 		 */
> 		ret = kiocb_invalidate_pages(iocb, iomi.len);
> 		if (ret) {
> 			if (ret != -EAGAIN) {
> 				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> 								iomi.len);
> 				if (iocb->ki_flags & IOCB_ATOMIC) {
> 					/*
> 					 * folio invalidation failed, maybe
> 					 * this is transient, unlock and see if
> 					 * the caller tries again.
> 					 */
> 					ret = -EAGAIN;
> 				} else {
> 					/* fall back to buffered write */
> 					ret = -ENOTBLK;
> 				}
> 			}
> 			goto out_free_dio;
> 		}
> <...>
> }
> 
> It's easy to miss such error handling conditions. If this is something
> which was already discussed earlier, then perhaps it is better if we
> document this.  BTW - Is this something that we already know of and has
> been kept as such intentionally?
> 

On mainline, for kiocb_invalidate_pages() error for IOCB_ATOMIC, we 
always return -EAGAIN to userspace.

Now if we have any kiocb_invalidate_pages() error for IOCB_ATOMIC, we 
retry with SW CoW mode - and if it fails again, we return -EAGAIN to 
userspace.

If we choose some other error code to trigger the SW-based COW retry (so 
that we don't always retry for kiocb_invalidate_pages() error when 
!IOMAP_DIO_ATOMIC_HW), then kiocb_invalidate_pages() could still return 
that same error code and we still retry in SW-based COW mode - is that 
better? Or do we need to choose some error code which 
kiocb_invalidate_pages() would never return?

Note that -EAGAIN is used by xfs_file_dio_write_unaligned(), so it would
be nice to use the same error code.
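
For the "some other error code" option, a rough userspace sketch of what
that could look like follows. Everything here is an assumption for
illustration: MODEL_ERR_NEED_SW is a made-up stand-in (no code has been
proposed in this thread), and the model_* helpers are hypothetical. The
first helper models only the IOCB_ATOMIC branch of the __iomap_dio_rw()
snippet quoted above, where any invalidation failure other than -EAGAIN
is rewritten to -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dedicated "retry in SW mode" code. */
#define MODEL_ERR_NEED_SW	(-ENOPROTOOPT)

/*
 * Models the IOCB_ATOMIC branch of the quoted snippet: a failed
 * invalidation always reaches the caller as -EAGAIN.
 */
static int model_invalidate_result(int invalidate_ret)
{
	if (invalidate_ret && invalidate_ret != -EAGAIN)
		return -EAGAIN;
	return invalidate_ret;
}

/* Retry in SW mode only for the dedicated code, never for -EAGAIN. */
static bool model_should_retry_sw(int ret, bool nowait, bool already_sw)
{
	return ret == MODEL_ERR_NEED_SW && !nowait && !already_sw;
}
```

Under that reading, the invalidation site can only hand back -EAGAIN for
atomic writes, so a dedicated code would not be confused with it. The
cost is diverging from the -EAGAIN convention mentioned above.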

Thanks,
John


* Re: [PATCH v4 05/12] iomap: Support SW-based atomic writes
  2025-03-10 10:44     ` John Garry
@ 2025-03-10 17:21       ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2025-03-10 17:21 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, brauner, cem, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, ritesh.list, martin.petersen, tytso,
	linux-ext4

On Mon, Mar 10, 2025 at 10:44:47AM +0000, John Garry wrote:
> On 09/03/2025 21:51, Dave Chinner wrote:
> > On Mon, Mar 03, 2025 at 05:11:13PM +0000, John Garry wrote:
> > > Currently atomic write support requires dedicated HW support. This imposes
> > > a restriction on the filesystem that disk blocks need to be aligned and
> > > contiguously mapped to FS blocks to issue atomic writes.
> > > 
> > > XFS has no method to guarantee FS block alignment for regular,
> > > non-RT files. As such, atomic writes are currently limited to 1x FS block
> > > there.
> > > 
> > > To deal with the scenario that we are issuing an atomic write over
> > > misaligned or discontiguous data blocks - and raise the atomic write size
> > > limit - support a SW-based emulated atomic write mode. For XFS, these
> > > SW-based atomic writes would use CoW support to issue emulated untorn
> > > writes.
> > > 
> > > It is the responsibility of the FS to detect discontiguous atomic writes
> > > and switch to IOMAP_DIO_ATOMIC_SW mode and retry the write. Indeed,
> > > SW-based atomic writes could be used always when the mounted bdev does
> > > not support HW offload, but this strategy is not initially expected to be
> > > used.
> > So now seeing how these are to be used, these aren't "hardware"
> > and "software" atomic IOs. They are block layer vs filesystem atomic
> > IOs.
> > 
> > We can do atomic IOs in software in the block layer drivers (think
> > loop or dm-thinp) rather than off-loading to storage hardware.
> > 
> > Hence I think these really need to be named after the layer that
> > will provide the atomic IO guarantees, because "hw" and "sw" as they
> > are currently used are not correct. e.g something like
> > IOMAP_FS_ATOMIC and IOMAP_BDEV_ATOMIC which indicates which layer
> > should be providing the atomic IO constraints and guarantees.
> 
> I'd prefer IOMAP_REQ_ATOMIC instead (of IOMAP_BDEV_ATOMIC), as we are using
> REQ_ATOMIC for those BIOs only. Anything which the block layer and below
> does with REQ_ATOMIC is its business, as long as it guarantees atomic
> submission. But I am not overly keen on that name, as it clashes with block
> layer names (naturally).

I don't like encoding "REQ_ATOMIC" in iomap flags.  If we're changing
the names, they ought to reflect who's making the guarantees:

IOMAP_DIO_BDEV_ATOMIC vs. IOMAP_DIO_FS_ATOMIC.

Not sure why the flags lost the "_DIO" part.

--D

> And IOMAP_FS_ATOMIC seems a bit vague, but I can't think of anything else.
> 
> Darrick, any opinion on this?
> 
> Cheers,
> John
> 


* Re: [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic()
  2025-03-10 15:24     ` John Garry
@ 2025-03-10 17:25       ` Darrick J. Wong
  0 siblings, 0 replies; 32+ messages in thread
From: Darrick J. Wong @ 2025-03-10 17:25 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani (IBM), brauner, cem, linux-xfs, linux-fsdevel,
	linux-kernel, ojaswin, martin.petersen, tytso, linux-ext4

On Mon, Mar 10, 2025 at 03:24:23PM +0000, John Garry wrote:
> On 10/03/2025 13:39, Ritesh Harjani (IBM) wrote:
> > John Garry <john.g.garry@oracle.com> writes:
> > 
> > > Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
> > > 
> > > In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
> > > in CoW-based atomic write mode.
> > > 
> > > For CoW-based mode, ensure that we have no outstanding IOs which we
> > > may trample on.
> > > 
> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 42 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 51b4a43d15f3..70eb6928cf63 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
> > >   	return ret;
> > >   }
> > > +static noinline ssize_t
> > > +xfs_file_dio_write_atomic(
> > > +	struct xfs_inode	*ip,
> > > +	struct kiocb		*iocb,
> > > +	struct iov_iter		*from)
> > > +{
> > > +	unsigned int		iolock = XFS_IOLOCK_SHARED;
> > > +	unsigned int		dio_flags = 0;
> > > +	ssize_t			ret;
> > > +
> > > +retry:
> > > +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	ret = xfs_file_write_checks(iocb, from, &iolock);
> > > +	if (ret)
> > > +		goto out_unlock;
> > > +
> > > +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
> > > +		inode_dio_wait(VFS_I(ip));
> > > +
> > > +	trace_xfs_file_direct_write(iocb, from);
> > > +	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
> > > +			&xfs_dio_write_ops, dio_flags, NULL, 0);
> > > +
> > > +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
> > > +	    !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
> > > +		xfs_iunlock(ip, iolock);
> > > +		dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
> > > +		iolock = XFS_IOLOCK_EXCL;
> > > +		goto retry;
> > > +	}
> > 
> > IIUC typically filesystems can now implement support for IOMAP_ATOMIC_SW
> > as a fallback mechanism, by returning -EAGAIN error during
> > IOMAP_ATOMIC_HW handling from their ->iomap_begin() routine.  They can
> > then retry the entire DIO operation of iomap_dio_rw() by passing
> > IOMAP_DIO_ATOMIC_SW flag in their dio_flags argument and handle
> > IOMAP_ATOMIC_SW fallback differently in their ->iomap_begin() routine.
> > 
> > However, -EAGAIN can also be returned when there is a race with mmap
> > writes for the same range. We return the same -EAGAIN error during page
> > cache invalidation failure for IOCB_ATOMIC writes too.  However, current
> > code does not differentiate between these two types of failures. Therefore,
> > we always retry by falling back to SW CoW based atomic write even for
> > page cache invalidation failures.
> > 
> > __iomap_dio_rw()
> > {
> > <...>
> > 		/*
> > 		 * Try to invalidate cache pages for the range we are writing.
> > 		 * If this invalidation fails, let the caller fall back to
> > 		 * buffered I/O.
> > 		 */
> > 		ret = kiocb_invalidate_pages(iocb, iomi.len);
> > 		if (ret) {
> > 			if (ret != -EAGAIN) {
> > 				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> > 								iomi.len);
> > 				if (iocb->ki_flags & IOCB_ATOMIC) {
> > 					/*
> > 					 * folio invalidation failed, maybe
> > 					 * this is transient, unlock and see if
> > 					 * the caller tries again.
> > 					 */
> > 					ret = -EAGAIN;
> > 				} else {
> > 					/* fall back to buffered write */
> > 					ret = -ENOTBLK;
> > 				}
> > 			}
> > 			goto out_free_dio;
> > 		}
> > <...>
> > }
> > 
> > It's easy to miss such error handling conditions. If this is something
> > which was already discussed earlier, then perhaps it is better if we
> > document this.  BTW - Is this something that we already know of and has
> > been kept as such intentionally?
> > 
> 
> On mainline, for kiocb_invalidate_pages() error for IOCB_ATOMIC, we always
> return -EAGAIN to userspace.
> 
> Now if we have any kiocb_invalidate_pages() error for IOCB_ATOMIC, we retry
> with SW CoW mode - and if it fails again, we return -EAGAIN to userspace.
> 
> If we choose some other error code to trigger the SW-based COW retry (so
> that we don't always retry for kiocb_invalidate_pages() error when
> !IOMAP_DIO_ATOMIC_HW), then kiocb_invalidate_pages() could still return that
> same error code and we still retry in SW-based COW mode - is that better? Or
> do we need to choose some error code which kiocb_invalidate_pages() would
> never return?
> 
> Note that -EAGAIN is used by xfs_file_dio_write_unaligned(), so it would be
> nice to use the same error code.

Frankly I don't see why it's a problem that EAGAIN triggers the software
fallback no matter what tripped that.  Maybe the writer would be ok with
the retry even if it came from an (unlikely) mmap write collision.

--D

> Thanks,
> John
> 


* Re: [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint
  2025-03-10 12:10     ` John Garry
@ 2025-03-12 19:51       ` Dave Chinner
  2025-03-12 21:00         ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2025-03-12 19:51 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On Mon, Mar 10, 2025 at 12:10:44PM +0000, John Garry wrote:
> On 09/03/2025 22:03, Dave Chinner wrote:
> > On Mon, Mar 03, 2025 at 05:11:20PM +0000, John Garry wrote:
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 4b721d935994..e6baa81e20d8 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -87,6 +87,9 @@ struct xfs_bmalloca {
> > >   /* Do not update the rmap btree.  Used for reconstructing bmbt from rmapbt. */
> > >   #define XFS_BMAPI_NORMAP	(1u << 10)
> > > +/* Try to align allocations to the extent size hint */
> > > +#define XFS_BMAPI_EXTSZALIGN	(1u << 11)
> > 
> > Don't we already do that?
> > 
> > Or is this doing something subtle and non-obvious like overriding
> > stripe width alignment for large atomic writes?
> > 
> 
> stripe alignment only comes into play for eof allocation.
> 
> args->alignment is used in xfs_alloc_compute_aligned() to actually align the
> start bno.
> 
> If I don't have this, then we can get this ping-pong effect when overwriting
> atomically the same region:
> 
> # dd if=/dev/zero of=mnt/file bs=1M count=10 conv=fsync
> # xfs_bmap -vp mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> # /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0525 sec (1.190 MiB/sec and 19.0425 ops/sec)
> # xfs_bmap -vp mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
>   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000
> # /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0524 sec (1.191 MiB/sec and 19.0581 ops/sec)
> # xfs_bmap -vp mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..20479]:      192..20671        0 (192..20671)     20480 000000
> # /xfs_io -d -C "pwrite -b 64k -V 1 -A -D 0 64k" mnt/file
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 0.0524 sec (1.191 MiB/sec and 19.0611 ops/sec)
> # xfs_bmap -vp mnt/file
> mnt/file:
> EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>   0: [0..127]:        20672..20799      0 (20672..20799)     128 000000
>   1: [128..20479]:    320..20671        0 (320..20671)     20352 000000
> 
> We are never getting aligned extents wrt write length, and so have to fall
> back to the SW-based atomic write always. That is not what we want.

Please add a comment to explain this where the XFS_BMAPI_EXTSZALIGN
flag is set, because it's not at all obvious what it is doing or why
it is needed from the name of the variable or the implementation.
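
As a sanity check on the numbers in the xfs_bmap output above, here is a
small userspace sketch. It assumes the ranges are in 512-byte units (the
xfs_bmap default) and that "aligned wrt write length" means the extent
start is a multiple of the write size. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* xfs_bmap reports block ranges in 512-byte units by default. */
#define MODEL_SECTOR_SIZE	512u

/*
 * True if an extent starting at start_sector is naturally aligned to
 * write_bytes. write_bytes is assumed to be a non-zero multiple of 512.
 */
static bool model_start_is_aligned(uint64_t start_sector, uint32_t write_bytes)
{
	uint64_t write_sectors = write_bytes / MODEL_SECTOR_SIZE;

	return (start_sector % write_sectors) == 0;
}
```

For a 64k write (128 units), both start blocks seen above, 192 and
20672, leave a remainder of 64, so neither placement is aligned and
every overwrite would take the SW-based path.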

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint
  2025-03-12 19:51       ` Dave Chinner
@ 2025-03-12 21:00         ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2025-03-12 21:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, linux-kernel,
	ojaswin, ritesh.list, martin.petersen, tytso, linux-ext4

On 12/03/2025 19:51, Dave Chinner wrote:
>> We are never getting aligned extents wrt write length, and so have to fall
>> back to the SW-based atomic write always. That is not what we want.
> Please add a comment to explain this where the XFS_BMAPI_EXTSZALIGN
> flag is set, because it's not at all obvious what it is doing or why
> it is needed from the name of the variable or the implementation.

ok, fine. But that holds only as long as we are doing the alignment just
for XFS_BMAPI_EXTSZALIGN (and not always, as Christoph queried).

Thanks,
John


end of thread, other threads:[~2025-03-12 21:01 UTC | newest]

Thread overview: 32+ messages
2025-03-03 17:11 [PATCH v4 00/12] large atomic writes for xfs with CoW John Garry
2025-03-03 17:11 ` [PATCH v4 01/12] xfs: Pass flags to xfs_reflink_allocate_cow() John Garry
2025-03-03 17:11 ` [PATCH v4 02/12] iomap: Rename IOMAP_ATOMIC -> IOMAP_ATOMIC_HW John Garry
2025-03-05 12:57   ` Carlos Maiolino
2025-03-06  8:50     ` Christian Brauner
2025-03-03 17:11 ` [PATCH v4 03/12] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
2025-03-03 17:11 ` [PATCH v4 04/12] xfs: Refactor xfs_reflink_end_cow_extent() John Garry
2025-03-03 17:11 ` [PATCH v4 05/12] iomap: Support SW-based atomic writes John Garry
2025-03-09 21:51   ` Dave Chinner
2025-03-10 10:44     ` John Garry
2025-03-10 17:21       ` Darrick J. Wong
2025-03-03 17:11 ` [PATCH v4 06/12] iomap: Lift blocksize restriction on " John Garry
2025-03-03 17:11 ` [PATCH v4 07/12] xfs: Reflink CoW-based atomic write support John Garry
2025-03-03 17:11 ` [PATCH v4 08/12] xfs: Iomap SW-based " John Garry
2025-03-03 17:11 ` [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic() John Garry
2025-03-10 13:39   ` Ritesh Harjani
2025-03-10 15:24     ` John Garry
2025-03-10 17:25       ` Darrick J. Wong
2025-03-03 17:11 ` [PATCH v4 10/12] xfs: Commit CoW-based atomic writes atomically John Garry
2025-03-03 17:11 ` [PATCH v4 11/12] xfs: Update atomic write max size John Garry
2025-03-10 10:06   ` Carlos Maiolino
2025-03-10 10:54     ` John Garry
2025-03-10 11:11       ` Carlos Maiolino
2025-03-10 11:20         ` John Garry
2025-03-10 12:38           ` Carlos Maiolino
2025-03-10 12:59             ` John Garry
2025-03-03 17:11 ` [PATCH v4 12/12] xfs: Allow block allocator to take an alignment hint John Garry
2025-03-09 22:03   ` Dave Chinner
2025-03-10 12:10     ` John Garry
2025-03-12 19:51       ` Dave Chinner
2025-03-12 21:00         ` John Garry
2025-03-06  8:47 ` (subset) [PATCH v4 00/12] large atomic writes for xfs with CoW Christian Brauner
