public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] large atomic writes for xfs
@ 2024-12-10 12:57 John Garry
  2024-12-10 12:57 ` [PATCH v2 1/7] iomap: Increase iomap_dio_zero() size limit John Garry
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

Currently the atomic write unit min and max are fixed at the FS blocksize
for xfs and ext4.

This series expands support to allow multiple FS blocks to be written
atomically.

To allow multiple blocks to be written atomically, the FS must ensure blocks
are allocated with some alignment and granularity. For xfs, today only
rtvol provides this through rt_extsize. So initial support for large
atomic writes will be for rtvol here. Support can easily be expanded to
regular files through the proposed forcealign feature.

An atomic write which spans mixed unwritten and mapped extents will be
required to have the unwritten extents pre-zeroed, which will be supported
in iomap.

Based on v6.13-rc2.

Patches available at the following:
https://github.com/johnpgarry/linux/tree/atomic-write-large-atomics-v6.13-v2

Changes since v1:
- Add extent zeroing support
- Rebase

John Garry (6):
  iomap: Increase iomap_dio_zero() size limit
  iomap: Add zero unwritten mappings dio support
  xfs: Add extent zeroing support for atomic writes
  xfs: Switch atomic write size check in xfs_file_write_iter()
  xfs: Add RT atomic write unit max to xfs_mount
  xfs: Update xfs_get_atomic_write_attr() for large atomic writes

Ritesh Harjani (IBM) (1):
  iomap: Lift blocksize restriction on atomic writes

 fs/iomap/direct-io.c   | 100 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_sb.c |   3 ++
 fs/xfs/xfs_file.c      |  97 ++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_iops.c      |  21 ++++++++-
 fs/xfs/xfs_iops.h      |   2 +
 fs/xfs/xfs_mount.h     |   1 +
 fs/xfs/xfs_rtalloc.c   |  25 +++++++++++
 fs/xfs/xfs_rtalloc.h   |   4 ++
 include/linux/iomap.h  |   3 ++
 9 files changed, 239 insertions(+), 17 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 1/7] iomap: Increase iomap_dio_zero() size limit
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-10 12:57 ` [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support John Garry
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

Currently iomap_dio_zero() is limited to using a single bio to write up to
64K.

To support atomic writes larger than the FS block size, it may be required
to pre-zero some extents larger than 64K.

To increase the limit, fill each bio up in a loop.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..23fdad16e6a8 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -240,27 +240,35 @@ void iomap_dio_bio_end_io(struct bio *bio)
 EXPORT_SYMBOL_GPL(iomap_dio_bio_end_io);
 
 static int iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
-		loff_t pos, unsigned len)
+		const loff_t pos, const unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
+	unsigned int remaining = len;
+	unsigned int nr_vecs;
 	struct bio *bio;
+	int i;
 
 	if (!len)
 		return 0;
-	/*
-	 * Max block size supported is 64k
-	 */
-	if (WARN_ON_ONCE(len > IOMAP_ZERO_PAGE_SIZE))
+
+	nr_vecs = DIV_ROUND_UP(len, IOMAP_ZERO_PAGE_SIZE);
+	if (WARN_ON_ONCE(nr_vecs > BIO_MAX_VECS))
 		return -EINVAL;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	bio = iomap_dio_alloc_bio(iter, dio, nr_vecs,
+			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, zero_page, len, 0);
+	for (i = 0; i < nr_vecs; i++) {
+		__bio_add_page(bio, zero_page,
+			min(remaining, IOMAP_ZERO_PAGE_SIZE), 0);
+		remaining -= IOMAP_ZERO_PAGE_SIZE;
+	}
+
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 	return 0;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
  2024-12-10 12:57 ` [PATCH v2 1/7] iomap: Increase iomap_dio_zero() size limit John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-11 23:47   ` Darrick J. Wong
  2024-12-10 12:57 ` [PATCH v2 3/7] iomap: Lift blocksize restriction on atomic writes John Garry
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

Atomic writes support requires that only a single bio ever be submitted
for an atomic write.

Furthermore, the atomic write unit min and max limits are currently fixed
at the FS block size.

Once the atomic write unit max limit is lifted, an atomic write may span
mixed unwritten and mapped extents. In this case, due to the iterative
nature of iomap, multiple bios would be produced, which is intolerable.

Add a function to zero unwritten extents in a certain range, which may be
used to ensure that unwritten extents are zeroed prior to issuing of an
atomic write.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c  | 76 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |  3 ++
 2 files changed, 79 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23fdad16e6a8..18c888f0c11f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -805,6 +805,82 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(iomap_dio_rw);
 
+static loff_t
+iomap_dio_zero_unwritten_iter(struct iomap_iter *iter, struct iomap_dio *dio)
+{
+	const struct iomap *iomap = &iter->iomap;
+	loff_t length = iomap_length(iter);
+	loff_t pos = iter->pos;
+
+	if (iomap->type == IOMAP_UNWRITTEN) {
+		int ret;
+
+		dio->flags |= IOMAP_DIO_UNWRITTEN;
+		ret = iomap_dio_zero(iter, dio, pos, length);
+		if (ret)
+			return ret;
+	}
+
+	dio->size += length;
+
+	return length;
+}
+
+ssize_t
+iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct iomap_dio *dio;
+	ssize_t ret;
+	struct iomap_iter iomi = {
+		.inode		= inode,
+		.pos		= iocb->ki_pos,
+		.len		= iov_iter_count(iter),
+		.flags		= IOMAP_WRITE,
+	};
+
+	dio = kzalloc(sizeof(*dio), GFP_KERNEL);
+	if (!dio)
+		return -ENOMEM;
+
+	dio->iocb = iocb;
+	atomic_set(&dio->ref, 1);
+	dio->i_size = i_size_read(inode);
+	dio->dops = dops;
+	dio->submit.waiter = current;
+	dio->wait_for_completion = true;
+
+	inode_dio_begin(inode);
+
+	while ((ret = iomap_iter(&iomi, ops)) > 0)
+		iomi.processed = iomap_dio_zero_unwritten_iter(&iomi, dio);
+
+	if (ret < 0)
+		iomap_dio_set_error(dio, ret);
+
+	if (!atomic_dec_and_test(&dio->ref)) {
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(dio->submit.waiter))
+				break;
+
+			blk_io_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+	}
+
+	if (dops && dops->end_io)
+		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+
+	kfree(dio);
+
+	inode_dio_end(file_inode(iocb->ki_filp));
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_dio_zero_unwritten);
+
 static int __init iomap_dio_init(void)
 {
 	zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5675af6b740c..c2d44b9e446d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -440,6 +440,9 @@ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
+ssize_t iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
+		const struct iomap_ops *ops, const struct iomap_dio_ops *dops);
+
 ssize_t iomap_dio_complete(struct iomap_dio *dio);
 void iomap_dio_bio_end_io(struct bio *bio);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 3/7] iomap: Lift blocksize restriction on atomic writes
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
  2024-12-10 12:57 ` [PATCH v2 1/7] iomap: Increase iomap_dio_zero() size limit John Garry
  2024-12-10 12:57 ` [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-10 12:57 ` [PATCH v2 4/7] xfs: Add extent zeroing support for " John Garry
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

From: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>

Filesystems like ext4 can submit writes in multiples of the blocksize.
But we still can't allow the writes to be split. Hence check whether
iomap_length() is the same as iter->len.

It is the role of the FS to ensure that a single mapping may be created
for an atomic write. The FS will also continue to check size and alignment
legality.

Signed-off-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
jpg: Tweak commit message
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 18c888f0c11f..6510bb5d5a6f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -314,7 +314,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	size_t copied = 0;
 	size_t orig_count;
 
-	if (atomic && length != fs_block_size)
+	if (atomic && length != iter->len)
 		return -EINVAL;
 
 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 4/7] xfs: Add extent zeroing support for atomic writes
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
                   ` (2 preceding siblings ...)
  2024-12-10 12:57 ` [PATCH v2 3/7] iomap: Lift blocksize restriction on atomic writes John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-10 12:57 ` [PATCH v2 5/7] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

An atomic write which spans mixed unwritten and mapped extents would be
rejected. This is one reason why the atomic write unit min and max are
currently fixed at the block size.

To enable large atomic writes, any unwritten extents need to be zeroed
before issuing the atomic write. So call iomap_dio_zero_unwritten() for
this scenario and retry the atomic write.

Any unwritten extents can be detected by passing IOMAP_DIO_OVERWRITE_ONLY
to the original iomap_dio_rw() call.

After iomap_dio_zero_unwritten() is called, iomap_dio_rw() is retried;
if that fails, then there really is something wrong.

Note that this change will result in a latent change of behaviour, in that
atomic writing a single unwritten block will now mean pre-zeroing.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4a0b7de4f7ae..f2b2eb2dee94 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -578,10 +578,47 @@ xfs_dio_write_end_io(
 	return error;
 }
 
+static int
+xfs_dio_write_end_zero_unwritten(
+	struct kiocb		*iocb,
+	ssize_t			size,
+	int			error,
+	unsigned		flags)
+{
+	struct inode		*inode = file_inode(iocb->ki_filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	loff_t			offset = iocb->ki_pos;
+	unsigned int		nofs_flag;
+
+	trace_xfs_end_io_direct_write(ip, offset, size);
+
+	if (xfs_is_shutdown(ip->i_mount))
+		return -EIO;
+
+	if (error)
+		return error;
+	if (WARN_ON_ONCE(!size))
+		return 0;
+	if (!(flags & IOMAP_DIO_UNWRITTEN))
+		return 0;
+
+	/* Same as xfs_dio_write_end_io() ... */
+	nofs_flag = memalloc_nofs_save();
+
+	error = xfs_iomap_write_unwritten(ip, offset, size, true);
+
+	memalloc_nofs_restore(nofs_flag);
+	return error;
+}
+
 static const struct iomap_dio_ops xfs_dio_write_ops = {
 	.end_io		= xfs_dio_write_end_io,
 };
 
+static const struct iomap_dio_ops xfs_dio_zero_ops = {
+	.end_io		= xfs_dio_write_end_zero_unwritten,
+};
+
 /*
  * Handle block aligned direct I/O writes
  */
@@ -619,6 +656,52 @@ xfs_file_dio_write_aligned(
 	return ret;
 }
 
+static noinline ssize_t
+xfs_file_dio_write_atomic(
+	struct xfs_inode	*ip,
+	struct kiocb		*iocb,
+	struct iov_iter		*from)
+{
+	unsigned int		iolock = XFS_IOLOCK_SHARED;
+	bool			do_zero = false;
+	ssize_t			ret;
+
+retry:
+	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
+	if (ret)
+		return ret;
+
+	ret = xfs_file_write_checks(iocb, from, &iolock);
+	if (ret)
+		goto out_unlock;
+
+	if (do_zero) {
+		ret = iomap_dio_zero_unwritten(iocb, from,
+				&xfs_direct_write_iomap_ops,
+				&xfs_dio_zero_ops);
+		if (ret)
+			goto out_unlock;
+	}
+
+	trace_xfs_file_direct_write(iocb, from);
+	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
+			&xfs_dio_write_ops, IOMAP_DIO_OVERWRITE_ONLY, NULL, 0);
+
+	if (do_zero && ret < 0)
+		goto out_unlock;
+
+	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT)) {
+		xfs_iunlock(ip, iolock);
+		do_zero = true;
+		goto retry;
+	}
+
+out_unlock:
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+	return ret;
+}
+
 /*
  * Handle block unaligned direct I/O writes
  *
@@ -723,6 +806,8 @@ xfs_file_dio_write(
 		return -EINVAL;
 	if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
 		return xfs_file_dio_write_unaligned(ip, iocb, from);
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		return xfs_file_dio_write_atomic(ip, iocb, from);
 	return xfs_file_dio_write_aligned(ip, iocb, from);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 5/7] xfs: Switch atomic write size check in xfs_file_write_iter()
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
                   ` (3 preceding siblings ...)
  2024-12-10 12:57 ` [PATCH v2 4/7] xfs: Add extent zeroing support for " John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-10 12:57 ` [PATCH v2 6/7] xfs: Add RT atomic write unit max to xfs_mount John Garry
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

Currently the permitted atomic write size is fixed at the blocksize.

To start to remove this restriction, use xfs_get_atomic_write_attr() to
find the per-inode atomic write limits and check according to that.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 12 +++++-------
 fs/xfs/xfs_iops.c |  2 +-
 fs/xfs/xfs_iops.h |  2 ++
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f2b2eb2dee94..8420c7a9f2a5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -938,14 +938,12 @@ xfs_file_write_iter(
 		return xfs_file_dax_write(iocb, from);
 
 	if (iocb->ki_flags & IOCB_ATOMIC) {
-		/*
-		 * Currently only atomic writing of a single FS block is
-		 * supported. It would be possible to atomic write smaller than
-		 * a FS block, but there is no requirement to support this.
-		 * Note that iomap also does not support this yet.
-		 */
-		if (ocount != ip->i_mount->m_sb.sb_blocksize)
+		unsigned int unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		if (ocount < unit_min || ocount > unit_max)
 			return -EINVAL;
+
 		ret = generic_atomic_write_valid(iocb, from);
 		if (ret)
 			return ret;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 207e0dadffc3..883ec45ae708 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -572,7 +572,7 @@ xfs_stat_blksize(
 	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
-static void
+void
 xfs_get_atomic_write_attr(
 	struct xfs_inode	*ip,
 	unsigned int		*unit_min,
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 3c1a2605ffd2..82d3ffbf7024 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,5 +19,7 @@ int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
+extern void xfs_get_atomic_write_attr(struct xfs_inode	*ip,
+		unsigned int *unit_min, unsigned int *unit_max);
 
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 6/7] xfs: Add RT atomic write unit max to xfs_mount
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
                   ` (4 preceding siblings ...)
  2024-12-10 12:57 ` [PATCH v2 5/7] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-10 12:57 ` [PATCH v2 7/7] xfs: Update xfs_get_atomic_write_attr() for large atomic writes John Garry
  2024-12-13 14:38 ` [PATCH v2 0/7] large atomic writes for xfs Christoph Hellwig
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

rtvol guarantees allocation unit alignment through rt_extsize. As such, it
is possible to atomically write multiple FS blocks in an rtvol (up to
rt_extsize).

Add a member to xfs_mount to hold the pre-calculated atomic write unit max.

The value in rt_extsize does not need to be a power-of-2, so find the
largest power-of-2 that evenly divides rt_extsize.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_sb.c |  3 +++
 fs/xfs/xfs_mount.h     |  1 +
 fs/xfs/xfs_rtalloc.c   | 25 +++++++++++++++++++++++++
 fs/xfs/xfs_rtalloc.h   |  4 ++++
 4 files changed, 33 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a809513a290c..f59d48f6557c 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -25,6 +25,7 @@
 #include "xfs_da_format.h"
 #include "xfs_health.h"
 #include "xfs_ag.h"
+#include "xfs_rtalloc.h"
 #include "xfs_rtbitmap.h"
 #include "xfs_exchrange.h"
 #include "xfs_rtgroup.h"
@@ -1148,6 +1149,8 @@ xfs_sb_mount_rextsize(
 		rgs->blklog = 0;
 		rgs->blkmask = (uint64_t)-1;
 	}
+
+	xfs_rt_awu_update(mp);
 }
 
 /* Update incore sb rt extent size, then recompute the cached rt geometry. */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index db9dade7d22a..8cd161238893 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -191,6 +191,7 @@ typedef struct xfs_mount {
 	bool			m_fail_unmount;
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	bool			m_update_sb;	/* sb needs update in mount */
+	unsigned int            m_rt_awu_max;   /* rt atomic write unit max */
 
 	/*
 	 * Bitsets of per-fs metadata that have been checked and/or are sick.
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 0cb534d71119..3551f09fd2cb 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -735,6 +735,30 @@ xfs_rtginode_ensure(
 	return xfs_rtginode_create(rtg, type, true);
 }
 
+void
+xfs_rt_awu_update(
+	struct xfs_mount	*mp)
+{
+	unsigned int		rsize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
+	unsigned int		awu_max;
+
+	if (is_power_of_2(rsize)) {
+		mp->m_rt_awu_max = rsize;
+		return;
+	}
+
+	/*
+	 * Find highest power-of-2 evenly divisible into sb_rextsize
+	 */
+	awu_max = mp->m_sb.sb_blocksize;
+	while (1) {
+		if (rsize % (awu_max * 2))
+			break;
+		awu_max *= 2;
+	}
+	mp->m_rt_awu_max = awu_max;
+}
+
 static struct xfs_mount *
 xfs_growfs_rt_alloc_fake_mount(
 	const struct xfs_mount	*mp,
@@ -969,6 +993,7 @@ xfs_growfs_rt_bmblock(
 	 */
 	mp->m_rsumlevels = nmp->m_rsumlevels;
 	mp->m_rsumblocks = nmp->m_rsumblocks;
+	mp->m_rt_awu_max = nmp->m_rt_awu_max;
 
 	/*
 	 * Recompute the growfsrt reservation from the new rsumsize.
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 8e2a07b8174b..fcb7bb3df470 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -42,6 +42,10 @@ xfs_growfs_rt(
 	struct xfs_mount	*mp,	/* file system mount structure */
 	xfs_growfs_rt_t		*in);	/* user supplied growfs struct */
 
+void
+xfs_rt_awu_update(
+	struct xfs_mount	*mp);
+
 int xfs_rtalloc_reinit_frextents(struct xfs_mount *mp);
 #else
 # define xfs_growfs_rt(mp,in)				(-ENOSYS)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v2 7/7] xfs: Update xfs_get_atomic_write_attr() for large atomic writes
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
                   ` (5 preceding siblings ...)
  2024-12-10 12:57 ` [PATCH v2 6/7] xfs: Add RT atomic write unit max to xfs_mount John Garry
@ 2024-12-10 12:57 ` John Garry
  2024-12-13 14:38 ` [PATCH v2 0/7] large atomic writes for xfs Christoph Hellwig
  7 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-10 12:57 UTC (permalink / raw)
  To: brauner, djwong, cem, dchinner, hch, ritesh.list
  Cc: linux-xfs, linux-fsdevel, linux-kernel, martin.petersen,
	John Garry

Update xfs_get_atomic_write_attr() to take into account that rtvol can
support atomic writes spanning multiple FS blocks.

For non-rtvol, we are still limited in min and max by the blocksize.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 883ec45ae708..75fb3738cb76 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -572,18 +572,35 @@ xfs_stat_blksize(
 	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
+/* Returns max atomic write unit for a file, in bytes. */
+static unsigned int
+xfs_inode_atomicwrite_max(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		return mp->m_rt_awu_max;
+
+	return mp->m_sb.sb_blocksize;
+}
+
 void
 xfs_get_atomic_write_attr(
 	struct xfs_inode	*ip,
 	unsigned int		*unit_min,
 	unsigned int		*unit_max)
 {
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	unsigned int		awu_max = xfs_inode_atomicwrite_max(ip);
+
 	if (!xfs_inode_can_atomicwrite(ip)) {
 		*unit_min = *unit_max = 0;
 		return;
 	}
 
-	*unit_min = *unit_max = ip->i_mount->m_sb.sb_blocksize;
+	*unit_min = ip->i_mount->m_sb.sb_blocksize;
+	*unit_max =  min(target->bt_bdev_awu_max, awu_max);
 }
 
 STATIC int
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-10 12:57 ` [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support John Garry
@ 2024-12-11 23:47   ` Darrick J. Wong
  2024-12-12 10:40     ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-12-11 23:47 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Tue, Dec 10, 2024 at 12:57:32PM +0000, John Garry wrote:
> For atomic writes support, it is required to only ever submit a single bio
> (for an atomic write).
> 
> Furthermore, currently the atomic write unit min and max limit is fixed at
> the FS block size.
> 
> For lifting the atomic write unit max limit, it may occur that an atomic
> write spans mixed unwritten and mapped extents. For this case, due to the
> iterative nature of iomap, multiple bios would be produced, which is
> intolerable.
> 
> Add a function to zero unwritten extents in a certain range, which may be
> used to ensure that unwritten extents are zeroed prior to issuing of an
> atomic write.

I still dislike this.  IMO block untorn writes _is_ a niche feature for
programs that perform IO in large blocks.  Any program that wants a
general "apply all these updates or none of them" interface should use
XFS_IOC_EXCHANGE_RANGE since it has no awu_max restrictions, can handle
discontiguous update ranges, doesn't require block alignment, etc.

Instead here we are adding a bunch of complexity, and not even all that
well:

> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 76 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/iomap.h |  3 ++
>  2 files changed, 79 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23fdad16e6a8..18c888f0c11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -805,6 +805,82 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  }
>  EXPORT_SYMBOL_GPL(iomap_dio_rw);
>  
> +static loff_t
> +iomap_dio_zero_unwritten_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> +{
> +	const struct iomap *iomap = &iter->iomap;
> +	loff_t length = iomap_length(iter);
> +	loff_t pos = iter->pos;
> +
> +	if (iomap->type == IOMAP_UNWRITTEN) {
> +		int ret;
> +
> +		dio->flags |= IOMAP_DIO_UNWRITTEN;
> +		ret = iomap_dio_zero(iter, dio, pos, length);

Shouldn't this be detecting the particular case that the mapping for the
kiocb is in mixed state and only zeroing in that case?  This just
targets every unwritten extent, even if the unwritten extent covered the
entire range that is being written.  It doesn't handle COW, it doesn't
handle holes, etc.

Also, can you make a version of blkdev_issue_zeroout that returns the
bio so the caller can issue it asynchronously instead of opencoding
the bio_alloc loop in iomap_dio_zero?

> +		if (ret)
> +			return ret;
> +	}
> +
> +	dio->size += length;
> +
> +	return length;
> +}
> +
> +ssize_t
> +iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
> +		const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct iomap_dio *dio;
> +	ssize_t ret;
> +	struct iomap_iter iomi = {
> +		.inode		= inode,
> +		.pos		= iocb->ki_pos,
> +		.len		= iov_iter_count(iter),
> +		.flags		= IOMAP_WRITE,

IOMAP_WRITE | IOMAP_DIRECT, no?

--D

> +	};
> +
> +	dio = kzalloc(sizeof(*dio), GFP_KERNEL);
> +	if (!dio)
> +		return -ENOMEM;
> +
> +	dio->iocb = iocb;
> +	atomic_set(&dio->ref, 1);
> +	dio->i_size = i_size_read(inode);
> +	dio->dops = dops;
> +	dio->submit.waiter = current;
> +	dio->wait_for_completion = true;
> +
> +	inode_dio_begin(inode);
> +
> +	while ((ret = iomap_iter(&iomi, ops)) > 0)
> +		iomi.processed = iomap_dio_zero_unwritten_iter(&iomi, dio);
> +
> +	if (ret < 0)
> +		iomap_dio_set_error(dio, ret);
> +
> +	if (!atomic_dec_and_test(&dio->ref)) {
> +		for (;;) {
> +			set_current_state(TASK_UNINTERRUPTIBLE);
> +			if (!READ_ONCE(dio->submit.waiter))
> +				break;
> +
> +			blk_io_schedule();
> +		}
> +		__set_current_state(TASK_RUNNING);
> +	}
> +
> +	if (dops && dops->end_io)
> +		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
> +
> +	kfree(dio);
> +
> +	inode_dio_end(file_inode(iocb->ki_filp));
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iomap_dio_zero_unwritten);
> +
>  static int __init iomap_dio_init(void)
>  {
>  	zero_page = alloc_pages(GFP_KERNEL | __GFP_ZERO,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..c2d44b9e446d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -440,6 +440,9 @@ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
>  		unsigned int dio_flags, void *private, size_t done_before);
> +ssize_t iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
> +		const struct iomap_ops *ops, const struct iomap_dio_ops *dops);
> +
>  ssize_t iomap_dio_complete(struct iomap_dio *dio);
>  void iomap_dio_bio_end_io(struct bio *bio);
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-11 23:47   ` Darrick J. Wong
@ 2024-12-12 10:40     ` John Garry
  2024-12-12 20:40       ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-12-12 10:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On 11/12/2024 23:47, Darrick J. Wong wrote:
> On Tue, Dec 10, 2024 at 12:57:32PM +0000, John Garry wrote:
>> For atomic writes support, it is required to only ever submit a single bio
>> (for an atomic write).
>>
>> Furthermore, currently the atomic write unit min and max limit is fixed at
>> the FS block size.
>>
>> For lifting the atomic write unit max limit, it may occur that an atomic
>> write spans mixed unwritten and mapped extents. For this case, due to the
>> iterative nature of iomap, multiple bios would be produced, which is
>> intolerable.
>>
>> Add a function to zero unwritten extents in a certain range, which may be
>> used to ensure that unwritten extents are zeroed prior to issuing of an
>> atomic write.
> 
> I still dislike this.  IMO block untorn writes _is_ a niche feature for
> programs that perform IO in large blocks.  Any program that wants a
> general "apply all these updates or none of them" interface should use
> XFS_IOC_EXCHANGE_RANGE since it has no awu_max restrictions, can handle
> discontiguous update ranges, doesn't require block alignment, etc.

That is not a problem which I am trying to solve. Indeed atomic writes 
are for fixed blocks of data and not atomic file updates.

However, I still think that we should be able to atomic write mixed 
extents, even though it is a pain to implement. To that end, I could be 
convinced again that we don't require it...

> 
> Instead here we are adding a bunch of complexity, and not even all that
> well:
> 
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c  | 76 +++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/iomap.h |  3 ++
>>   2 files changed, 79 insertions(+)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 23fdad16e6a8..18c888f0c11f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -805,6 +805,82 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   }
>>   EXPORT_SYMBOL_GPL(iomap_dio_rw);
>>   
>> +static loff_t
>> +iomap_dio_zero_unwritten_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>> +{
>> +	const struct iomap *iomap = &iter->iomap;
>> +	loff_t length = iomap_length(iter);
>> +	loff_t pos = iter->pos;
>> +
>> +	if (iomap->type == IOMAP_UNWRITTEN) {
>> +		int ret;
>> +
>> +		dio->flags |= IOMAP_DIO_UNWRITTEN;
>> +		ret = iomap_dio_zero(iter, dio, pos, length);
> 
> Shouldn't this be detecting the particular case that the mapping for the
> kiocb is in mixed state and only zeroing in that case?  This just
> targets every unwritten extent, even if the unwritten extent covered the
> entire range that is being written. 

Right, so I did touch on this in the final comment in patch 4/7 commit log.

Why did I do it this way? I did not think that it made much difference, 
since this zeroing would generally be a one-off and did not merit even 
more complexity to implement.
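For reference, the decision the iter makes can be modeled in plain
userspace C. This is only an illustrative sketch: the extent
descriptors below are hypothetical stand-ins for struct iomap, not the
real API. It shows the v2 behaviour of zeroing every unwritten extent
in the write range, even when the range is covered by a single
unwritten extent:

```c
#include <assert.h>	/* for the checks exercising the model */
#include <stddef.h>

/* Hypothetical stand-ins for the iomap extent states. */
enum extent_state { EXT_MAPPED, EXT_UNWRITTEN };

struct extent {
	long long pos;		/* byte offset of the extent */
	long long len;		/* byte length */
	enum extent_state state;
};

/*
 * Model of the v2 patch behaviour: every unwritten extent in the
 * write range is zeroed, with no check for whether the range is
 * actually in mixed (mapped + unwritten) state.
 */
static int count_zeroed_extents(const struct extent *map, size_t n)
{
	int zeroed = 0;

	for (size_t i = 0; i < n; i++)
		if (map[i].state == EXT_UNWRITTEN)
			zeroed++;
	return zeroed;
}
```

Note that under this model a range consisting of one fully unwritten
extent is still zeroed first.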

> It doesn't handle COW, it doesn't

Do we want to atomic write COW?

> handle holes, etc.

I did test a hole, and it seemed to work. However, I noticed that for a 
hole region we get the IOMAP_UNWRITTEN type, like:

# xfs_bmap -vvp /root/mnt/file
/root/mnt/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
    0: [0..127]:        192..319          0 (192..319)         128 000000
    1: [128..255]:      hole                                   128
    2: [256..383]:      448..575          0 (448..575)         128 000000
  FLAG Values:
     0100000 Shared extent
     0010000 Unwritten preallocated extent
     0001000 Doesn't begin on stripe unit
     0000100 Doesn't end   on stripe unit
     0000010 Doesn't begin on stripe width
     0000001 Doesn't end   on stripe width
#
#

For an atomic write of length 65536 @ offset 65536.

Any hint on how to create an iomap->type == IOMAP_HOLE?

> 
> Also, can you make a version of blkdev_issue_zeroout that returns the
> bio so the caller can issue them asynchronously instead of opencoding
> the bio_alloc loop in iomap_dev_zero?

ok, fine

> 
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	dio->size += length;
>> +
>> +	return length;
>> +}
>> +
>> +ssize_t
>> +iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
>> +		const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
>> +{
>> +	struct inode *inode = file_inode(iocb->ki_filp);
>> +	struct iomap_dio *dio;
>> +	ssize_t ret;
>> +	struct iomap_iter iomi = {
>> +		.inode		= inode,
>> +		.pos		= iocb->ki_pos,
>> +		.len		= iov_iter_count(iter),
>> +		.flags		= IOMAP_WRITE,
> 
> IOMAP_WRITE | IOMAP_DIRECT, no?

yes, I'll fix that.

And I should also set IOMAP_DIO_WRITE for dio->flags.

Thanks,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-12 10:40     ` John Garry
@ 2024-12-12 20:40       ` Darrick J. Wong
  2024-12-13 10:43         ` John Garry
  2024-12-13 14:47         ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Darrick J. Wong @ 2024-12-12 20:40 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Thu, Dec 12, 2024 at 10:40:15AM +0000, John Garry wrote:
> On 11/12/2024 23:47, Darrick J. Wong wrote:
> > On Tue, Dec 10, 2024 at 12:57:32PM +0000, John Garry wrote:
> > > For atomic writes support, it is required to only ever submit a single bio
> > > (for an atomic write).
> > > 
> > > Furthermore, currently the atomic write unit min and max limit is fixed at
> > > the FS block size.
> > > 
> > > For lifting the atomic write unit max limit, it may occur that an atomic
> > > write spans mixed unwritten and mapped extents. For this case, due to the
> > > iterative nature of iomap, multiple bios would be produced, which is
> > > intolerable.
> > > 
> > > Add a function to zero unwritten extents in a certain range, which may be
> > > used to ensure that unwritten extents are zeroed prior to issuing of an
> > > atomic write.
> > 
> > I still dislike this.  IMO block untorn writes _is_ a niche feature for
> > programs that perform IO in large blocks.  Any program that wants a
> > general "apply all these updates or none of them" interface should use
> > XFS_IOC_EXCHANGE_RANGE since it has no awu_max restrictions, can handle
> > discontiguous update ranges, doesn't require block alignment, etc.
> 
> That is not a problem which I am trying to solve. Indeed atomic writes are
> for fixed blocks of data and not atomic file updates.

Agreed.

> However, I still think that we should be able to atomic write mixed extents,
> even though it is a pain to implement. To that end, I could be convinced
> again that we don't require it...

Well... if you /did/ add a few entries to include/uapi/linux/fs.h for
ways that an untorn write can fail, then we could define the programming
interface as so:

"If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
to force all the mappings to pure overwrites."

...since there have been a few people who have asked about that ability
so that they can write+fdatasync without so much overhead from file
metadata updates.
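As a rough illustration of the programming model being proposed, here
is what the application-side retry loop might look like. Everything
named here is hypothetical: -EBADMAP and FALLOC_FL_MAKE_OVERWRITE do
not exist in today's UAPI, so the sketch stubs the syscalls out rather
than calling anything real:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values -- NOT part of the real UAPI today. */
#define EBADMAP_HYPOTHETICAL		1000
#define FALLOC_FL_MAKE_OVERWRITE	0x100

static bool mapping_is_overwrite;	/* simulated file mapping state */

/* Stub for an untorn write, e.g. pwritev2(..., RWF_ATOMIC). */
static int untorn_write(void)
{
	return mapping_is_overwrite ? 0 : -EBADMAP_HYPOTHETICAL;
}

/* Stub for fallocate(fd, FALLOC_FL_MAKE_OVERWRITE, pos, len). */
static int make_overwrite(int mode)
{
	if (mode & FALLOC_FL_MAKE_OVERWRITE)
		mapping_is_overwrite = true;
	return 0;
}

/*
 * The loop an application would run under the proposed interface:
 * attempt the untorn write, and on -EBADMAP force the mappings to
 * pure overwrites and retry once.
 */
static int write_untorn_with_retry(void)
{
	int ret = untorn_write();

	if (ret == -EBADMAP_HYPOTHETICAL) {
		if (make_overwrite(FALLOC_FL_MAKE_OVERWRITE))
			return -1;
		ret = untorn_write();
	}
	return ret;
}
```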

> > 
> > Instead here we are adding a bunch of complexity, and not even all that
> > well:
> > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/iomap/direct-io.c  | 76 +++++++++++++++++++++++++++++++++++++++++++
> > >   include/linux/iomap.h |  3 ++
> > >   2 files changed, 79 insertions(+)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index 23fdad16e6a8..18c888f0c11f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -805,6 +805,82 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >   }
> > >   EXPORT_SYMBOL_GPL(iomap_dio_rw);
> > > +static loff_t
> > > +iomap_dio_zero_unwritten_iter(struct iomap_iter *iter, struct iomap_dio *dio)
> > > +{
> > > +	const struct iomap *iomap = &iter->iomap;
> > > +	loff_t length = iomap_length(iter);
> > > +	loff_t pos = iter->pos;
> > > +
> > > +	if (iomap->type == IOMAP_UNWRITTEN) {
> > > +		int ret;
> > > +
> > > +		dio->flags |= IOMAP_DIO_UNWRITTEN;
> > > +		ret = iomap_dio_zero(iter, dio, pos, length);
> > 
> > Shouldn't this be detecting the particular case that the mapping for the
> > kiocb is in mixed state and only zeroing in that case?  This just
> > targets every unwritten extent, even if the unwritten extent covered the
> > entire range that is being written.
> 
> Right, so I did touch on this in the final comment in patch 4/7 commit log.
> 
> Why did I do it this way? I did not think that it made much difference, since
> this zeroing would be generally a one-off and did not merit even more
> complexity to implement.

The trouble is, if you fallocate the whole file and then write an
aligned 64k block, this will write zeroes to the block, update the
mapping, and only then issue the untorn write.  Sure that's a one time
performance hit, but probably not a welcome one.
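The distinction being drawn here can be written as a small predicate:
pre-zeroing is only needed when the range is genuinely mixed, i.e.
contains at least one mapped and at least one unwritten extent. Again
a hedged userspace sketch with a hypothetical extent descriptor, not
the iomap API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the iomap extent states. */
enum extent_state { EXT_MAPPED, EXT_UNWRITTEN };

/*
 * Zeroing is only required for a mixed range.  A range covered
 * entirely by unwritten extents can be written and converted as
 * usual, and a fully mapped range is a pure overwrite.
 */
static bool range_needs_prezeroing(const enum extent_state *map, size_t n)
{
	bool have_mapped = false, have_unwritten = false;

	for (size_t i = 0; i < n; i++) {
		if (map[i] == EXT_UNWRITTEN)
			have_unwritten = true;
		else
			have_mapped = true;
	}
	return have_mapped && have_unwritten;
}
```

With this check, fallocating the whole file and then writing an
aligned block would not trigger any zeroing.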

> > It doesn't handle COW, it doesn't
> 
> Do we want to atomic write COW?

I don't see why not -- if there's a single COW mapping for the whole
untorn write, then the data gets written to the media in an untorn
fashion, and the remap is a single transaction.

> > handle holes, etc.
> 
> I did test hole, and it seemed to work. However I noticed that for a hole
> region we get IOMAP_UNWRITTEN type, like:

Oh right, that's ->iomap_begin allocating an unwritten extent into the
hole, because you're not allowed to specify a hole for the destination
of a write.  I withdraw that part of the comment.

> # xfs_bmap -vvp /root/mnt/file
> /root/mnt/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET        TOTAL FLAGS
>    0: [0..127]:        192..319          0 (192..319)         128 000000
>    1: [128..255]:      hole                                   128
>    2: [256..383]:      448..575          0 (448..575)         128 000000
>  FLAG Values:
>     0100000 Shared extent
>     0010000 Unwritten preallocated extent
>     0001000 Doesn't begin on stripe unit
>     0000100 Doesn't end   on stripe unit
>     0000010 Doesn't begin on stripe width
>     0000001 Doesn't end   on stripe width
> #
> #
> 
> For an atomic write of length 65536 @ offset 65536.
> 
> Any hint on how to create a iomap->type == IOMAP_HOLE?
> 
> > Also, can you make a version of blkdev_issue_zeroout that returns the
> > bio so the caller can issue them asynchronously instead of opencoding
> > the bio_alloc loop in iomap_dev_zero?
> 
> ok, fine
> 
> > 
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	dio->size += length;
> > > +
> > > +	return length;
> > > +}
> > > +
> > > +ssize_t
> > > +iomap_dio_zero_unwritten(struct kiocb *iocb, struct iov_iter *iter,
> > > +		const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
> > > +{
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	struct iomap_dio *dio;
> > > +	ssize_t ret;
> > > +	struct iomap_iter iomi = {
> > > +		.inode		= inode,
> > > +		.pos		= iocb->ki_pos,
> > > +		.len		= iov_iter_count(iter),
> > > +		.flags		= IOMAP_WRITE,
> > 
> > IOMAP_WRITE | IOMAP_DIRECT, no?
> 
> yes, I'll fix that.
> 
> And I should also set IOMAP_DIO_WRITE for dio->flags.

<nod>

--D

> Thanks,
> John
> 


* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-12 20:40       ` Darrick J. Wong
@ 2024-12-13 10:43         ` John Garry
  2024-12-13 14:47         ` Christoph Hellwig
  1 sibling, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-13 10:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen


>> However, I still think that we should be able to atomic write mixed extents,
>> even though it is a pain to implement. To that end, I could be convinced
>> again that we don't require it...
> 
> Well... if you /did/ add a few entries to include/uapi/linux/fs.h for
> ways that an untorn write can fail, then we could define the programming
> interface as so:
> 
> "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
> to force all the mappings to pure overwrites."
> 
> ...since there have been a few people who have asked about that ability
> so that they can write+fdatasync without so much overhead from file
> metadata updates.

ok, I see.

All this does seem more complicated in terms of implementation and user 
experience than what I have in this series. But if you think that there 
is value in FALLOC_FL_MAKE_OVERWRITE for other scenarios, then maybe it 
could be good.

> 
>>>
>>> Instead here we are adding a bunch of complexity, and not even all that
>>> well:
>>>
>>>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>>>> ---
>>>>    fs/iomap/direct-io.c  | 76 +++++++++++++++++++++++++++++++++++++++++++
>>>>    include/linux/iomap.h |  3 ++
>>>>    2 files changed, 79 insertions(+)
>>>>
>>>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>>>> index 23fdad16e6a8..18c888f0c11f 100644
>>>> --- a/fs/iomap/direct-io.c
>>>> +++ b/fs/iomap/direct-io.c
>>>> @@ -805,6 +805,82 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(iomap_dio_rw);
>>>> +static loff_t
>>>> +iomap_dio_zero_unwritten_iter(struct iomap_iter *iter, struct iomap_dio *dio)
>>>> +{
>>>> +	const struct iomap *iomap = &iter->iomap;
>>>> +	loff_t length = iomap_length(iter);
>>>> +	loff_t pos = iter->pos;
>>>> +
>>>> +	if (iomap->type == IOMAP_UNWRITTEN) {
>>>> +		int ret;
>>>> +
>>>> +		dio->flags |= IOMAP_DIO_UNWRITTEN;
>>>> +		ret = iomap_dio_zero(iter, dio, pos, length);
>>>
>>> Shouldn't this be detecting the particular case that the mapping for the
>>> kiocb is in mixed state and only zeroing in that case?  This just
>>> targets every unwritten extent, even if the unwritten extent covered the
>>> entire range that is being written.
>>
>> Right, so I did touch on this in the final comment in patch 4/7 commit log.
>>
>> Why did I do it this way? I did not think that it made much difference, since
>> this zeroing would be generally a one-off and did not merit even more
>> complexity to implement.
> 
> The trouble is, if you fallocate the whole file and then write an
> aligned 64k block, this will write zeroes to the block, update the
> mapping, and only then issue the untorn write.  Sure that's a one time
> performance hit, but probably not a welcome one.

ok, I can try to improve on this. It might get considerably more 
complicated...

> 
>>> It doesn't handle COW, it doesn't
>>
>> Do we want to atomic write COW?
> 
> I don't see why not -- if there's a single COW mapping for the whole
> untorn write, then the data gets written to the media in an untorn
> fashion, and the remap is a single transaction.

I tested atomic write on COW and it works ok, but the behavior is odd to me.

If I attempt to atomic write a single block in shared extent, then we 
have this callchain: xfs_file_dio_write_atomic() -> 
iomap_dio_rw(IOMAP_DIO_OVERWRITE_ONLY) -> ... 
xfs_direct_write_iomap_begin() -> xfs_reflink_allocate_cow() and we 
alloc a new extent.

And so xfs_file_dio_write_atomic() -> 
iomap_dio_rw(IOMAP_DIO_OVERWRITE_ONLY) does not return -EAGAIN and we 
don't even attempt to zero.

I just wonder why IOMAP_DIO_OVERWRITE_ONLY is not honoured here, as 
xfs_reflink_allocate_cow() -> xfs_reflink_fill_cow_hole() -> 
xfs_bmap_write() can alloc new blocks.

I am not too concerned about atomic writing mixed extents which includes 
COW extents, as atomic writing mixed extents is based on "big alloc" and 
that does not enable reflink (yet).

> 
>>> handle holes, etc.
>>
>> I did test hole, and it seemed to work. However I noticed that for a hole
>> region we get IOMAP_UNWRITTEN type, like:
> 
> Oh right, that's ->iomap_begin allocating an unwritten extent into the
> hole, because you're not allowed to specify a hole for the destination
> of a write.  I withdraw that part of the comment.


> 

Thanks,
John


* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
                   ` (6 preceding siblings ...)
  2024-12-10 12:57 ` [PATCH v2 7/7] xfs: Update xfs_get_atomic_write_attr() for large atomic writes John Garry
@ 2024-12-13 14:38 ` Christoph Hellwig
  2024-12-13 17:15   ` John Garry
  7 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-13 14:38 UTC (permalink / raw)
  To: John Garry
  Cc: brauner, djwong, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
> Currently the atomic write unit min and max is fixed at the FS blocksize
> for xfs and ext4.
> 
> This series expands support to allow multiple FS blocks to be written
> atomically.

Can you explain the workload you're interested in a bit more? 

I'm still very scared of expanding use of the large allocation sizes.

IIRC you showed some numbers where increasing the FSB size to something
larger did not look good in your benchmarks, but I'd like to understand
why.  Do you have a link to these numbers just to refresh everyone's minds
as to why that wasn't a good idea?  Did that also include supporting atomic
writes in the sector size <= write size <= FS block size range, which
aren't currently supported, but very useful?



* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-12 20:40       ` Darrick J. Wong
  2024-12-13 10:43         ` John Garry
@ 2024-12-13 14:47         ` Christoph Hellwig
  2024-12-14  0:56           ` Darrick J. Wong
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-13 14:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, brauner, cem, dchinner, hch, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Thu, Dec 12, 2024 at 12:40:07PM -0800, Darrick J. Wong wrote:
> > However, I still think that we should be able to atomic write mixed extents,
> > even though it is a pain to implement. To that end, I could be convinced
> > again that we don't require it...
> 
> Well... if you /did/ add a few entries to include/uapi/linux/fs.h for
> ways that an untorn write can fail, then we could define the programming
> interface as so:
> 
> "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
> to force all the mappings to pure overwrites."

Ewwwwwwwwwwwwwwwwwwwww.

That's not a sane API in any way.

> ...since there have been a few people who have asked about that ability
> so that they can write+fdatasync without so much overhead from file
> metadata updates.

And all of them fundamentally misunderstood file system semantics and/or
used weird bypasses that are doomed to corrupt the file system sooner
or later.



* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-13 14:38 ` [PATCH v2 0/7] large atomic writes for xfs Christoph Hellwig
@ 2024-12-13 17:15   ` John Garry
  2024-12-13 17:22     ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-12-13 17:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, djwong, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On 13/12/2024 14:38, Christoph Hellwig wrote:
> On Tue, Dec 10, 2024 at 12:57:30PM +0000, John Garry wrote:
>> Currently the atomic write unit min and max is fixed at the FS blocksize
>> for xfs and ext4.
>>
>> This series expands support to allow multiple FS blocks to be written
>> atomically.
> 
> Can you explain the workload you're interested in a bit more?

Sure, so some background is that we are using atomic writes for innodb 
MySQL so that we can stop relying on the double-write buffer for crash 
protection. MySQL is using an internal 16K page size (so we want 16K 
atomic writes).

MySQL has what is known as a REDO log - see 
https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html

Essentially it means that for any data page we write, ahead of time we 
do a buffered 512B log update followed by a periodic fsync. I think that 
such a thing is common to many apps.
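The pattern described - small buffered appends with a periodic
durability point - can be reproduced with plain POSIX calls. This is
only an illustration of the I/O shape, not MySQL's actual redo log
code; the record and batch sizes here are made up:

```c
#include <assert.h>
#include <fcntl.h>	/* open() flags, used by callers */
#include <string.h>
#include <unistd.h>

#define REC_SIZE	512	/* one redo log record */
#define BATCH		8	/* records per fsync, illustrative */

/* Append n records of REC_SIZE bytes, fsyncing every BATCH records. */
static long append_redo_records(int fd, int n)
{
	char rec[REC_SIZE];
	long written = 0;

	memset(rec, 'r', sizeof(rec));
	for (int i = 0; i < n; i++) {
		ssize_t w = write(fd, rec, sizeof(rec));

		if (w != (ssize_t)sizeof(rec))
			return -1;
		written += w;
		if ((i + 1) % BATCH == 0)
			fsync(fd);	/* periodic durability point */
	}
	fsync(fd);			/* final flush */
	return written;
}
```

On a 16K-block filesystem each of those fsyncs dirties and writes back
a full 16K block for what may be a 512B update, which is the latency
problem discussed below in the thread.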

> 
> I'm still very scared of expanding use of the large allocation sizes.

Yes

> 
> IIRC you showed some numbers where increasing the FSB size to something
> larger did not look good in your benchmarks, but I'd like to understand
> why.  Do you have a link to these numbers just to refresh everyone's minds
> why that wasn't a good idea. 

I don't think that I can share numbers, but I will summarize the findings.

When we tried just using 16K FS blocksize, we found for low thread count 
testing that performance was poor - even worse than the baseline of 4K 
FS blocksize and double-write buffer. We put this down to high write 
blocksize and double-write buffer. We put this down to high write 
latency for REDO log. As you can imagine, mostly writing 16K for only a 
512B update is not efficient in terms of traffic generated and increased 
latency (versus 4K FS block size). At higher thread count, performance 
was better. We put that down to bigger log data portions to be written 
to REDO per FS block write.

For 4K FS blocksize and 16K atomic writes configs - supported via 
forcealign or RTvol - performance was generally good across the board. 
forcealign was a bit better.

We also tried a hybrid solution with 2x partitions - 1x partition with 
16K FS block size for data and 1x partition with 4K FS block size for 
REDO. Performance here was good also. Unfortunately, though, this config 
is not fit for production - that is because we have a requirement to do 
FS snapshot and that is not possible over 2x FS instances. We also did 
consider block device snapshot, but there is reluctance to try this also.

> Did that also include supporting atomic
> writes in the sector size <= write size <= FS block size range, which
> aren't currently supported, but very useful?

I have no use for that so far.

Thanks,
John



* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-13 17:15   ` John Garry
@ 2024-12-13 17:22     ` Christoph Hellwig
  2024-12-13 17:43       ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-13 17:22 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, brauner, djwong, cem, dchinner, ritesh.list,
	linux-xfs, linux-fsdevel, linux-kernel, martin.petersen

On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> Sure, so some background is that we are using atomic writes for innodb 
> MySQL so that we can stop relying on the double-write buffer for crash 
> protection. MySQL is using an internal 16K page size (so we want 16K atomic 
> writes).

Makes perfect sense so far.

>
> MySQL has what is known as a REDO log - see 
> https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
>
> Essentially it means that for any data page we write, ahead of time we do a 
> buffered 512B log update followed by a periodic fsync. I think that such a 
> thing is common to many apps.

So it's actually using buffered I/O for that and not direct I/O?

> When we tried just using 16K FS blocksize, we found for low thread count 
> testing that performance was poor - even worse baseline of 4K FS blocksize 
> and double-write buffer. We put this down to high write latency for REDO 
> log. As you can imagine, mostly writing 16K for only a 512B update is not 
> efficient in terms of traffic generated and increased latency (versus 4K FS 
> block size). At higher thread count, performance was better. We put that 
> down to bigger log data portions to be written to REDO per FS block write.

So if the redo log uses buffered I/O I can see how that would bloat writes.
But then again using buffered I/O for a REDO log seems pretty silly
to start with.



* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-13 17:22     ` Christoph Hellwig
@ 2024-12-13 17:43       ` John Garry
  2024-12-14  0:42         ` Darrick J. Wong
  2024-12-17  7:11         ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: John Garry @ 2024-12-13 17:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, djwong, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On 13/12/2024 17:22, Christoph Hellwig wrote:
> On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
>> Sure, so some background is that we are using atomic writes for innodb
>> MySQL so that we can stop relying on the double-write buffer for crash
>> protection. MySQL is using an internal 16K page size (so we want 16K atomic
>> writes).
> 
> Make perfect sense so far.
> 
>>
>> MySQL has what is known as a REDO log - see
>> https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
>>
>> Essentially it means that for any data page we write, ahead of time we do a
>> buffered 512B log update followed by a periodic fsync. I think that such a
>> thing is common to many apps.
> 
> So it's actually using buffered I/O for that and not direct I/O?

Right

>> When we tried just using 16K FS blocksize, we found for low thread count
>> testing that performance was poor - even worse baseline of 4K FS blocksize
>> and double-write buffer. We put this down to high write latency for REDO
>> log. As you can imagine, mostly writing 16K for only a 512B update is not
>> efficient in terms of traffic generated and increased latency (versus 4K FS
>> block size). At higher thread count, performance was better. We put that
>> down to bigger log data portions to be written to REDO per FS block write.
> 
> So if the redo log uses buffered I/O I can see how that would bloat writes.
> But then again using buffered I/O for a REDO log seems pretty silly
> to start with.
> 

Yeah, at the low end, it may make sense to do the 512B write via DIO. 
But OTOH sync'ing many redo log FS blocks at once at the high end can be 
more efficient.

From what I have heard, this was attempted before (using DIO) by some 
vendor, but did not come to much.

So it seems that we are stuck with this redo log limitation.

Let me know if you have any other ideas to avoid large atomic writes...

Cheers,
John



* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-13 17:43       ` John Garry
@ 2024-12-14  0:42         ` Darrick J. Wong
  2024-12-16  8:40           ` John Garry
  2024-12-17  7:11         ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-12-14  0:42 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, brauner, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
> On 13/12/2024 17:22, Christoph Hellwig wrote:
> > On Fri, Dec 13, 2024 at 05:15:55PM +0000, John Garry wrote:
> > > Sure, so some background is that we are using atomic writes for innodb
> > > MySQL so that we can stop relying on the double-write buffer for crash
> > > protection. MySQL is using an internal 16K page size (so we want 16K atomic
> > > writes).
> > 
> > Make perfect sense so far.
> > 
> > > 
> > > MySQL has what is known as a REDO log - see
> > > https://dev.mysql.com/doc/dev/mysql-server/9.0.1/PAGE_INNODB_REDO_LOG.html
> > > 
> > > Essentially it means that for any data page we write, ahead of time we do a
> > > buffered 512B log update followed by a periodic fsync. I think that such a
> > > thing is common to many apps.
> > 
> > So it's actually using buffered I/O for that and not direct I/O?
> 
> Right
> 
> > > When we tried just using 16K FS blocksize, we found for low thread count
> > > testing that performance was poor - even worse baseline of 4K FS blocksize
> > > and double-write buffer. We put this down to high write latency for REDO
> > > log. As you can imagine, mostly writing 16K for only a 512B update is not
> > > efficient in terms of traffic generated and increased latency (versus 4K FS
> > > block size). At higher thread count, performance was better. We put that
> > > down to bigger log data portions to be written to REDO per FS block write.
> > 
> > So if the redo log uses buffered I/O I can see how that would bloat writes.
> > But then again using buffered I/O for a REDO log seems pretty silly
> > to start with.
> > 
> 
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
> OTOH sync'ing many redo log FS blocks at once at the high end can be more
> efficient.
> 
> From what I have heard, this was attempted before (using DIO) by some
> vendor, but did not come to much.
> 
> So it seems that we are stuck with this redo log limitation.
> 
> Let me know if you have any other ideas to avoid large atomic writes...

From the description it sounds like the redo log consists of 512b blocks
that describe small changes to the 16k table file pages.  If they're
issuing 16k atomic writes to get each of those 512b redo log records to
disk it's no wonder that cranks up the overhead substantially.  Also,
replaying those tiny updates through the pagecache beats issuing a bunch
of tiny nonlocalized writes.

For the first case I don't know why they need atomic writes -- 512b redo
log records can't be torn because they're single-sector writes.  The
second case might be better done with exchange-range.
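The first point rests on the fact that a write wholly contained in a
single disk sector cannot be torn by the device. A trivial predicate
for that, assuming the caller already knows the logical sector size:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A write is naturally untorn when it fits inside one logical
 * sector: the device persists whole sectors, so no boundary can
 * be torn.  (This says nothing about multi-sector writes, which
 * need explicit untorn-write support from the stack.)
 */
static bool naturally_untorn(unsigned long long pos,
			     unsigned long long len,
			     unsigned int sector_size)
{
	if (len == 0)
		return false;
	return pos / sector_size == (pos + len - 1) / sector_size;
}
```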

--D

> Cheers,
> John
> 
> 


* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-13 14:47         ` Christoph Hellwig
@ 2024-12-14  0:56           ` Darrick J. Wong
  2024-12-17  7:08             ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-12-14  0:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, brauner, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Fri, Dec 13, 2024 at 03:47:40PM +0100, Christoph Hellwig wrote:
> On Thu, Dec 12, 2024 at 12:40:07PM -0800, Darrick J. Wong wrote:
> > > However, I still think that we should be able to atomic write mixed extents,
> > > even though it is a pain to implement. To that end, I could be convinced
> > > again that we don't require it...
> > 
> > Well... if you /did/ add a few entries to include/uapi/linux/fs.h for
> > ways that an untorn write can fail, then we could define the programming
> > interface as so:
> > 
> > "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
> > to force all the mappings to pure overwrites."
> 
> Ewwwwwwwwwwwwwwwwwwwww.
> 
> That's not a sane API in any way.

Oh I know, I'd much rather stick to the view that block untorn writes
are a means for programs that only ever do IO in large(ish) blocks to
take advantage of a hardware feature that also wants those large
blocks.  And only if the file mapping is in the correct state, and the
program is willing to *maintain* them in the correct state to get the
better performance.  I don't want xfs to grow code to write zeroes to
mapped blocks just so it can then write-untorn to the same blocks.

The gross part is that I think if you want to do untorn multi-fsblock
writes, then you need forcealign.  In turn, forcealign has to handle COW
of shared blocks.  willy and I looked through the changes I made to
support dirtying and writing back gangs of pages for rtreflink when the
rtextsize > 1, and didn't find anything insane in there.  Using that to
handle COWing forcealign file blocks should work, though things get
tricky once you add atomic untorn writes because we can't split bios.

Everything else I think should use exchange-range because it has so many
fewer limitations.

--D

> > ...since there have been a few people who have asked about that ability
> > so that they can write+fdatasync without so much overhead from file
> > metadata updates.
> 
> And all of them fundamentally misunderstood file system semantics and/or
> used weird bypasses that are dommed to corrupt the file system sooner
> or later.


* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-14  0:42         ` Darrick J. Wong
@ 2024-12-16  8:40           ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-16  8:40 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: brauner, cem, dchinner, ritesh.list, linux-xfs, linux-fsdevel,
	linux-kernel, martin.petersen


>>
>> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
>> OTOH sync'ing many redo log FS blocks at once at the high end can be more
>> efficient.
>>
>>  From what I have heard, this was attempted before (using DIO) by some
>> vendor, but did not come to much.
>>
>> So it seems that we are stuck with this redo log limitation.
>>
>> Let me know if you have any other ideas to avoid large atomic writes...
> 
>  From the description it sounds like the redo log consists of 512b blocks
> that describe small changes to the 16k table file pages.  If they're
> issuing 16k atomic writes to get each of those 512b redo log records to
> disk it's no wonder that cranks up the overhead substantially. 

They are not issuing the redo log atomically. They do 512B buffered 
writes and then periodically fsync.

> Also,
> replaying those tiny updates through the pagecache beats issuing a bunch
> of tiny nonlocalized writes.
> 
> For the first case I don't know why they need atomic writes -- 512b redo
> log records can't be torn because they're single-sector writes.  The
> second case might be better done with exchange-range.
> 

As for exchange-range, that would very much pre-date any MySQL port. 
Furthermore, I can't imagine that exchange-range support is portable to 
other FSes, which is probably quite important. Anyway, they are not 
issuing the redo log atomically, so I don't know if mentioning 
exchange-range is relevant.

Regardless of what MySQL is specifically doing here, there are going to 
be other users/applications which want to keep a 4K FS blocksize and do 
larger atomic writes.

Thanks,
John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-14  0:56           ` Darrick J. Wong
@ 2024-12-17  7:08             ` Christoph Hellwig
  2024-12-18 11:15               ` John Garry
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-17  7:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, John Garry, brauner, cem, dchinner,
	ritesh.list, linux-xfs, linux-fsdevel, linux-kernel,
	martin.petersen

On Fri, Dec 13, 2024 at 04:56:38PM -0800, Darrick J. Wong wrote:
> > > "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
> > > to force all the mappings to pure overwrites."
> > 
> > Ewwwwwwwwwwwwwwwwwwwww.
> > 
> > That's not a sane API in any way.
> 
> Oh I know, I'd much rather stick to the view that block untorn writes
> are a means for programs that only ever do IO in large(ish) blocks to
> take advantage of a hardware feature that also wants those large
> blocks.

I (vaguely) agree with that.

> And only if the file mapping is in the correct state, and the
> program is willing to *maintain* them in the correct state to get the
> better performance.

I kinda agree with that, but the "maintain" part is a bit hard as a
general rule of thumb, as file mappings can change behind the
application's back.  So building interfaces around the concept that
there are entirely stable mappings seems like a bad idea.

> I don't want xfs to grow code to write zeroes to
> mapped blocks just so it can then write-untorn to the same blocks.

Agreed.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-13 17:43       ` John Garry
  2024-12-14  0:42         ` Darrick J. Wong
@ 2024-12-17  7:11         ` Christoph Hellwig
  2024-12-17  8:23           ` John Garry
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-12-17  7:11 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, brauner, djwong, cem, dchinner, ritesh.list,
	linux-xfs, linux-fsdevel, linux-kernel, martin.petersen

On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
>> So if the redo log uses buffered I/O I can see how that would bloat writes.
>> But then again using buffered I/O for a REDO log seems pretty silly
>> to start with.
>>
>
> Yeah, at the low end, it may make sense to do the 512B write via DIO. But 
> OTOH sync'ing many redo log FS blocks at once at the high end can be more 
> efficient.
>
> From what I have heard, this was attempted before (using DIO) by some 
> vendor, but did not come to much.

I can't see how buffered I/O will be faster than an optimized direct I/O
implementation.  Then again, compared to very dumb dio code that doesn't
replace the caching in the page cache, I can easily see how dio would be
much worse.  But given that people care about optimizing this workload
enough to look into changes all over the kernel I/O stack, I would
expect that touching the code that writes the redo log should not be
out of the picture.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 0/7] large atomic writes for xfs
  2024-12-17  7:11         ` Christoph Hellwig
@ 2024-12-17  8:23           ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2024-12-17  8:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: brauner, djwong, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On 17/12/2024 07:11, Christoph Hellwig wrote:
> On Fri, Dec 13, 2024 at 05:43:09PM +0000, John Garry wrote:
>>> So if the redo log uses buffered I/O I can see how that would bloat writes.
>>> But then again using buffered I/O for a REDO log seems pretty silly
>>> to start with.
>>>
>>
>> Yeah, at the low end, it may make sense to do the 512B write via DIO. But
>> OTOH sync'ing many redo log FS blocks at once at the high end can be more
>> efficient.
>>
>>  From what I have heard, this was attempted before (using DIO) by some
>> vendor, but did not come to much.
> 
> I can't see how buffered I/O will be faster than an optimized direct I/O
> implementation.  Then again, compared to very dumb dio code that doesn't
> replace the caching in the page cache, I can easily see how dio would be
> much worse.  But given that people care about optimizing this workload
> enough to look into changes all over the kernel I/O stack, I would
> expect that touching the code that writes the redo log should not be
> out of the picture.

For sure, and I get the impression that, in general, effort is put into 
optimising this redo log. I will admit that I don't know much about the 
redo log code, but I can go back with the feedback that the redo log 
should be optimised for switching to the larger FS blocksize. But that 
may take a long time and be fruitless.

And even if something is done for this particular case, other scenarios 
are still going to want large atomics but keep the 4K FS block size.

Apart from all of that, I get it that you don't want to grow the big 
alloc code, but is this series really doing that? Or the smaller v1?

Thanks,
John


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-17  7:08             ` Christoph Hellwig
@ 2024-12-18 11:15               ` John Garry
  2025-01-08  1:26                 ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: John Garry @ 2024-12-18 11:15 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: brauner, cem, dchinner, ritesh.list, linux-xfs, linux-fsdevel,
	linux-kernel, martin.petersen

On 17/12/2024 07:08, Christoph Hellwig wrote:
> On Fri, Dec 13, 2024 at 04:56:38PM -0800, Darrick J. Wong wrote:
>>>> "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
>>>> to force all the mappings to pure overwrites."
>>>
>>> Ewwwwwwwwwwwwwwwwwwwww.
>>>
>>> That's not a sane API in any way.
>>
>> Oh I know, I'd much rather stick to the view that block untorn writes
>> are a means for programs that only ever do IO in large(ish) blocks to
>> take advantage of a hardware feature that also wants those large
>> blocks.
> 
> I (vaguely) agree with that.
> 
>> And only if the file mapping is in the correct state, and the
>> program is willing to *maintain* them in the correct state to get the
>> better performance.
> 
> I kinda agree with that, but the "maintain" part is a bit hard as a
> general rule of thumb, as file mappings can change behind the
> application's back.  So building interfaces around the concept that
> there are entirely stable mappings seems like a bad idea.

I tend to agree.

> 
>> I don't want xfs to grow code to write zeroes to
>> mapped blocks just so it can then write-untorn to the same blocks.
> 
> Agreed.
> 

So if we want to allow large writes over mixed extents, how to handle?

Note that some time ago we also discussed that we don't want to have a 
single bio covering mixed extents as we cannot atomically convert all 
unwritten extents to mapped.

Thanks,
John




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2024-12-18 11:15               ` John Garry
@ 2025-01-08  1:26                 ` Darrick J. Wong
  2025-01-08 11:39                   ` John Garry
  2025-01-09  7:54                   ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Darrick J. Wong @ 2025-01-08  1:26 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, brauner, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Wed, Dec 18, 2024 at 11:15:42AM +0000, John Garry wrote:
> On 17/12/2024 07:08, Christoph Hellwig wrote:
> > On Fri, Dec 13, 2024 at 04:56:38PM -0800, Darrick J. Wong wrote:
> > > > > "If you receive -EBADMAP, then call fallocate(FALLOC_FL_MAKE_OVERWRITE)
> > > > > to force all the mappings to pure overwrites."
> > > > 
> > > > Ewwwwwwwwwwwwwwwwwwwww.
> > > > 
> > > > That's not a sane API in any way.
> > > 
> > > Oh I know, I'd much rather stick to the view that block untorn writes
> > > are a means for programs that only ever do IO in large(ish) blocks to
> > > take advantage of a hardware feature that also wants those large
> > > blocks.
> > 
> > I (vaguely) agree with that.
> > 
> > > And only if the file mapping is in the correct state, and the
> > > program is willing to *maintain* them in the correct state to get the
> > > better performance.
> > 
> > I kinda agree with that, but the "maintain" part is a bit hard as a
> > general rule of thumb, as file mappings can change behind the
> > application's back.  So building interfaces around the concept that
> > there are entirely stable mappings seems like a bad idea.
> 
> I tend to agree.

As long as it's a general rule that file mappings can change even after
whatever prep work an application tries to do, we're never going to have
an easy time enabling any of these fancy direct-to-storage tricks like
cpu loads and stores to pmem, or this block-untorn writes stuff.

> > 
> > > I don't want xfs to grow code to write zeroes to
> > > mapped blocks just so it can then write-untorn to the same blocks.
> > 
> > Agreed.
> > 
> 
> So if we want to allow large writes over mixed extents, how to handle?
> 
> Note that some time ago we also discussed that we don't want to have a
> single bio covering mixed extents as we cannot atomically convert all
> unwritten extents to mapped.

From https://lore.kernel.org/linux-xfs/Z3wbqlfoZjisbe1x@infradead.org/ :

"I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
document very vigorously that it exists to facilitate pure overwrites
(specifically that it returns EOPNOTSUPP for always-cow files), and not
add more ioctls."

If we added this new fallocate mode to set up written mappings, would it
be enough to write in the programming manuals that applications should
use it to prepare a file for block-untorn writes?  Perhaps we should
change the errno code to EMEDIUMTYPE for the mixed mappings case.

Alternately, maybe we /should/ let programs open a lease-fd on a file
range, do their untorn writes through the lease fd, and if another
thread does something to break the lease, then the lease fd returns EIO
until you close it.

<shrug> any thoughts?

--D

> Thanks,
> John
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2025-01-08  1:26                 ` Darrick J. Wong
@ 2025-01-08 11:39                   ` John Garry
  2025-01-08 17:42                     ` Darrick J. Wong
  2025-01-09  7:54                   ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: John Garry @ 2025-01-08 11:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, brauner, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On 08/01/2025 01:26, Darrick J. Wong wrote:
>>> I (vaguely) agree with that.
>>>
>>>> And only if the file mapping is in the correct state, and the
>>>> program is willing to *maintain* them in the correct state to get the
>>>> better performance.
>>> I kinda agree with that, but the "maintain" part is a bit hard as a
>>> general rule of thumb, as file mappings can change behind the
>>> application's back.  So building interfaces around the concept that
>>> there are entirely stable mappings seems like a bad idea.
>> I tend to agree.
> As long as it's a general rule that file mappings can change even after
> whatever prep work an application tries to do, we're never going to have
> an easy time enabling any of these fancy direct-to-storage tricks like
> cpu loads and stores to pmem, or this block-untorn writes stuff.
> 
>>>> I don't want xfs to grow code to write zeroes to
>>>> mapped blocks just so it can then write-untorn to the same blocks.
>>> Agreed.

Any other ideas on how to achieve this then?

There was the proposal to create a single bio covering mixed mappings, 
but then we had the issue that all the mappings cannot be atomically 
converted. I am not sure if this is really such an issue. I know that 
RWF_ATOMIC means all or nothing, but partially converted extents (from 
an atomic write) are a sort of grey area, as the original unwritten 
extents had nothing in them in the first place.

>>>
>> So if we want to allow large writes over mixed extents, how to handle?
>>
>> Note that some time ago we also discussed that we don't want to have a
>> single bio covering mixed extents as we cannot atomically convert all
>> unwritten extents to mapped.
> From https://lore.kernel.org/linux-xfs/Z3wbqlfoZjisbe1x@infradead.org/ :
> 
> "I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
> document very vigorously that it exists to facilitate pure overwrites
> (specifically that it returns EOPNOTSUPP for always-cow files), and not
> add more ioctls."
> 
> If we added this new fallocate mode to set up written mappings, would it
> be enough to write in the programming manuals that applications should
> use it to prepare a file for block-untorn writes? 

Sure, that API extension could be useful in the case that we conclude 
that we don't permit atomic writes over mixed mappings.

> Perhaps we should
> change the errno code to EMEDIUMTYPE for the mixed mappings case.
> 
> Alternately, maybe we /should/ let programs open a lease-fd on a file
> range, do their untorn writes through the lease fd, and if another
> thread does something to break the lease, then the lease fd returns EIO
> until you close it.

So does this mean applications own specific ranges in files for 
exclusive atomic writes? Wouldn't that break what we already support today?

Cheers,
John


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2025-01-08 11:39                   ` John Garry
@ 2025-01-08 17:42                     ` Darrick J. Wong
  0 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2025-01-08 17:42 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, brauner, cem, dchinner, ritesh.list, linux-xfs,
	linux-fsdevel, linux-kernel, martin.petersen

On Wed, Jan 08, 2025 at 11:39:35AM +0000, John Garry wrote:
> On 08/01/2025 01:26, Darrick J. Wong wrote:
> > > > I (vaguely) agree with that.
> > > > 
> > > > > And only if the file mapping is in the correct state, and the
> > > > > program is willing to *maintain* them in the correct state to get the
> > > > > better performance.
> > > > I kinda agree with that, but the "maintain" part is a bit hard as a
> > > > general rule of thumb, as file mappings can change behind the
> > > > application's back.  So building interfaces around the concept that
> > > > there are entirely stable mappings seems like a bad idea.
> > > I tend to agree.
> > As long as it's a general rule that file mappings can change even after
> > whatever prep work an application tries to do, we're never going to have
> > an easy time enabling any of these fancy direct-to-storage tricks like
> > cpu loads and stores to pmem, or this block-untorn writes stuff.
> > 
> > > > > I don't want xfs to grow code to write zeroes to
> > > > > mapped blocks just so it can then write-untorn to the same blocks.
> > > > Agreed.
> 
> Any other ideas on how to achieve this then?
> 
> There was the proposal to create a single bio covering mixed mappings, but
> then we had the issue that all the mappings cannot be atomically converted.
> I am not sure if this is really such an issue. I know that RWF_ATOMIC means
> all or nothing, but partially converted extents (from an atomic write) are a
> sort of grey area, as the original unwritten extents had nothing in them in
> the first place.

The long way -- introducing a file remap log intent item to guarantee
that the ioend processing completes no matter how mixed the mapping
might be.

> > > > 
> > > So if we want to allow large writes over mixed extents, how to handle?
> > > 
> > > Note that some time ago we also discussed that we don't want to have a
> > > single bio covering mixed extents as we cannot atomically convert all
> > > unwritten extents to mapped.
> > From https://lore.kernel.org/linux-xfs/Z3wbqlfoZjisbe1x@infradead.org/ :
> > 
> > "I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
> > document very vigorously that it exists to facilitate pure overwrites
> > (specifically that it returns EOPNOTSUPP for always-cow files), and not
> > add more ioctls."
> > 
> > If we added this new fallocate mode to set up written mappings, would it
> > be enough to write in the programming manuals that applications should
> > use it to prepare a file for block-untorn writes?
> 
> Sure, that API extension could be useful in the case that we conclude that
> we don't permit atomic writes over mixed mappings.
> 
> > Perhaps we should
> > change the errno code to EMEDIUMTYPE for the mixed mappings case.
> > 
> > Alternately, maybe we /should/ let programs open a lease-fd on a file
> > range, do their untorn writes through the lease fd, and if another
> > thread does something to break the lease, then the lease fd returns EIO
> > until you close it.
> 
> So does this mean applications own specific ranges in files for exclusive
> atomic writes? Wouldn't that break what we already support today?

The application would own a lease on a specific range, but it could pass
that fd around.  Also you wouldn't need a lease for a single-fsblock
untorn write.

--D

> Cheers,
> John
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2025-01-08  1:26                 ` Darrick J. Wong
  2025-01-08 11:39                   ` John Garry
@ 2025-01-09  7:54                   ` Christoph Hellwig
  2025-01-10 11:59                     ` John Garry
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2025-01-09  7:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, Christoph Hellwig, brauner, cem, dchinner,
	ritesh.list, linux-xfs, linux-fsdevel, linux-kernel,
	martin.petersen

On Tue, Jan 07, 2025 at 05:26:36PM -0800, Darrick J. Wong wrote:
> "I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
> document very vigorously that it exists to facilitate pure overwrites
> (specifically that it returns EOPNOTSUPP for always-cow files), and not
> add more ioctls."
> 
> If we added this new fallocate mode to set up written mappings, would it
> be enough to write in the programming manuals that applications should
> use it to prepare a file for block-untorn writes?  Perhaps we should
> change the errno code to EMEDIUMTYPE for the mixed mappings case.
> 
> Alternately, maybe we /should/ let programs open a lease-fd on a file
> range, do their untorn writes through the lease fd, and if another
> thread does something to break the lease, then the lease fd returns EIO
> until you close it.

This still violates the "no unexpected errors" paradigm.  The whole
FALLOC_FL_WRITE_ZEROES (I hate that name btw) model would only work
if we had a software fallback that makes the operations slower but
still works in case of an unexpected change to the extent mapping.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support
  2025-01-09  7:54                   ` Christoph Hellwig
@ 2025-01-10 11:59                     ` John Garry
  0 siblings, 0 replies; 29+ messages in thread
From: John Garry @ 2025-01-10 11:59 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: brauner, cem, dchinner, ritesh.list, linux-xfs, linux-fsdevel,
	linux-kernel, martin.petersen

On 09/01/2025 07:54, Christoph Hellwig wrote:
> On Tue, Jan 07, 2025 at 05:26:36PM -0800, Darrick J. Wong wrote:
>> "I think we should wire it up as a new FALLOC_FL_WRITE_ZEROES mode,
>> document very vigorously that it exists to facilitate pure overwrites
>> (specifically that it returns EOPNOTSUPP for always-cow files), and not
>> add more ioctls."
>>
>> If we added this new fallocate mode to set up written mappings, would it
>> be enough to write in the programming manuals that applications should
>> use it to prepare a file for block-untorn writes?  Perhaps we should
>> change the errno code to EMEDIUMTYPE for the mixed mappings case.
>>
>> Alternately, maybe we /should/ let programs open a lease-fd on a file
>> range, do their untorn writes through the lease fd, and if another
>> thread does something to break the lease, then the lease fd returns EIO
>> until you close it.
> This still violates the "no unexpected errors" paradigm.  The whole
> FALLOC_FL_WRITE_ZEROES (I hate that name btw) model would only work
> if we had a software fallback that makes the operations slower but
> still works in case of an unexpected change to the extent mapping.

Christoph, do you have any other suggestions on what that software 
fallback would look like? I thought that what I had in this series was a 
decent approach, i.e. auto-zeroing unwritten extents, but it seems not.

cheers

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-01-10 11:59 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-10 12:57 [PATCH v2 0/7] large atomic writes for xfs John Garry
2024-12-10 12:57 ` [PATCH v2 1/7] iomap: Increase iomap_dio_zero() size limit John Garry
2024-12-10 12:57 ` [PATCH v2 2/7] iomap: Add zero unwritten mappings dio support John Garry
2024-12-11 23:47   ` Darrick J. Wong
2024-12-12 10:40     ` John Garry
2024-12-12 20:40       ` Darrick J. Wong
2024-12-13 10:43         ` John Garry
2024-12-13 14:47         ` Christoph Hellwig
2024-12-14  0:56           ` Darrick J. Wong
2024-12-17  7:08             ` Christoph Hellwig
2024-12-18 11:15               ` John Garry
2025-01-08  1:26                 ` Darrick J. Wong
2025-01-08 11:39                   ` John Garry
2025-01-08 17:42                     ` Darrick J. Wong
2025-01-09  7:54                   ` Christoph Hellwig
2025-01-10 11:59                     ` John Garry
2024-12-10 12:57 ` [PATCH v2 3/7] iomap: Lift blocksize restriction on atomic writes John Garry
2024-12-10 12:57 ` [PATCH v2 4/7] xfs: Add extent zeroing support for " John Garry
2024-12-10 12:57 ` [PATCH v2 5/7] xfs: Switch atomic write size check in xfs_file_write_iter() John Garry
2024-12-10 12:57 ` [PATCH v2 6/7] xfs: Add RT atomic write unit max to xfs_mount John Garry
2024-12-10 12:57 ` [PATCH v2 7/7] xfs: Update xfs_get_atomic_write_attr() for large atomic writes John Garry
2024-12-13 14:38 ` [PATCH v2 0/7] large atomic writes for xfs Christoph Hellwig
2024-12-13 17:15   ` John Garry
2024-12-13 17:22     ` Christoph Hellwig
2024-12-13 17:43       ` John Garry
2024-12-14  0:42         ` Darrick J. Wong
2024-12-16  8:40           ` John Garry
2024-12-17  7:11         ` Christoph Hellwig
2024-12-17  8:23           ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox